Chopping a string into a vector of fixed width character elements
I have an object containing a text string:
x <- "xxyyxyxy"
and I want to split that into a vector with each element containing two letters:
[1] "xx" "yy" "xy" "xy"
It seems like strsplit should be my ticket, but since I have no regular expression foo, I can't figure out how to make this function chop the string up into chunks the way I want it. How should I do this?
Using substring is the best approach:
But here's a solution with plyr:
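A minimal sketch of the substring approach recommended above, in base R only. The chunk width n and the variable names are illustrative choices, not taken from the answer.

```r
# Split a single string into chunks of width n with vectorised substring().
x <- "xxyyxyxy"
n <- 2L                                # chunk width (illustrative)
starts <- seq(1L, nchar(x), by = n)    # start positions: 1, 3, 5, 7
chunks <- substring(x, starts, starts + n - 1L)
print(chunks)
# [1] "xx" "yy" "xy" "xy"
```

Because substring() is vectorised over its start and end arguments, no explicit loop is needed.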
Here is a fast solution that splits the string into characters, then pastes together the even elements and the odd elements.
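The idea described above might look like the following sketch. It assumes the string has an even number of characters; variable names are illustrative.

```r
x <- "xxyyxyxy"
chars <- strsplit(x, "")[[1]]    # one element per character
odd   <- chars[c(TRUE, FALSE)]   # characters 1, 3, 5, ... (logical index is recycled)
even  <- chars[c(FALSE, TRUE)]   # characters 2, 4, 6, ...
pairs <- paste0(odd, even)       # glue each odd character to the one after it
print(pairs)
# [1] "xx" "yy" "xy" "xy"
```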
Benchmark Setup:
Benchmark 1:
Benchmark 2:
Now, with bigger data.
How about
Basically, add a separator (here " ") and then use strsplit.
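A sketch of the separator idea, assuming the input contains no spaces of its own. The pattern "(.{2})" is one possible way to insert the separator, not necessarily the one the answer used.

```r
x <- "xxyyxyxy"
# Append a space after every run of two characters: "xx yy xy xy "
spaced <- gsub("(.{2})", "\\1 ", x)
# Split on the space; strsplit() returns a list, so take its first element.
chunks <- strsplit(spaced, " ")[[1]]
print(chunks)
# [1] "xx" "yy" "xy" "xy"
```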
strsplit is going to be problematic. Look at a regexp like this:
It will split at the right points, but nothing is left, because the matched characters are treated as the delimiter and discarded.
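To illustrate the problem, here is a sketch with a pattern that matches every pair of letters. The exact pattern is an assumption chosen for the demonstration.

```r
x <- "xxyyxyxy"
# Every pair of letters matches the pattern, so each pair is consumed as a
# delimiter; only the empty strings in front of each match are returned.
res <- strsplit(x, "[[:alpha:]]{2}")[[1]]
print(res)
# [1] "" "" "" ""
```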
You could use substring & friends
Here's one way, but not using regexen:
Careful with substring: if the string length is not a multiple of the requested chunk length n, you will need a +(n-1) in the upper bound of the second sequence:
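A small sketch of that caveat, using a 7-character string and n = 2 as an illustrative case:

```r
x <- "xxyyxyx"                           # 7 characters, not a multiple of n
n <- 2L
first <- seq(1L, nchar(x), by = n)       # 1 3 5 7

# Without the correction, the end positions stop at 6, so there are only
# three of them for four start positions and the last chunk is lost.
naive <- substring(x, first, seq(n, nchar(x), by = n))

# With +(n-1) in the upper bound, the end positions are 2 4 6 8.
fixed <- substring(x, first, seq(n, nchar(x) + n - 1L, by = n))
print(fixed)
# [1] "xx" "yy" "xy" "x"
```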
Total hack, JD, but it gets it done
A helper function:
Using C++ one can be even faster. Comparing with GSee's version:
Well, I used the following pseudo-code to fulfill this task:
In code, I did
This returns a list with the split vector inside, though, not a vector.
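The list result comes from strsplit() itself, which returns one character vector per input string. A sketch of how to get back to a plain vector, assuming a space is used as the inserted separator:

```r
x <- c("xxyyxyxy", "aabbcc")
# Insert a separator after every two characters, then split on it.
parts <- strsplit(gsub("(.{2})", "\\1 ", x), " ")

is.list(parts)     # TRUE: one list element per input string
first <- parts[[1]]   # chunks of the first string only
flat  <- unlist(parts) # all chunks flattened into one character vector
print(first)
print(flat)
```

For a single input string, `[[1]]` and `unlist()` give the same vector.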
From my testing, the code below is faster than the previous methods that were benchmarked. stri_sub is pretty fast, and seq.int is better than seq. It's also easy to change the size of the strings by changing all the 2Ls to something else.
I didn't notice a difference when string chunks were 2 characters long, but for bigger chunks this is slightly better.
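A sketch of what the stri_sub and seq.int combination could look like. It assumes the stringi package is installed (hence the guard); the variable names are illustrative.

```r
x <- "xxyyxyxy"
n <- 2L   # chunk width; change this single value to change the chunk size

# Guarded so the snippet is a no-op when stringi is not available.
if (requireNamespace("stringi", quietly = TRUE)) {
  starts <- seq.int(1L, nchar(x), by = n)                    # 1 3 5 7
  chunks <- stringi::stri_sub(x, from = starts, length = n)  # n characters from each start
  print(chunks)
}
```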
I set out looking for a vectorised solution to this, in order to avoid lapply()ing one of the single-string solutions across long vectors. Failing to find an existing solution, I somehow fell down a rabbit hole of
painstakingly writing one in C. It ended up hilariously complicated compared
to the many one-line R solutions shown here (no thanks to me deciding to also
want to handle Unicode strings to match the R versions), but I thought I’d
share the result, in case it somehow someday helps somebody. Here’s what
eventually became of that:
I then put this monstrosity into a file called str_chunk.c, and compiled with R CMD SHLIB str_chunk.c. To try it out, we need some set-up on the R side:
So what we’ve achieved with the C version is to take a vector of inputs and return a list:
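For comparison, the same vector-in, list-out interface can be sketched in plain R with lapply(), which is the slower baseline this answer set out to avoid. The function name str_chunk_r is hypothetical and this is not the C implementation.

```r
# Hypothetical pure-R stand-in: character vector in, list of chunk vectors out.
str_chunk_r <- function(x, n = 2L) {
  lapply(x, function(s) {
    if (is.na(s)) return(NA_character_)     # keep missing values as NA
    if (!nzchar(s)) return(character(0))    # empty string has no chunks
    starts <- seq.int(1L, nchar(s), by = n)
    substring(s, starts, starts + n - 1L)
  })
}

res <- str_chunk_r(c("xxyyxyxy", "abcde"))
print(res)
```

Each list element holds the chunks of the corresponding input string; a string whose length is not a multiple of n ends with a shorter final chunk.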
Now off we go with benchmarking.
We start off strong with a 200x improvement for a long(ish) vector of
short strings:
… which then shrinks to a distinctly less impressive 3x improvement for
large strings.
So, was it worth it? Well, absolutely not, considering how long it took to
actually get it working properly. But if this were in a package, it would have
saved quite a lot of time in my use-case (short strings, long vectors).
Here is one option using stringi::stri_sub(). Try: