Quickly counting characters in a character vector
I have a very long vector of single characters, e.g. somechars <- c("A","B","C","A"...), whose length is somewhere in the millions.
What is the fastest way to count the total occurrences of, say, "A" and "B" in this vector? I have tried grep and lapply, but both take very long to execute.
My current solution is:
tmp<-table(somechars)
sum(tmp["A"],tmp["B"])
But this still takes a while to compute. Is there a faster way to do this, or a package that already does it faster? I've looked into the stringr package, but it uses a simple grep.
Comments (4)
I thought that this would be fastest...
And, it is faster than...
But not faster than...
But this depends on how many comparisons you make, which brings me back to my first guess: once you want to sum more than 2 letters, the %in% version is the fastest.
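The trade-off described above can be tried with a rough sketch like the following. The test vector, the `targets` set, and the use of `system.time` are assumptions for illustration; they are not the original answer's benchmark, and timings will differ by machine and R version.

```r
# Hypothetical test data: one million single-character strings.
set.seed(42)
x <- sample(LETTERS, 1e6, replace = TRUE)
targets <- c("A", "B", "C", "D")  # illustrative set of letters to count

# One pass over x using set membership.
count_in <- sum(x %in% targets)

# One equality pass per letter, then add up the per-letter counts.
count_eq <- sum(vapply(targets, function(ch) sum(x == ch), integer(1)))

# Both approaches must agree on the total.
stopifnot(count_in == count_eq)

# Rough timings only; no particular ordering is guaranteed.
system.time(sum(x %in% targets))
system.time(sum(vapply(targets, function(ch) sum(x == ch), integer(1))))
```

The equality version scans `x` once per letter, so its cost grows with the number of letters, whereas `%in%` makes a single membership pass.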
Regular expressions are expensive. You can get the result in your question with exact comparison.
UPDATED to include timings from the OP and other answers. Also included a test larger than the 2-character case.
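As a minimal sketch of the point about regular expressions (the data and the pattern `^[AB]$` are assumptions chosen for illustration, not the timings this answer originally posted), a regex count and an exact-comparison count can be checked against each other like this:

```r
# Hypothetical test data.
set.seed(1)
x <- sample(c("A", "B", "C", "D"), 1e6, replace = TRUE)

# Regex-based count: every element goes through the pattern matcher.
n_regex <- sum(grepl("^[AB]$", x))

# Exact comparison: two vectorised equality tests, no pattern matching.
n_exact <- sum(x == "A") + sum(x == "B")

# Both must give the same count.
stopifnot(n_regex == n_exact)

# Rough timings; results vary by machine and R version.
system.time(sum(grepl("^[AB]$", x)))
system.time(sum(x == "A") + sum(x == "B"))
```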
According to my expectations, sum(x=='A') + sum(x=='B') is the fastest. Unlike the other solutions proposed here, it doesn't perform any unnecessary operations, such as combining intermediate results with c(..) or |. It does just the counting, the only thing that is really needed!
R 2.13.1:
But the really interesting comparison is between sum(x %in% c('A','B')) and the first, fastest solution. In R 2.13.1 they take the same time; in R 2.11.1, the %in% version is much slower (the same result John reported)! So I'd recommend using the first solution: sum(x=='A')+sum(x=='B').
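One practical caveat when using the sum-of-comparisons approach on real data: `sum(x == "A")` returns NA if `x` contains NA, and indexing a `table()` result by a letter that never occurs also gives NA. A small wrapper can guard against both. `count_chars` below is a hypothetical helper written for this illustration, not a function from any package.

```r
# count_chars is a hypothetical helper, not part of any package.
count_chars <- function(x, targets, na.rm = TRUE) {
  total <- 0L
  for (ch in targets) {
    # na.rm = TRUE skips NA elements of x, which would otherwise
    # make sum(x == ch) return NA.
    total <- total + sum(x == ch, na.rm = na.rm)
  }
  total
}

somechars <- c("A", "B", "C", "A", NA, "B", "A")

count_chars(somechars, c("A", "B"))  # 5: three "A" plus two "B"
count_chars(somechars, c("A", "Z"))  # 3: "Z" never occurs and adds 0

# By contrast, the table-based lookup yields NA for an absent letter.
tmp <- table(somechars)
sum(tmp["A"], tmp["Z"])              # NA
```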
My favorite tool, tho' I didn't time-check it against Tomas' solutions, is
It's certainly the simplest solution :-) .