Quickly counting characters in a character vector

Posted on 2024-12-11 04:59:49

I have a very long vector of single characters, e.g. somechars<-c("A","B","C","A"...) (its length is somewhere in the millions).

What is the fastest way to count the total occurrences of, say, "A" and "B" in this vector? I have tried grep and lapply, but both take a long time to execute.

My current solution is:

tmp<-table(somechars)
sum(tmp["A"],tmp["B"])

But this still takes a while to compute. Is there a faster way to do this? Or is there a package that already does this faster? I've looked into the stringr package, but it uses a simple grep.

一世旳自豪 2024-12-18 04:59:49

I thought that this would be fastest...

sum(somechars %in% c('A', 'B'))

And, it is faster than...

sum(c(somechars=="A",somechars=="B"))

But not faster than...

sum(somechars=="A"|somechars=="B")

But this depends on how many comparisons you make, which brings me back to my first guess: once you want to count more than 2 letters, the %in% version is the fastest.
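As a minimal sketch of the `%in%` approach for an arbitrary set of target letters, the one-liner can be wrapped in a small function. The name `count_in`, the vector size, and the seed are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: count the elements of `x` that match any value in `targets`.
count_in <- function(x, targets) {
  # %in% returns a logical vector; sum() counts the TRUE values.
  sum(x %in% targets)
}

set.seed(1)  # arbitrary seed, only for reproducibility
somechars <- sample(LETTERS, 1e5, TRUE)

n_ab   <- count_in(somechars, c("A", "B"))
n_abcd <- count_in(somechars, c("A", "B", "C", "D"))
```

One practical difference from the `==` variants: `%in%` never returns NA, whereas `==` propagates missing values, so `sum(x == "A")` yields NA when `x` contains NA unless `na.rm = TRUE` is passed.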


隔岸观火 2024-12-18 04:59:49

Regular expressions are expensive. You can get the result in your question with exact comparison.

> somechars <- sample(LETTERS, 5e6, TRUE)
> sum(c(somechars=="A",somechars=="B"))
[1] 385675
> system.time(sum(c(somechars=="A",somechars=="B")))
   user  system elapsed 
  0.416   0.072   0.487

UPDATED to include timings for the OP's approach and the other answers. Also included is a test with more than two characters.

> library(rbenchmark)
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B",somechars)),
+   table = sum(table(somechars)[c("A","B")]),
+   c = sum(c(somechars=="A",somechars=="B")),
+   OR = sum(somechars=="A"|somechars=="B"),
+   IN = sum(somechars %in% c("A","B")),
+   plus = sum(somechars=="A")+sum(somechars=="B") )
   test replications elapsed relative user.self sys.self user.child sys.child
6  plus            5   4.289 1.000000     3.836    0.436          0         0
3     c            5   4.991 1.163675     4.156    0.804          0         0
5    IN            5   5.480 1.277687     4.549    0.880          0         0
4    OR            5   5.574 1.299604     5.000    0.544          0         0
1  grep            5  16.426 3.829797    16.205    0.172          0         0
2 table            5  17.834 4.158079    12.793    4.884          0         0
> 
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B|C|D",somechars)),
+   table = sum(table(somechars)[c("A","B","C","D")]),
+   c = sum(c(somechars=="A",somechars=="B",
+             somechars=="C",somechars=="D")),
+   OR = sum(somechars=="A"|somechars=="B"|
+            somechars=="C"|somechars=="D"),
+   IN = sum(somechars %in% c("A","B","C","D")),
+   plus = sum(somechars=="A")+sum(somechars=="B")+
+          sum(somechars=="C")+sum(somechars=="D") )
   test replications elapsed relative user.self sys.self user.child sys.child
5    IN            5   5.513 1.000000     4.464    1.004          0         0
6  plus            5   8.603 1.560493     7.705    0.860          0         0
3     c            5  10.283 1.865228     8.648    1.560          0         0
4    OR            5  12.348 2.239797    10.849    1.464          0         0
2 table            5  17.960 3.257754    12.877    4.921          0         0
1  grep            5  21.692 3.934700    21.405    0.192          0         0
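Before comparing timings, it can help to confirm that the six expressions return the same count. The following is a rough sketch using base R only; the smaller vector size and the seed are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
somechars <- sample(LETTERS, 1e5, TRUE)

# Same six expressions as in the benchmark, evaluated once each.
counts <- c(
  grep  = sum(grepl("A|B", somechars)),
  table = sum(table(somechars)[c("A", "B")]),
  c     = sum(c(somechars == "A", somechars == "B")),
  OR    = sum(somechars == "A" | somechars == "B"),
  IN    = sum(somechars %in% c("A", "B")),
  plus  = sum(somechars == "A") + sum(somechars == "B")
)

# All six expressions should give the same total.
all_equal <- length(unique(counts)) == 1L
```

Note that `grepl("A|B", ...)` matches substrings, so it agrees with the exact comparisons here only because every element is a single character.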


心奴独伤 2024-12-18 04:59:49

As I expected, `sum(x=='A') + sum(x=='B')` is the fastest.

Unlike the other solutions proposed here, it performs no unnecessary operations such as combining the intermediate results with c(..) or |. It does just the counting, which is the only thing that is really needed!

R 2.13.1:

> x <- sample(letters, 1e7, TRUE)
> system.time(sum(x=='A') + sum(x=='B'))
   user  system elapsed 
   1.75    0.16    1.98 
> system.time(sum(c(x=='A', x=='B')))
   user  system elapsed 
   2.40    0.23    4.27 
> system.time(sum(x=='A' | x=='B'))
   user  system elapsed 
   2.25    0.19    2.54

What is really interesting is the comparison of sum(x %in% c('A','B')) with the first, fastest solution. In R 2.13.1 it takes the same time; in R 2.11.1 it is much slower (the same result John reported)! So I'd recommend the first solution: sum(x=='A')+sum(x=='B').
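The per-letter summation extends to any number of targets. A minimal sketch follows; `count_each` is a hypothetical helper name, and the test vector here is drawn from `LETTERS` (upper case) so that "A" and "B" actually occur, since `sample(letters, ...)` produces lower-case values only.

```r
# Hypothetical helper: one equality pass and one sum per target value.
count_each <- function(x, targets) {
  vapply(targets, function(t) sum(x == t), integer(1))
}

set.seed(7)  # arbitrary seed, only for reproducibility
x <- sample(LETTERS, 1e5, TRUE)

per_letter <- count_each(x, c("A", "B"))  # named integer vector, one entry per target
total <- sum(per_letter)
```

Keeping the per-letter counts separate costs nothing extra and leaves the individual totals available if they are needed later.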
