How to count the occurrences of a word/token across the rows of a tibble

Posted on 2025-01-20 06:46:44

Hello, I have a tibble produced by a pipe through tidytext::unnest_tokens() and count(category, word, name = "count"). It looks like this example:

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

I would like to add a column to this tibble that gives the number of categories each word appears in, i.e. the same number on every row where that word appears.

I tried the following code, which works in principle and gives the result I want:

owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))

However, since my data has tens of thousands of rows, this takes a rather long time. Is there a more efficient way to achieve this?

Comments (2)

萧瑟寒风 2025-01-27 06:46:44

We could use add_count:

library(dplyr)

 owl %>% 
   add_count(word)

output:

  category word  count     n
     <dbl> <chr> <int> <int>
1        0 hello    98     3
2        1 hello    30     3
3        2 hello    37     3
4       -1 world    22     4
5        0 world    80     4
6        1 world    18     4
7        2 world    19     4
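
As a hedged sketch of two variations on the answer above: add_count() accepts a name argument, so the new column can be called sum_t as in the question, and a grouped mutate with n_distinct(category) counts categories explicitly. The fixed count values below are copied from the printed output only to make the example reproducible, and res_add and res_grp are illustrative names. Because the data comes from count(category, word), each (category, word) pair should appear only once, so both approaches are expected to agree.

```r
library(dplyr)

# Same shape as the example in the question, with fixed counts for reproducibility
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = c(98, 30, 37, 22, 80, 18, 19))

# add_count() with a custom column name (sum_t is the name used in the question)
res_add <- owl %>% add_count(word, name = "sum_t")

# Grouped mutate; n_distinct() counts categories explicitly, which would also
# stay correct if a (category, word) pair were ever repeated
res_grp <- owl %>%
  group_by(word) %>%
  mutate(sum_t = n_distinct(category)) %>%
  ungroup()
```

Both results carry the same sum_t column here; the grouped version is only worth the extra typing if duplicate (category, word) rows are possible.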

长不大的小祸害 2025-01-27 06:46:44

I played around with a few solutions and microbenchmark. I added TarJae's suggestion to the benchmark. I also wanted to try the fantastic ave function, just to see how it compares with a dplyr solution.

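library(dplyr)
library(ggplot2)        # provides autoplot() and theme_bw()
library(microbenchmark)

n <- 500

# Random test data: 5 distinct random words spread over n rows
owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(stringi::stri_rand_strings(5, 10), n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

mb <- microbenchmark(
  # original approach from the question: one filter() per row
  op = owl2 %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()})),
  group_by = owl2 %>% group_by(word) %>% mutate(n = n()),
  add_count = owl2 %>% add_count(word),
  # note: ave() returns the counts as character here because owl2$word is character
  ave = cbind(owl2, n = ave(owl2$word, owl2$word, FUN = length)),
  times = 50L)

autoplot(mb) + theme_bw()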

The conclusion is that the elegant solution using add_count will save you a lot of time, and ave also speeds up the process considerably.
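
Before comparing timings, it may be worth checking that the fast variants actually return the same counts. The following is a minimal sketch under some assumptions: the seed is arbitrary, the words vector is a hypothetical stand-in for the random strings, and n_add, n_grp, n_ave_chr and n_ave_int are illustrative names. It also shows that ave() returns a vector of the same type as its first argument, so passing the character column gives character counts, while passing an integer index keeps them integer.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 500
# hypothetical stand-ins for the random strings used in the benchmark
words <- c("alpha", "beta", "gamma", "delta", "epsilon")

owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(words, n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

# Per-word row counts from the two dplyr variants
n_add <- (owl2 %>% add_count(word))$n
n_grp <- (owl2 %>% group_by(word) %>% mutate(n = n()) %>% ungroup())$n

# ave() writes the group results back into a copy of its first argument,
# so a character input yields character counts
n_ave_chr <- ave(owl2$word, owl2$word, FUN = length)

# Using an integer index as the first argument keeps the counts integer
n_ave_int <- ave(seq_along(owl2$word), owl2$word, FUN = length)
```

If an integer column is needed from the ave variant, the seq_along() form (or wrapping the result in as.integer()) avoids the character conversion.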

[Figure: benchmark plot]
