How to count the occurrences of a word/token in each row of a tibble

Posted on 2025-01-20 06:46:44

Hello, I have a tibble that comes out of a pipe through tidytext::unnest_tokens() and count(category, word, name = "count"). It looks like this example:

library(dplyr)  # provides tibble() and %>%

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

I would like to add a column to this tibble that gives the number of categories the word appears in, i.e. the same number on every row where that word appears.

I tried the following code, which works in principle. The result is what I want.

owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))

However, since my data has tens of thousands of rows, this takes a rather long time. Is there a more efficient way to achieve this?

萧瑟寒风 2025-01-27 06:46:44

We could use add_count:

library(dplyr)

owl %>%
  add_count(word)

Output:

  category word  count     n
     <dbl> <chr> <int> <int>
1        0 hello    98     3
2        1 hello    30     3
3        2 hello    37     3
4       -1 world    22     4
5        0 world    80     4
6        1 world    18     4
7        2 world    19     4
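For reference, add_count(word) is shorthand for grouping by word, adding n(), and ungrouping again. The sketch below assumes the owl tibble from the question, with a seed added only for reproducibility. The n_distinct() variant is an extra assumption for data where a category/word pair could sit on more than one row, which should not happen right after count(category, word).

```r
library(dplyr)

set.seed(1)  # assumption: seed added only to make the example reproducible
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

# Rows per word, written with add_count() and with the long form it abbreviates
by_rows <- owl %>% add_count(word)
long_form <- owl %>% group_by(word) %>% mutate(n = n()) %>% ungroup()

# Distinct categories per word; differs from the row count only if a
# category/word pair is repeated
by_category <- owl %>%
  group_by(word) %>%
  mutate(n_categories = n_distinct(category)) %>%
  ungroup()

# Compare the three results
identical(by_rows$n, long_form$n)
identical(by_rows$n, by_category$n_categories)
```

On the question's data each category/word pair is unique, so the row count and the distinct-category count coincide; add_count() is then the shorter choice.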

长不大的小祸害 2025-01-27 06:46:44

I played around with a few solutions and microbenchmark. I added TarJae's suggestion to the benchmark. I also wanted to try the fantastic base R ave function, just to see how it compares with a dplyr solution.

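library(dplyr)           # tibble(), %>%, mutate(), add_count()
library(ggplot2)         # autoplot(), theme_bw()
library(microbenchmark)

n <- 500

# Synthetic data: 5 distinct random 10-character words spread over n rows
owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(stringi::stri_rand_strings(5, 10), n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

mb <- microbenchmark(
  op = owl2 %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()})),
  group_by = owl2 %>% group_by(word) %>% mutate(n = n()),
  add_count = owl2 %>% add_count(word),
  # seq_along() keeps the count an integer; ave(owl2$word, ...) would return it as character
  ave = cbind(owl2, n = ave(seq_along(owl2$word), owl2$word, FUN = length)),
  times = 50L)

# Plot the timing distribution of each expression
autoplot(mb) + theme_bw()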

The conclusion is that the elegant solution using add_count will save you a lot of time, and that ave also speeds the process up considerably.
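Before reading too much into the timings, it may be worth confirming that the candidates return the same counts. This is a minimal sketch under a few assumptions: the seed and the five placeholder words are hypothetical stand-ins for the random strings above, and the original sapply() approach is restated in base R. Note that ave() returns a vector of the same type as its first argument, so it is applied to seq_along() here to get integer counts rather than character ones.

```r
library(dplyr)

set.seed(42)  # assumption: seed added only for reproducibility
n <- 500
words <- c("alpha", "beta", "gamma", "delta", "epsilon")  # hypothetical words

owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(words, n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

# Row-by-row count, equivalent in result to the question's sapply()/filter() code
n_op <- sapply(seq_len(nrow(owl2)), function(i) sum(owl2$word == owl2$word[[i]]))

# dplyr: rows per word
n_add_count <- add_count(owl2, word)$n

# Base R: group length per word, kept as integer by using seq_along()
n_ave <- ave(seq_along(owl2$word), owl2$word, FUN = length)

# Both comparisons check that the faster versions agree with the original
all(n_op == n_add_count)
all(n_ave == n_add_count)
```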