How to count the occurrences of a word/token for each row of a tibble
Hello, I have a tibble that comes out of a pipe from tidytext::unnest_tokens() and count(category, word, name = "count"). It looks like this example:
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
word = c(rep("hello", 3), rep("world", 4)),
count = sample(1:100, 7))
I would like to get this tibble with an additional column that gives the number of categories the word appears in, i.e. the same number on every row where that word appears.
I tried the following code, which works in principle. The result is what I want.
owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))
However, since my data has tens of thousands of rows, this takes a rather long time. Is there a more efficient way to achieve this?
Comments (2)
We could use add_count.
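The code for this answer did not survive on this page, so the following is only a minimal sketch of how add_count might be applied to the example tibble from the question. The column name sum_t is borrowed from the question, and the set.seed() call is an assumption added so that the sample() draw is reproducible.

```r
library(dplyr)
library(tibble)

set.seed(1)  # assumption: not in the question, only makes sample() reproducible
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

# add_count() appends the number of rows per `word` and keeps every row
result <- owl %>% add_count(word, name = "sum_t")
result
```

Because count(category, word) yields one row per category/word pair, the number of rows per word should equal the number of categories the word appears in.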
I played around with a few solutions and microbenchmark. I added TarJae's proposition to the benchmark. I also wanted to use the fantastic ave function, just to see how it would compare to a dplyr solution.
The conclusion is that the elegant solution using add_count will save you a lot of time, and that ave also speeds up the process considerably.
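The benchmark code and its timings are not preserved here. As a rough sketch, an ave-based version and a comparison against add_count could look like the following; the expression labels, times = 20L, and the owl_ave name are illustrative assumptions, not the answer's original setup.

```r
library(dplyr)
library(tibble)

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

# Base R: ave() applies FUN within each group of `word` and returns
# a vector aligned with the original rows
owl_ave <- owl
owl_ave$sum_t <- ave(seq_along(owl$word), owl$word, FUN = length)

# Optional timing comparison, run only if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    add_count = add_count(owl, word, name = "sum_t"),
    ave       = ave(seq_along(owl$word), owl$word, FUN = length),
    times = 20L
  ))
}
```

On a seven-row example the timings say little; the difference described above would only show up on data with tens of thousands of rows, as in the question.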