R: How to count all the words in each row of a data frame and add the output to a new column? Ideally using tidyverse or tidytext

Posted on 2025-02-04 07:44:21

I'm trying to find the position of words in a text, and also the total word count of the same text.

library(tidyverse)
library(tidytext)

# Three example texts, one per row
txt <- tibble(text = c(
  "we're meeting here today to talk about our earnings. we will also discuss global_warming.",
  "hi all, global_warming and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  "we will discuss global_warming tomorrow, today the focus is our Q3 earnings"))

# Words whose position we want to look up
dict <- tibble(words = c("global_warming"))

x <- txt %>%
  unnest_tokens(output = "words", input = "text", drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_loc = row_number()) %>%  # position of each token within its text
  ungroup() %>%
  inner_join(dict, by = "words")       # keep only the dictionary words

This gives me the following output:

# A tibble: 3 x 3
  text                                                                                        words        word_loc
  <chr>                                                                                       <chr>           <int>
1 we're meeting here today to talk about our earnings. we will also discuss global_warming.   global_warm…       14
2 hi all, global_warming and the on-going strike is at the top of our agenda, because unioni… global_warm…        3
3 we will discuss global_warming tomorrow, today the focus is our Q3 earnings                 global_warm…        4

How can I add a column that gives the total word count for each row?

Comments (1)

哽咽笑 2025-02-11 07:44:21

We can use str_count to get the total number of words for each string, where \\S+ matches every run of non-whitespace characters.

library(tidyverse)

x %>%
  mutate(count = str_count(text, "\\S+"))

Or another option using base R:

x$count <- lengths(gregexpr("\\S+", x$text))

Output

  text                                           words word_loc count
  <chr>                                          <chr>    <int> <int>
1 we're meeting here today to talk about our ea… glob…       14    14
2 hi all, global_warming and the on-going strik… glob…        3    20
3 we will discuss global_warming tomorrow, toda… glob…        4    12
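
One caveat with the base R one-liner may be worth knowing: gregexpr() returns -1 when nothing matches, so lengths() reports 1 rather than 0 for an empty or blank string. The sketch below, which uses a made-up texts vector purely for illustration, compares it with a variant that goes through regmatches():

```r
# Hypothetical sample input: one ordinary sentence, one empty string, one blank string.
texts <- c("we will discuss global_warming tomorrow", "", "   ")

# gregexpr() returns -1 (a length-1 vector) when nothing matches,
# so lengths() reports 1 for strings that contain no words at all.
naive_count <- lengths(gregexpr("\\S+", texts))

# regmatches() extracts the actual matches and gives character(0) when
# there are none, so lengths() reports 0 for empty or blank strings.
safe_count <- lengths(regmatches(texts, gregexpr("\\S+", texts)))

print(naive_count)  # 5 1 1
print(safe_count)   # 5 0 0
```

For the three example texts in the question this makes no difference, since none of them is empty; str_count() already returns 0 in that case.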

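Or, if you want contractions, hyphenated words, etc. to be split and each part counted separately, you can use \\w+ instead (\\w matches letters, digits, and underscores, so "we're" counts as two words, as does "on-going"):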

x %>%
  mutate(count = str_count(text, "\\w+"))

  text                                           words word_loc count
  <chr>                                          <chr>    <int> <int>
1 we're meeting here today to talk about our ea… glob…       14    15
2 hi all, global_warming and the on-going strik… glob…        3    21
3 we will discuss global_warming tomorrow, toda… glob…        4    12
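
If the total should follow the same tokenization that produced word_loc, another possibility (not part of the answer above, just a sketch) is to count the tokens inside the question's own pipeline with n() before the join. The column name n_words is an assumption chosen for illustration, and grouping by text assumes every text is unique, as in the question:

```r
library(dplyr)
library(tibble)
library(tidytext)

txt <- tibble(text = c(
  "we're meeting here today to talk about our earnings. we will also discuss global_warming.",
  "hi all, global_warming and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  "we will discuss global_warming tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = "global_warming")

x <- txt %>%
  unnest_tokens(words, text, drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_loc = row_number(),  # position of each token within its text
         n_words  = n()) %>%       # total number of tokens in that text (hypothetical column name)
  ungroup() %>%
  inner_join(dict, by = "words")   # keep only the dictionary words

print(x)
```

Because unnest_tokens() applies its own word-splitting rules, the totals can differ from the regex-based counts above.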