Splitting strings and generating a frequency table in R

Posted on 2024-12-23 05:26:15

I have a column of firm names in an R dataframe that goes something like this:

"ABC Industries"  
"ABC Enterprises"  
"123 and 456 Corporation"  
"XYZ Company"

And so on. I'm trying to generate frequency tables of every word that appears in this column, so for example, something like this:

Industries   10  
Corporation  31  
Enterprise   40  
ABC          30  
XYZ          40

I'm relatively new to R, so I was wondering what a good way to approach this would be. Should I be splitting the strings and placing every distinct word into a new column? Is there a way to split up a multi-word row into multiple rows with one word each?

骄傲 2024-12-30 05:26:15

If you wanted to, you could do it in a one-liner:

R> text <- c("ABC Industries", "ABC Enterprises", 
+            "123 and 456 Corporation", "XYZ Company")
R> table(do.call(c, lapply(text, function(x) unlist(strsplit(x, " ")))))

        123         456         ABC         and     Company 
          1           1           2           1           1 
Corporation Enterprises  Industries         XYZ 
          1           1           1           1 
R>

Here I use strsplit() to break each entry into its components; it returns a list, which unlist() flattens into a character vector. lapply() therefore yields a list of character vectors, and do.call(c, ...) simply concatenates them into one vector, which table() summarises.
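Since strsplit() is itself vectorised over its input, the lapply()/do.call() step can probably be skipped. A minimal sketch, assuming the names sit in a data frame column (the names `firms` and `name` are made up here for illustration and are not from the question):

```r
# Hypothetical data frame standing in for the asker's data
firms <- data.frame(
  name = c("ABC Industries", "ABC Enterprises",
           "123 and 456 Corporation", "XYZ Company"),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per row; unlist() flattens them
words <- unlist(strsplit(firms$name, " ", fixed = TRUE))

# Tabulate and sort so the most frequent words come first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Optional: convert to a data frame with one row per distinct word
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
```

The data frame form is handy if the counts need to be filtered or merged later; the plain table is enough for a quick look.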

寄风 2024-12-30 05:26:15

Here is another one-liner. It uses paste() to combine all of the column entries into a single long text string, which it then splits apart and tabulates:

text <- c("ABC Industries", "ABC Enterprises", 
         "123 and 456 Corporation", "XYZ Company")

table(strsplit(paste(text, collapse=" "), " "))
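One caveat with splitting on a single literal space: names with double, leading or trailing spaces produce empty-string tokens, and "ABC" and "abc" are counted separately. A rough sketch of a more forgiving variant, assuming that case-insensitive counts are acceptable (the sample input is made up):

```r
# Made-up input with irregular spacing and mixed case
text <- c("ABC  Industries", " abc Enterprises", "XYZ Company ")

# trimws() drops leading/trailing blanks; "\\s+" splits on runs of whitespace
tokens <- unlist(strsplit(trimws(text), "\\s+"))

# tolower() merges "ABC" and "abc" into a single entry
freq <- table(tolower(tokens))
print(freq)
```

Drop the tolower() call if the original capitalisation should be kept.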

清醇 2024-12-30 05:26:15

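You can use the tidytext and dplyr packages. Note that unnest_tokens() lower-cases words by default, which is why the output shows abc rather than ABC: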

set.seed(42)

text <- c("ABC Industries", "ABC Enterprises", 
       "123 and 456 Corporation", "XYZ Company")

data <- data.frame(category = sample(text, 100, replace = TRUE),
                   stringsAsFactors = FALSE)

library(tidytext)
library(dplyr)

data %>%
  unnest_tokens(word, category) %>%
  group_by(word) %>%
  count()

#> # A tibble: 9 x 2
#> # Groups:   word [9]
#>          word     n
#>         <chr> <int>
#> 1         123    29
#> 2         456    29
#> 3         abc    45
#> 4         and    29
#> 5     company    26
#> 6 corporation    29
#> 7 enterprises    21
#> 8  industries    24
#> 9         xyz    26
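For comparison, the same one-word-per-row shape can be approximated without extra packages. This is only a sketch under assumptions: the `id` column and the sample rows are invented for illustration, and it splits on whitespace only, whereas unnest_tokens() also handles punctuation.

```r
# Hypothetical data frame; `id` and `category` are illustrative names
data <- data.frame(
  id = 1:3,
  category = c("ABC Industries", "123 and 456 Corporation", "ABC Company"),
  stringsAsFactors = FALSE
)

# Split each name into lower-cased words (loosely mimicking unnest_tokens())
pieces <- strsplit(tolower(data$category), "\\s+")

# Repeat each id once per word so every word stays linked to its source row
long <- data.frame(
  id   = rep(data$id, times = lengths(pieces)),
  word = unlist(pieces),
  stringsAsFactors = FALSE
)

# Count rows per word, most frequent first
counts <- as.data.frame(sort(table(word = long$word), decreasing = TRUE),
                        stringsAsFactors = FALSE)
print(counts)
```

Here `long` answers the "multiple rows with one word" part of the question, and `counts` is the frequency table built from it.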
