R: Combining frequency lists of different lengths by label?

Posted on 2024-12-26 05:47:34

I'm a newbie to R, but really like it and want to improve constantly. Now, after searching for a while, I need to ask you for help.

This is the given case:

I have two sentences (sentence.1 and sentence.2; all words are already lower-case) and create sorted frequency lists of their words:

sentence.1 <- "bob buys this car, although his old car is still fine." # store the first sentence
sentence.2 <- "a car can cost you very much per month."

# Split each sentence at runs of non-word characters (commands courtesy of Stefan Gries)
sentence.1.list <- strsplit(sentence.1, "\\W+", perl = TRUE)
sentence.2.list <- strsplit(sentence.2, "\\W+", perl = TRUE)

# Flatten each list into a character vector
sentence.1.vector <- unlist(sentence.1.list)
sentence.2.vector <- unlist(sentence.2.list)

# Finally, create the frequency list for each sentence
sentence.1.freq <- table(sentence.1.vector)
sentence.2.freq <- table(sentence.2.vector)

These are the results:

sentence.1.freq:
although      bob     buys      car     fine      his       is      old    still     this 
       1        1        1        2        1        1        1        1        1        1

sentence.2.freq:
a   can   car  cost month  much   per  very   you 
1     1     1     1     1     1     1     1     1

Now, how can I combine these two frequency lists so that I get the following?

 a although bob buys can car cost fine his is month much old per still this very you
NA        1   1    1  NA   2   NA    1   1  1    NA   NA   1  NA     1    1   NA  NA
 1       NA  NA   NA   1   1    1   NA  NA NA     1    1  NA   1    NA   NA    1   1

This "table" should be "flexible": when a new sentence contains a new word, e.g. "and", the table should add a column labeled "and" between "a" and "although".

I thought of adding each new sentence as a new row, appending all words that are not yet in the list as new columns (here, "and" would end up to the right of "you"), and then sorting the columns again. However, I haven't managed this, because I can't even get the new sentence's word frequencies lined up with the existing labels. When a word such as "car" appears again, its frequency should go into the new sentence's row under the existing "car" column; when a word such as "you" appears for the first time, its frequency should go into the new sentence's row under a new column labeled "you".

悟红尘 2025-01-02 05:47:34

This isn't exactly what you describe, but to me what you're aiming for makes more sense organized by row rather than by column (and R handles data organized this way a bit more easily anyway).

# Convert tables to data frames
a1 <- as.data.frame(sentence.1.freq)
a2 <- as.data.frame(sentence.2.freq)

# There are other options here, see note below
colnames(a1) <- colnames(a2) <- c("word", "freq")
# Then merge
merge(a1, a2, by = "word", all = TRUE)
       word freq.x freq.y
1  although      1     NA
2       bob      1     NA
3      buys      1     NA
4       car      2      1
5      fine      1     NA
6       his      1     NA
7        is      1     NA
8       old      1     NA
9     still      1     NA
10     this      1     NA
11        a     NA      1
12      can     NA      1
13     cost     NA      1
14    month     NA      1
15     much     NA      1
16      per     NA      1
17     very     NA      1
18      you     NA      1

You can then keep using merge to add more sentences. I renamed the columns for simplicity, but there are other options. Using the by.x and by.y arguments instead of just by tells merge which columns to merge on if the names aren't the same in each data frame. Also, the suffixes argument in merge controls how the count columns are given unique names. The default is to append .x and .y, but you can change that.
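
As a rough sketch of how repeated merging could look with more than two sentences, the code below wraps the steps from the question in a helper and folds merge over a list. The helper word_freq, the labels s1 to s3, and the third sentence are made up for illustration. Naming each count column after its sentence is one way to avoid relying on the .x and .y suffixes, which would start to clash after a few merges. The last step transposes the result into the one-row-per-sentence layout asked for in the question.

```r
# Hypothetical helper: one data frame per sentence, with the count column
# named after the sentence so that repeated merges need no suffixes.
word_freq <- function(sentence, label) {
  words <- unlist(strsplit(sentence, "\\W+", perl = TRUE))
  freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
  colnames(freq) <- c("word", label)
  freq
}

sentences <- c(
  s1 = "bob buys this car, although his old car is still fine.",
  s2 = "a car can cost you very much per month.",
  s3 = "bob and his car are old."  # made-up third sentence
)

# One frequency data frame per sentence
freqs <- Map(word_freq, sentences, names(sentences))

# Fold merge() over the list; all = TRUE keeps words missing from either side
long <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), freqs)

# Sort explicitly by word so the column order does not depend on merge()
long <- long[order(long$word), ]

# Transpose: one row per sentence, one column per word, NA where absent
wide <- t(as.matrix(long[, -1]))
colnames(wide) <- long$word
wide
```

A new sentence only needs another entry in sentences. Any word it introduces (here "and" and "are") gets its own column in alphabetical position, and the earlier sentences show NA in that column.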