R: Combining frequency lists of different lengths by labels?
I'm new to R, but I really like it and want to keep improving. After searching for a while, I need to ask for help.
Here is the situation:
1) I have two sentences (sentence.1 and sentence.2, with all words already lower-case) and create sorted frequency lists of their words:
sentence.1 <- "bob buys this car, although his old car is still fine." # saves the sentence into sentence.1
sentence.2 <- "a car can cost you very much per month."
sentence.1.list <- strsplit(sentence.1, "\\W+", perl=TRUE) # (I have the following commands thanks to Stefan Gries) split the sentence at non-word characters
sentence.2.list <- strsplit(sentence.2, "\\W+", perl=TRUE)
sentence.1.vector <- unlist(sentence.1.list) # turn the list into a character vector
sentence.2.vector <- unlist(sentence.2.list)
sentence.1.freq <- table(sentence.1.vector) # finally, create the frequency list for each sentence
sentence.2.freq <- table(sentence.2.vector)
These are the results:
sentence.1.freq:
although bob buys car fine his is old still this
       1   1    1   2    1   1  1   1     1    1
sentence.2.freq:
a can car cost month much per very you
1   1   1    1     1    1   1    1   1
Now, how can I combine these two frequency lists so that I get the following?
 a although bob buys can car cost fine his is month much old per still this very you
NA        1   1    1  NA   2   NA    1   1  1    NA   NA   1  NA     1    1   NA  NA
 1       NA  NA   NA   1   1    1   NA  NA NA     1    1  NA   1    NA   NA    1   1
This "table" should be "flexible": if a new sentence containing a new word such as "and" is entered, the table should gain a column labeled "and" in its alphabetical position, between "although" and "bob".
I thought of adding each new sentence as a new row, appending all words that are not yet in the list as new columns (here, "and" would go to the right of "you"), and then sorting the columns again. However, I haven't managed this, because matching the new sentence's word frequencies to the existing labels hasn't been working. When a word such as "car" occurs again, the new sentence's frequency of "car" should be written into the new sentence's row and the existing "car" column; when a word occurs for the first time, its frequency should be written into the new sentence's row and a new column labeled with that word.
Comments (1)
This isn't exactly what you describe, but what you're aiming for makes more sense to me organized by row rather than by column (and R handles data organized this way a bit more easily anyway).
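A minimal sketch of the row-wise layout, assuming each frequency table is first converted to a data frame that shares a `word` column. The helper `word_freq` and the column names `word`, `sentence.1`, and `sentence.2` are illustrative choices, not taken from the original post.

```r
# Hypothetical helper: turn one sentence into a two-column data frame
# holding each word and its count.
word_freq <- function(sentence, count_name) {
  words <- unlist(strsplit(sentence, "\\W+", perl = TRUE))
  freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
  names(freq) <- c("word", count_name)
  freq
}

freq.1 <- word_freq("bob buys this car, although his old car is still fine.",
                    "sentence.1")
freq.2 <- word_freq("a car can cost you very much per month.",
                    "sentence.2")

# all = TRUE keeps words that occur in only one sentence;
# the count for the other sentence is then NA.
combined <- merge(freq.1, freq.2, by = "word", all = TRUE)
combined
```

Each word is one row and each sentence is one count column, so a word that is new to the table simply becomes a new row.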
You can then keep using `merge` to add more sentences. I converted the column names for simplicity, but there are other options. Using the `by.x` and `by.y` arguments instead of just `by` in `merge` lets you specify which columns to merge on if the names aren't the same in each data frame. Also, the `suffixes` argument in `merge` controls how the count columns are given unique names. The default is to append `.x` and `.y`, but you can change that.
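The options mentioned above could look roughly like this. It is a sketch under assumptions: the helper `count_words`, the key name `token`, the suffixes `.s1`/`.s2`, and the third sentence are all made up for illustration.

```r
# Hypothetical helper: word counts as a data frame with columns "word" and "Freq".
count_words <- function(sentence) {
  tokens <- unlist(strsplit(sentence, "\\W+", perl = TRUE))
  as.data.frame(table(word = tokens), stringsAsFactors = FALSE)
}

freq.1 <- count_words("bob buys this car, although his old car is still fine.")
freq.2 <- count_words("a car can cost you very much per month.")
names(freq.2)[1] <- "token"  # pretend the key column is named differently here

# by.x / by.y name the key column in each data frame; suffixes
# disambiguates the two count columns, which are both called "Freq".
combined <- merge(freq.1, freq.2, by.x = "word", by.y = "token",
                  all = TRUE, suffixes = c(".s1", ".s2"))

# Add a third (made-up) sentence that introduces the new word "and".
freq.3 <- count_words("bob and his car")
names(freq.3)[2] <- "Freq.s3"
combined <- merge(combined, freq.3, by = "word", all = TRUE)

# If the layout from the question is needed (one row per sentence,
# one column per word), transpose the count columns.
wide <- t(combined[, -1])
colnames(wide) <- combined$word
wide
```

Because `merge` sorts on the key column by default, the new word ends up in its alphabetical position without a separate sorting step.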