Quanteda: show the actual differences between texts
I managed to calculate the similarity between texts with the cosine method, using the following:
library("quanteda")
dfmat <- corpus_subset(corpusnew) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("portuguese")) %>%
  dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
And I get the following matrix:
      text1 text2 text3 text4 text5
text1 1.000 0.801 0.801 0.801 0.798
However, I would like to know the actual words that account for the difference, not just how much the texts differ or are alike. Is there a way?
Thanks
Comments (2)
How about comparing tokens using setdiff()?
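A minimal sketch of the setdiff() idea, assuming two short made-up documents rather than the asker's corpusnew. The names txt, toks_list, only_in_text1 and only_in_text2 are illustrative only. Note that setdiff() works on word types, so it reports which words appear in one text and not the other, but ignores how often they occur.

```r
library("quanteda")

# Two made-up documents, for illustration only
txt <- c(text1 = "The cat sat on the mat.",
         text2 = "The dog sat on the rug.")

# Tokenise, drop punctuation and lowercase, then convert to a named list
# of character vectors (one vector per document)
toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))
toks_list <- as.list(toks)

# Word types that occur in one document but not in the other
only_in_text1 <- setdiff(toks_list$text1, toks_list$text2)
only_in_text2 <- setdiff(toks_list$text2, toks_list$text1)

print(only_in_text1)
print(only_in_text2)
```

With more than two documents, the same comparison would have to be repeated for each pair of interest.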
This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.
So if you wanted the features most different between text1 and text2, for instance, you could slice the documents you want to compare from the dfm, and then change margin = "features" to get the similarity of the documents across features. You can then examine the pairwise similarities (greatest and smallest) by converting the similarity matrix to a data.frame and sorting it.
Created on 2022-03-08 by the reprex package (v2.0.1)
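A minimal sketch of the steps described above, assuming three tiny made-up documents in place of the real corpus. The object names (txt, dfmat_pair, simil_df, simil_sorted) are illustrative, and the dfm_trim() step is an added assumption: it drops features that occur in neither of the two selected documents, so that all-zero columns do not enter the comparison.

```r
library("quanteda")
library("quanteda.textstats")

# Made-up documents, for illustration only
txt <- c(text1 = "apple apple banana cherry",
         text2 = "apple banana banana date",
         text3 = "cherry date date fig")
dfmat <- dfm(tokens(txt))

# Slice out the two documents to compare, and drop features that
# occur in neither of them (assumption: avoids all-zero columns)
dfmat_pair <- dfm_trim(dfmat[c("text1", "text2"), ], min_termfreq = 1)

# Similarity computed across features instead of across documents
simil <- textstat_simil(dfmat_pair, method = "cosine", margin = "features")

# Convert to a data.frame of feature pairs and sort by the similarity
# value, which is held in the third column
simil_df <- as.data.frame(simil)
simil_sorted <- simil_df[order(simil_df[[3]]), ]

head(simil_sorted)  # least similar feature pairs
tail(simil_sorted)  # most similar feature pairs
```

With only two documents each feature is a vector of two counts, so the similarities are coarse; the sorted pairs are best read as a rough guide to which words pattern together or apart across the two texts.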
There are other ways to compare the words most different between documents, such as "keyness": for instance, quanteda.textstats::textstat_keyness() between text1 and text2, where the head and tail of the resulting data.frame will tell you the most dissimilar features.
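A minimal sketch of the keyness approach, again assuming two made-up documents; txt, dfmat and key are illustrative names. Here text1 is treated as the target and text2 as the reference, using the default chi-squared measure.

```r
library("quanteda")
library("quanteda.textstats")

# Made-up documents, for illustration only
txt <- c(text1 = "tax tax tax economy economy people",
         text2 = "war war war people people economy")
dfmat <- dfm(tokens(txt, remove_punct = TRUE))

# Keyness of each feature in text1 (target) relative to text2 (reference)
key <- textstat_keyness(dfmat, target = "text1")

head(key)  # features most characteristic of text1
tail(key)  # features most characteristic of text2
```

The result is sorted by the keyness statistic, so words over-represented in the target appear at the top and words over-represented in the reference appear at the bottom.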