Quanteda: displaying the actual differences between texts

Posted on 2025-01-12 20:10:52

I managed to compute the similarity between my texts with the cosine method, using the following code:

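library("quanteda")
library("quanteda.textstats")  # textstat_simil() lives here as of quanteda v3

# build a document-feature matrix without punctuation or Portuguese stopwords
dfmat <- corpus_subset(corpusnew) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("portuguese")) %>%
    dfm()

# cosine similarity between every pair of documents
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)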

And I get the following matrix:

       text1 text2 text3 text4 text5 
text1 1.000 0.801 0.801 0.801 0.798 

However, I would like to know the actual words that account for the difference, not just how much the texts differ or are alike. Is there a way?

Thanks

Comments (2)

別甾虛僞 2025-01-19 20:10:52

How about comparing tokens using setdiff()?

require(quanteda)
toks <- tokens(corpus(c("a b c d", "a e")))
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "d"
#> 
#> text2 :
#> [1] "a" "e"

setdiff(toks[[1]], toks[[2]])
#> [1] "b" "c" "d"
setdiff(toks[[2]], toks[[1]])
#> [1] "e"
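
To build on the idea above, here is a minimal sketch that wraps the two setdiff() calls and adds the shared words via intersect(). The helper name compare_tokens() is hypothetical (it is not part of quanteda), and it works on unique token types, so how often a word occurs is ignored.

```r
library("quanteda")

# Hypothetical helper (not a quanteda function): words unique to each of two
# documents in a tokens object, plus the words they share.
compare_tokens <- function(toks, i, j) {
    x <- unique(toks[[i]])  # token types in document i
    y <- unique(toks[[j]])  # token types in document j
    list(
        only_first  = setdiff(x, y),
        only_second = setdiff(y, x),
        shared      = intersect(x, y)
    )
}

toks <- tokens(c(text1 = "a b c d", text2 = "a e"))
res <- compare_tokens(toks, 1, 2)
res
```

Because this compares sets of types, a word that appears ten times in one text and once in the other still counts as shared.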

小霸王臭丫头 2025-01-19 20:10:52

This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.

So if you wanted to see the features that differ most between text1 and text2, for instance, you could slice the documents you want to compare from the dfm, and then set margin = "features" to get the similarity between features across those documents.

library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmat <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()

library("quanteda.textstats")
sim <- textstat_simil(dfmat[1:2, ], margin = "features", method = "cosine")

Now we can examine the pairwise similarities (greatest and smallest) by converting the similarity matrix to a data.frame, and sorting it.

# most similar features
as.data.frame(sim) %>%
    dplyr::arrange(desc(cosine)) %>%
    dplyr::filter(cosine < 1) %>%
    head(10)
#>    feature1   feature2    cosine
#> 1   present        may 0.9994801
#> 2   country        may 0.9994801
#> 3       may government 0.9991681
#> 4   present   citizens 0.9988681
#> 5   country   citizens 0.9988681
#> 6   present     people 0.9988681
#> 7   country     people 0.9988681
#> 8   present     united 0.9988681
#> 9   country     united 0.9988681
#> 10  present government 0.9973337
    
# most different features
as.data.frame(sim) %>%
    dplyr::arrange(cosine) %>%
    head(10)
#>      feature1   feature2    cosine
#> 1  government       upon 0.1240347
#> 2  government      chief 0.1240347
#> 3  government magistrate 0.1240347
#> 4  government     proper 0.1240347
#> 5  government     arrive 0.1240347
#> 6  government   endeavor 0.1240347
#> 7  government    express 0.1240347
#> 8  government       high 0.1240347
#> 9  government      sense 0.1240347
#> 10 government  entertain 0.1240347

Created on 2022-03-08 by the reprex package (v2.0.1)
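
As a complementary and much simpler view (my own sketch, not part of the reprex above), you could also subtract the raw feature counts of the two documents. The toy texts and the variable name delta are made up for illustration; with real documents of different lengths you would probably want to weight the dfm first, for example with dfm_weight(scheme = "prop").

```r
library("quanteda")

# Toy documents, invented for illustration only
dfmat <- dfm(tokens(c(text1 = "a a b c", text2 = "a c c d")))

# Dense matrix of counts for the two documents being compared
m <- as.matrix(dfmat[1:2, ])

# Positive values: more frequent in text1; negative: more frequent in text2
delta <- m[1, ] - m[2, ]

# Keep only the features whose counts differ, smallest to largest
sort(delta[delta != 0])
```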

There are other ways to find the words that differ most between documents, such as "keyness": for instance, running quanteda.textstats::textstat_keyness() with text1 as the target and text2 as the reference, where the head and tail of the resulting data.frame will tell you the most dissimilar features.
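
A minimal sketch of the keyness approach, assuming the same two inaugural speeches as above and the default chi-squared measure. Here target = 1L treats the first document as the target and the remaining one as the reference.

```r
library("quanteda")
library("quanteda.textstats")

# Same preprocessing as above, restricted to the two documents being compared
toks <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)

# Keyness of each feature in the first document relative to the second
key <- textstat_keyness(dfmat, target = 1L)

head(key, 10)  # features most over-represented in the target document
tail(key, 10)  # features most over-represented in the reference document
```

The n_target and n_reference columns give the raw counts behind each score, which makes it easy to check whether a high score rests on only a handful of occurrences.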
