Quanteda: showing the actual differences between texts

Posted on 2025-01-12 20:10:52

I managed to calculate the similarity between my texts with the cosine method, using the following:

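library("quanteda")
# textstat_simil() lives in quanteda.textstats as of quanteda 3.0
library("quanteda.textstats")

# build a document-feature matrix without punctuation or Portuguese stopwords
dfmat <- corpus_subset(corpusnew) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("portuguese")) %>%
    dfm()

# pairwise cosine similarity between documents
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)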

And I get the following matrix:

       text1 text2 text3 text4 text5 
text1 1.000 0.801 0.801 0.801 0.798

However, I would like to see the actual words that account for the difference, not just how much the texts differ or are alike. Is there a way?

Thanks

別甾虛僞 2025-01-19 20:10:52

How about comparing tokens using setdiff()?

require(quanteda)
toks <- tokens(corpus(c("a b c d", "a e")))
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "d"
#> 
#> text2 :
#> [1] "a" "e"

setdiff(toks[[1]], toks[[2]])
#> [1] "b" "c" "d"
setdiff(toks[[2]], toks[[1]])
#> [1] "e"
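
If the word counts matter and not only presence or absence, the same idea can be extended with a document-feature matrix. The following is a minimal sketch on the same toy corpus; the names m and count_diff are only illustrative and not part of quanteda.

```r
library("quanteda")

toks <- tokens(corpus(c("a b c d", "a e")))

# dense document-by-feature count matrix (rows: text1, text2)
m <- as.matrix(dfm(toks))

# positive: more frequent in text1; negative: more frequent in text2
count_diff <- m["text1", ] - m["text2", ]

# words that occur more often in text1 than in text2
names(count_diff)[count_diff > 0]

# words that occur more often in text2 than in text1
names(count_diff)[count_diff < 0]
```

Sorting count_diff then ranks the words by how strongly they lean towards one text or the other.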

小霸王臭丫头 2025-01-19 20:10:52

This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.

If you want the features that differ most between text1 and text2, for instance, you could slice the two documents you want to compare from the dfm and set margin = "features", which gives the similarity between features across those documents.

library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmat <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()

library("quanteda.textstats")
sim <- textstat_simil(dfmat[1:2, ], margin = "features", method = "cosine")

Now we can examine the pairwise similarities (greatest and smallest) by converting the similarity matrix to a data.frame, and sorting it.

# most similar features
as.data.frame(sim) %>%
    dplyr::arrange(desc(cosine)) %>%
    dplyr::filter(cosine < 1) %>%
    head(10)
#>    feature1   feature2    cosine
#> 1   present        may 0.9994801
#> 2   country        may 0.9994801
#> 3       may government 0.9991681
#> 4   present   citizens 0.9988681
#> 5   country   citizens 0.9988681
#> 6   present     people 0.9988681
#> 7   country     people 0.9988681
#> 8   present     united 0.9988681
#> 9   country     united 0.9988681
#> 10  present government 0.9973337
    
# most different features
as.data.frame(sim) %>%
    dplyr::arrange(cosine) %>%
    head(10)
#>      feature1   feature2    cosine
#> 1  government       upon 0.1240347
#> 2  government      chief 0.1240347
#> 3  government magistrate 0.1240347
#> 4  government     proper 0.1240347
#> 5  government     arrive 0.1240347
#> 6  government   endeavor 0.1240347
#> 7  government    express 0.1240347
#> 8  government       high 0.1240347
#> 9  government      sense 0.1240347
#> 10 government  entertain 0.1240347

Created on 2022-03-08 by the reprex package (v2.0.1)
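
To tie this back to the cosine score between two documents, the score can also be split into per-feature terms, since the cosine is sum(x * y) / (|x| * |y|). This is only a sketch in base R on the first two inaugural texts, not something quanteda provides directly; the names toks2, m, contrib and only_one are illustrative.

```r
library("quanteda")

toks2 <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE)
toks2 <- tokens_remove(toks2, stopwords("en"))
m <- as.matrix(dfm(toks2))   # dense 2 x n_features count matrix

x <- m[1, ]
y <- m[2, ]

# each feature's share of the cosine similarity: x_i * y_i / (|x| * |y|)
contrib <- (x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# shared features that push the similarity up the most
head(sort(contrib, decreasing = TRUE), 10)

# features found in only one text add nothing to the numerator
# but still enlarge that text's norm, which lowers the similarity
only_one <- (x == 0) != (y == 0)
head(sort(pmax(x, y)[only_one], decreasing = TRUE), 10)
```

The values in contrib sum to the cosine similarity of the two documents, so the second listing shows the frequent words that are missing from one of the texts.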

There are other ways to find the words that most distinguish documents, such as "keyness": for instance, quanteda.textstats::textstat_keyness() comparing text1 against text2, where the head and tail of the resulting data.frame show the features most characteristic of each of the two texts.
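
A minimal sketch of the keyness approach, assuming the first two inaugural texts stand in for text1 and text2 (the object names toks_key, dfmat_key and key are illustrative):

```r
library("quanteda")
library("quanteda.textstats")

toks_key <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE)
toks_key <- tokens_remove(toks_key, stopwords("en"))
dfmat_key <- dfm(toks_key)

# the first document is the target, the second serves as the reference
key <- textstat_keyness(dfmat_key, target = 1L)

# features most associated with the target document
head(key, 10)

# features most associated with the reference document
tail(key, 10)
```

With only two short documents the default chi-squared measure rests on small counts, so the ranking is best read as indicative.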
