How do I compute cosine similarity between individual documents in two sets using quanteda?
I have two sets of documents: One with approx. 580 news articles and one with approx. 560 political decisions. I want to find out whether there are similarities between the individual news articles and the political decisions. This means that each individual news article should be compared with each of the 560 political decisions, using cosine similarity. I am using the quanteda package.
This is what I have tried so far:
library(readtext)
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here in quanteda >= 3.0
library(writexl)

# Read both document sets and turn them into corpora
news_articles <- readtext(paste0(txt_directory, "*"), encoding = "UTF-8")
news_articles_corpus <- corpus(news_articles)
# (txt_directory presumably points at the political decisions folder here)
pol_decisions <- readtext(paste0(txt_directory, "*"), encoding = "UTF-8")
pol_decisions_corpus <- corpus(pol_decisions)

# Tokenize, lowercase, drop Danish stopwords, and stem the news articles
news_articles_toks <- tokens(
  news_articles_corpus,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  verbose = TRUE)
news_articles_toks <- tokens_tolower(news_articles_toks, keep_acronyms = FALSE)
news_articles_toks <- tokens_select(news_articles_toks, stopwords("danish"), selection = "remove")
news_articles_toks <- tokens_wordstem(news_articles_toks)

# Same preprocessing for the political decisions
pol_decisions_toks <- tokens(
  pol_decisions_corpus,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  verbose = TRUE)
pol_decisions_toks <- tokens_tolower(pol_decisions_toks, keep_acronyms = FALSE)
pol_decisions_toks <- tokens_select(pol_decisions_toks, stopwords("danish"), selection = "remove")
pol_decisions_toks <- tokens_wordstem(pol_decisions_toks)

# Document-feature matrices for each set
news_articles_dfm <- dfm(news_articles_toks)
pol_decisions_dfm <- dfm(pol_decisions_toks)

# Cosine similarity, sorted from most to least similar, exported to Excel
cosine <- textstat_simil(
  news_articles_dfm,
  y = pol_decisions_dfm,
  selection = NULL,
  margin = "documents",
  method = "cosine")
cosine <- as.data.frame(cosine)
cosine <- cosine[order(-cosine$cosine), ]
write_xlsx(cosine, "Test.xlsx")
My problem is that when I run the textstat_simil function, R returns cosine values for all combinations - both within and between the two sets of documents. But I don't want to know the cosine similarity between two news articles or between two political decisions. I only want to know the cosine similarity between a news article and a political decision.
Is there any way to solve this issue?
Comments (1)
Only use x and y in textstat_simil().
Created on 2022-06-25 by the reprex package (v2.0.1)
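A minimal sketch of that suggestion, assuming quanteda 3.x with the quanteda.textstats package installed. The toy texts and the document names (news1, dec1, and so on) are hypothetical stand-ins for the two real document sets. Building one dfm and splitting it is only one way to make sure both sets share the same feature columns; it is not necessarily how the original example was written.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy documents standing in for the two sets
news <- c(news1 = "the government raises taxes on fuel",
          news2 = "local football club wins the cup")
decisions <- c(dec1 = "parliament votes to raise fuel taxes",
               dec2 = "new funding for local sports clubs",
               dec3 = "government approves budget")

# Build a single dfm so both sets have identical feature columns,
# then split it back into the two sets by document name
dfmat <- dfm(tokens(c(news, decisions), remove_punct = TRUE))
news_dfm <- dfm_subset(dfmat, docnames(dfmat) %in% names(news))
dec_dfm  <- dfm_subset(dfmat, docnames(dfmat) %in% names(decisions))

# x = news articles, y = political decisions:
# only news-versus-decision pairs are compared
sim <- textstat_simil(news_dfm, y = dec_dfm,
                      margin = "documents", method = "cosine")

# Long format, most similar pairs first
sim_df <- as.data.frame(sim)
sim_df <- sim_df[order(-sim_df$cosine), ]
print(sim_df)
```

With x and y both supplied, the result has one row per document in x and one column per document in y, so no news-to-news or decision-to-decision pairs appear in it. The same pattern should carry over to news_articles_dfm and pol_decisions_dfm from the question.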