R: pairwise_similarity does not compare every row with all other rows
My understanding of the pairwise_similarity function (from the widyr package) in R was that it compares every item to every other item.
So for example, if you had 3 text items:
Item 1 would be compared to item 2 and 3
Item 2 would be compared to item 1 and 3
Item 3 would be compared to item 1 and 2
However, this does not seem to happen here.
Here is my data:
library(dplyr) # provides the %>% pipe used below
d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))
d
column_id description
1 red and yellow
2 yellow and blue
3 green and black # notice how item 3 has no common words with the other two
# unnest the words and remove stop words
d_un_nest <- d %>%
tidytext::unnest_tokens(output = "word",
input = "description",
token = "words") %>%
dplyr::anti_join(tidytext::stop_words) %>%
dplyr::count(column_id, word, sort = TRUE) %>%
tidytext::bind_tf_idf(word, column_id, n)
# complete pairwise similarity
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 2 × 3
item1 item2 similarity
2 1 0.120
1 2 0.120
Notice how item 3 is not compared to items 1 and 2? Why is this? If I add a word to item 3 that is common to items 2 and 3, a few more comparisons appear, but still not all of them:
d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))
d
column_id description
1 red and yellow
2 yellow and blue
3 blue and black
d_un_nest <- d %>%
tidytext::unnest_tokens(output = "word",
input = "description",
token = "words") %>%
dplyr::anti_join(tidytext::stop_words) %>%
dplyr::count(column_id, word, sort = TRUE) %>%
tidytext::bind_tf_idf(word, column_id, n)
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 4 × 3
item1 item2 similarity
2 1 0.245
1 2 0.245
3 2 0.245 # 3 not compared to 1 at any point - why?
2 3 0.245
Is my understanding of pairwise similarity lacking? Or is it that, by default, when two text chunks have no words in common (so their similarity is zero), the row is simply omitted? Does anyone know if this could be the answer?
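One way to sanity-check that guess is to compute the cosine similarity by hand. The following is only a minimal base-R sketch, not widyr's implementation: the `docs` list is hand-built on the assumption that stop-word removal leaves just the colour words, and idf is taken as ln(number of documents / number of documents containing the term), which reproduces the 0.120 shown above.

```r
# Hand-built term lists for the first data set, assuming "and" was removed as a stop word
docs <- list(`1` = c("red", "yellow"),
             `2` = c("yellow", "blue"),
             `3` = c("green", "black"))

vocab  <- sort(unique(unlist(docs)))
n_docs <- length(docs)

# Term frequency: share of each document's words taken by each term
tf <- t(vapply(docs,
               function(w) as.numeric(table(factor(w, levels = vocab))) / length(w),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: ln(n_docs / number of documents containing the term)
idf <- log(n_docs / colSums(tf > 0))

# tf-idf matrix: one row per document, one column per term
tf_idf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of documents
norms <- sqrt(rowSums(tf_idf^2))
sim   <- (tf_idf %*% t(tf_idf)) / (norms %o% norms)
round(sim, 3)
```

In this full matrix the entry for items 1 and 2 is about 0.120, while every entry pairing item 3 with another item is exactly 0, so the pairs missing from the widyr output are precisely the ones whose similarity is zero.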
Comments (1)
I was unable to find documentation for this.
It is not "similarity == 0" that makes the rows disappear.
Words that are present in all items have idf = 0, hence their tf-idf is zero as well. So, if we add a "common" word, e.g. pink, to all three items:
Gives:
If we replace the "common" pink with the "unique" brown, such that the 3rd item has no common words with item 1 or item 2:
Gives:
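To illustrate just the idf point made above, here is a minimal base-R sketch. It is not the code from the two experiments: the `docs` list is a hypothetical version of the data with pink added to every item, and idf is assumed to be ln(number of documents / number of documents containing the term).

```r
# Hypothetical term lists: "pink" appears in all three items (stop words already removed)
docs <- list(`1` = c("red", "yellow", "pink"),
             `2` = c("yellow", "blue", "pink"),
             `3` = c("green", "black", "pink"))

n_docs <- length(docs)
vocab  <- sort(unique(unlist(docs)))

# Number of documents that contain each term
doc_freq <- vapply(vocab,
                   function(term) {
                     as.numeric(sum(vapply(docs, function(w) term %in% w, logical(1))))
                   },
                   numeric(1))

# idf = ln(n_docs / doc_freq); a term present in every document gets ln(1) = 0
idf <- log(n_docs / doc_freq)
idf
```

Because idf for pink is exactly 0, its tf-idf weight is 0 in every item, so a word shared by all items contributes nothing to the tf-idf overlap between any two of them.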