R: pairwise_similarity does not compare every row with every other row

Posted 2025-02-09 13:27:21

My understanding of the pairwise_similarity function in R was that it compared every item to every other.

So for example, if you had 3 text items:

  • Item 1 would be compared to item 2 and 3

  • Item 2 would be compared to item 1 and 3

  • Item 3 would be compared to item 1 and 2

However this does not seem to happen here:

Here is my data:

library(dplyr)  # attached so that the %>% pipe used below is available

d <- data.frame(column_id = 1:3, description = c("red and yellow", "yellow and blue", "green and black"))

d

 column_id     description
         1    red and yellow
         2    yellow and blue
         3    green and black   # notice how item 3 has no common words with the other two


# unnest the words and remove stop words 

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

# complete pairwise similarity

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 2 × 3

  item1 item2 similarity
    2     1      0.120
    1     2      0.120

Notice how item 3 is not compared to items 1 and 2. Why is this? If I add a word to item 3 that it shares with item 2, a few more comparisons appear, but still not all of them:

d <- data.frame(column_id = 1:3, description = c("red and yellow", "yellow and blue", "blue and black"))

d


column_id     description
        1     red and yellow
        2     yellow and blue
        3     blue and black

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 4 × 3
  item1 item2 similarity
    2     1      0.245
    1     2      0.245
    3     2      0.245   # 3 not compared to 1 at any point - why?
    2     3      0.245

Is my understanding of pairwise similarity lacking? Or is it that, by default, when two text chunks have no words in common (so their similarity is zero), the row is simply omitted? Does anyone know whether this is the explanation?


风居住的梦幻卍 2025-02-16 13:27:21

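I was unable to find documentation for this.

It is not "similarity == 0" that makes the rows disappear. Judging by the two experiments below, a pair seems to be dropped only when the two items share no word at all, even a word whose weight is zero.
Words that are present in all items have idf = 0, hence their tf-idf is zero as well. So, if we add a "common" word, e.g. pink, to all three items: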

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black pink"))   ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 6 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.120
2     3     1      0    
3     1     2      0.120
4     3     2      0    
5     1     3      0    
6     2     3      0   
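
The idf = 0 claim and the similarities in this table can be checked by hand. The following is a minimal base R sketch, not widyr's implementation: it assumes the natural-log definition idf = ln(number of items / number of items containing the word), builds a dense tf-idf matrix, and takes the cosine similarity of every pair of rows. The names `docs`, `vocab`, `tf`, `idf` and `sim` are made up for this illustration, and the stop word "and" is dropped by hand.

```r
# Items after stop-word removal (hypothetical stand-in for d_un_nest)
docs <- list(`1` = c("red", "yellow", "pink"),
             `2` = c("yellow", "blue", "pink"),
             `3` = c("green", "black", "pink"))

vocab <- sort(unique(unlist(docs)))

# Term frequency: one row per item, one column per word
tf <- t(vapply(docs,
               function(words) tabulate(match(words, vocab), nbins = length(vocab)) / length(words),
               numeric(length(vocab))))
colnames(tf) <- vocab

# idf = ln(number of items / number of items containing the word)
idf <- log(length(docs) / colSums(tf > 0))

# Weight each column by its idf; "pink" is in every item, so its weight is 0
tf_idf <- sweep(tf, 2, idf, `*`)

# Cosine similarity: scale each row to unit length, then take all dot products
normed <- tf_idf / sqrt(rowSums(tf_idf^2))
sim <- normed %*% t(normed)

idf[["pink"]]    # 0
round(sim, 3)    # sim["1", "2"] is about 0.120; every pair involving item 3 is 0
```

Because the matrix here is dense, every pair shows up, including those with similarity 0. That is consistent with the six rows above: the shared word pink carries no weight, yet it is enough for the pairs involving item 3 to be reported.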

If we replace the "common" pink in item 3 with the "unique" brown, so that the 3rd item has no words in common with item 1 or item 2:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black brown")) ### here

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 2 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.214
2     1     2      0.214
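
If every pair is needed in the result, one possible workaround is to build the full grid of item pairs and fill in 0 for the pairs that were left out. This is only a sketch in base R, under the assumption that a missing pair means the two items share no words and therefore have cosine similarity 0. The data frame below is typed in by hand to mimic the two-row result above, and `all_ids`, `grid` and `d_full` are names made up for the illustration.

```r
# Hand-typed stand-in for the two-row d_similarity shown above
d_similarity <- data.frame(item1 = c(2L, 1L),
                           item2 = c(1L, 2L),
                           similarity = c(0.214, 0.214))

all_ids <- 1:3   # every column_id present in the original data

# All ordered pairs of distinct items
grid <- expand.grid(item1 = all_ids, item2 = all_ids)
grid <- grid[grid$item1 != grid$item2, ]

# Left join onto the grid; pairs absent from d_similarity come back as NA
d_full <- merge(grid, d_similarity, by = c("item1", "item2"), all.x = TRUE)

# Treat the absent pairs as similarity 0 (the assumption stated above)
d_full$similarity[is.na(d_full$similarity)] <- 0

d_full   # 6 rows: two with 0.214, four with 0
```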