Cosine similarity of vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying:

#len(u)==201, len(v)==246

cosine_distance(u, v)
ValueError: objects are not aligned

#this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641

Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths.

I'm using this function:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

Also -- is the order of the tf_idf values in the vectors important? Should they be sorted -- or is it of no importance for this calculation?

Comments (3)

暖阳 2024-09-13 17:58:09

Are you computing the cosine similarity of term vectors? Term vectors should be the same length. If a word isn't present in a document, then that document's vector should have a value of 0 for that term.

I'm not exactly sure which vectors you're applying cosine similarity to, but for cosine similarity your vectors should always be the same length, and order very much does matter.

Example:

Term | Doc1 | Doc2
Foo  |  .3  |  .7
Bar  |   0  |   8
Baz  |   1  |   1

Here you have two vectors (.3,0,1) and (.7,8,1) and can compute the cosine similarity between them. If you compared (.3,1) and (.7,8) you'd be comparing the Doc1 score of Baz against the Doc2 score of Bar which wouldn't make sense.
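
As a quick check, here are the two document vectors from the table run through the cosine_distance function quoted in the question (numpy.dot accepts plain Python lists):

import math
import numpy

def cosine_distance(u, v):
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

doc1 = [0.3, 0, 1]  # Foo, Bar, Baz scores for Doc1
doc2 = [0.7, 8, 1]  # Foo, Bar, Baz scores for Doc2

# Each position holds the same term in both vectors, so the dot product
# pairs like with like; prints roughly 0.143.
print(cosine_distance(doc1, doc2))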

梦里的微风 2024-09-13 17:58:09

Try building the vectors before feeding them to the cosine_distance function:

from collections import Counter
from nltk import cluster

def buildVector(iterable1, iterable2):
    # Term frequencies for each document.
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    # Take the union of both vocabularies; iterating over the same set for
    # both documents yields two aligned, equal-length vectors (a Counter
    # returns 0 for terms the document doesn't contain).
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2


l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()

v1, v2 = buildVector(l1, l2)
print(cluster.util.cosine_distance(v1, v2))
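
(Note that nltk's cluster.util.cosine_distance returns a distance, i.e. 1 minus the cosine, whereas the cosine_distance in the question, despite its name, returns the cosine itself.) The same union-of-keys idea carries over to the asker's TF-IDF setting. Here is a sketch assuming each document's scores live in a dict mapping term to tf-idf value; the names and scores below are hypothetical, not from the question:

import math
import numpy

def build_tfidf_vectors(tfidf1, tfidf2):
    # Terms missing from a document get an explicit 0.0, so both vectors
    # come out the same length and aligned term by term.
    vocab = sorted(set(tfidf1) | set(tfidf2))
    u = [tfidf1.get(term, 0.0) for term in vocab]
    v = [tfidf2.get(term, 0.0) for term in vocab]
    return u, v

def cosine_similarity(u, v):
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

tfidf1 = {"fox": 0.8, "dog": 0.3}        # hypothetical tf-idf scores
tfidf2 = {"fox": 0.5, "runner": 0.9}
u, v = build_tfidf_vectors(tfidf1, tfidf2)
print(cosine_similarity(u, v))           # len(u) == len(v), no ValueError
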
献世佛 2024-09-13 17:58:08

You need to multiply the entries for corresponding words in the two vectors, so there should be a global order for the words. This means that in theory your vectors should be the same length.

In practice, if one document was seen before the other, words from the second document may have been added to the global order only after the first document was processed. So even though both vectors follow the same order, the first one may be shorter, since it has no entries for words that hadn't been seen yet.

Document 1: The quick brown fox jumped over the lazy dog.

Global order:     The quick brown fox jumped over the lazy dog
Vector for Doc 1:  1    1     1    1     1     1    1   1   1

Document 2: The runner was quick.

Global order:     The quick brown fox jumped over the lazy dog runner was
Vector for Doc 1:  1    1     1    1     1     1    1   1   1
Vector for Doc 2:  1    1     0    0     0     0    0   0   0    1     1

In this case, in theory you need to pad the Document 1 vector with zeroes at the end. In practice, when computing the dot product, you only need to multiply elements up to the end of the shorter vector: omitting the extra elements of Vector 2 gives exactly the same result as multiplying them by zero, and visiting them would just be slower.

Then you can compute the magnitude of each vector separately, and for that the vectors don't need to be of the same length.
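
A minimal sketch of that scheme, assuming both vectors follow the same global term order and the shorter one is simply missing trailing entries (the function name is made up for illustration):

import math
import numpy

def cosine_distance_ragged(u, v):
    # Truncate the dot product to the shorter vector: the tail of the
    # longer one would only ever be multiplied by implicit zero-padding.
    n = min(len(u), len(v))
    dot = numpy.dot(u[:n], v[:n])
    # The magnitudes use each full vector; equal lengths aren't needed here.
    return dot / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

u = [1, 1, 1, 1, 1, 1, 1, 1, 1]           # Doc 1 from the example above
v = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]     # Doc 2 from the example above
print(cosine_distance_ragged(u, v))       # 2 / (3 * 2) = 0.333...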
