Why can the cosine similarity between two vectors be negative?

Posted on 2024-11-18 20:35:05

I have two 11-dimensional vectors.

a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)

b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221, 
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

library(lsa)  # provides cosine()
cosine_sim <- cosine(a, b)
cosine_sim

which returns:

-0.05397935

I used cosine() from the lsa package.

For some values I am getting a negative cosine_sim, like the one above. I am not sure how the similarity can be negative; I expected it to be between 0 and 1.

Can anyone explain what is going on here?


Comments (4)

故人爱我别走 2024-11-25 20:35:05

The nice thing about R is that you can often dig into the functions and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.) then R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the bit wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is

crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))

Let's pull this out and apply it to your example:

> crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
            [,1]
[1,] -0.05397935
> crossprod(a)
     [,1]
[1,]    1
> crossprod(b)
     [,1]
[1,]    1

So you're using vectors that are already normalized, which means you only have crossprod(a, b) to look at. In your case this is equivalent to

> sum(a*b)
[1] -0.05397935

(for real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand).
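The key line above can be wrapped in a small base-R helper, which may be handy if lsa is not installed. This is a minimal sketch, not the lsa implementation: the name cosine_base is made up for illustration, and the function handles only two plain numeric vectors of equal length.

```r
# Hypothetical helper (not part of lsa): cosine of the angle between two
# numeric vectors, using the same formula as the key line shown above.
cosine_base <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  # drop() turns the 1x1 matrix returned by crossprod() into a plain number
  drop(crossprod(x, y) / sqrt(crossprod(x) * crossprod(y)))
}

cosine_base(c(3, 4), c(4, 3))    # 0.96: the vectors point in similar directions
cosine_base(c(3, 4), c(-4, 3))   # 0: the vectors are orthogonal
cosine_base(c(3, 4), c(-3, -4))  # -1: the vectors point in opposite directions
```

Unlike sum(a*b), this version does not assume the inputs are already normalized.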

As @Jack Maney's answer says, the dot product of two vectors (which is |a| * |b| * cos(theta), where |a| and |b| are the Euclidean norms, not R's length(), and theta is the angle between the vectors) can be negative ...
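One way to see this for the vectors in the question is to convert the cosine into an angle. The sketch below uses only base R; the variable names cos_ab and angle_deg are illustrative.

```r
a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)
b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221,
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

# Cosine of the angle between a and b (same formula as lsa's key line)
cos_ab <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Convert to degrees; a negative cosine means the angle exceeds 90 degrees
angle_deg <- acos(cos_ab) * 180 / pi
angle_deg  # a little over 90 degrees (roughly 93)
```

Both vectors contain negative components, so nothing confines the angle between them to the 0 to 90 degree range.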

For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)), at least when the columns of x are already normalized ...

Edit: in comments on a now-deleted answer below, I suggested that the square of the cosine similarity might be appropriate if one wants a similarity measure on [0,1]. This would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r). It might also be worth going back and thinking more carefully about the purpose/meaning of the similarity measures to be used ...
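As a rough illustration of the squaring idea, here is a sketch with two small made-up vectors (x and y are not from the question):

```r
# Illustrative vectors, chosen so the cosine is negative
x <- c(1, 0)
y <- c(-1, 1)

cos_xy <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))
cos_xy    # about -0.707 (the angle is 135 degrees)

# Squaring maps the value into [0, 1], much as r^2 does for a correlation r
cos_xy^2  # 0.5
```

Note that squaring discards the sign, so two vectors pointing in exactly opposite directions score 1, the same as two identical vectors. Whether that is acceptable depends on what the similarity is supposed to mean.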

执手闯天涯 2024-11-25 20:35:05

The cosine function returns

crossprod(a, b)/sqrt(crossprod(a) * crossprod(b))

In this case, both the terms in the denominator are 1, but crossprod(a, b) is -0.05.

生生不灭 2024-11-25 20:35:05

The cosine function can take on negative values.

嗼ふ静 2024-11-25 20:35:05

While the cosine of two vectors can take any value between -1 and +1, cosine similarity (in document retrieval) usually takes values in the [0,1] interval. The reason is simple: in the Word x Document matrix there are no negative values, so the maximum angle between two vectors is 90 degrees, for which the cosine is 0.
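A small sketch of that point, using a hypothetical term-document count matrix (the terms, documents, and counts are invented for illustration):

```r
# Hypothetical term counts: rows are terms, columns are documents.
# matrix() fills column by column, so each group of four values is one document.
tdm <- matrix(c(2, 0, 1, 0,
                0, 3, 1, 0,
                0, 0, 0, 4),
              nrow = 4,
              dimnames = list(c("cat", "dog", "fish", "tax"),
                              c("d1", "d2", "d3")))

# Pairwise cosine similarity between the document columns
norms <- sqrt(colSums(tdm^2))
sims <- crossprod(tdm) / outer(norms, norms)
sims

# With no negative entries, every dot product is >= 0, so every cosine is too
range(sims)
```

Documents d1 and d3 share no terms, so their similarity is exactly 0, which is the lowest value possible for non-negative counts. The vectors in the question contain negative entries, which is why their cosine can fall below 0.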
