Why can the cosine similarity between two vectors be negative?
I have two vectors with 11 dimensions.
a <- c(-0.012813841, -0.024518383, -0.002765056, 0.079496744, 0.063928973,
0.476156960, 0.122111977, 0.322930189, 0.400701256, 0.454048860,
0.525526219)
b <- c(0.64175768, 0.54625694, 0.40728261, 0.24819750, 0.09406221,
0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
-0.07215885)
cosine_sim <- cosine(a,b)
which returns:
-0.05397935
I used cosine() from the lsa package.
For some values I am getting a negative cosine_sim like the one above. I am not sure how the similarity can be negative; I expected it to be between 0 and 1.
Can anyone explain what is going on here?
Comments (4)
The nice thing about R is that you can often dig into the functions and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.) then R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the bit wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is
crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))
Pulling this out and applying it to your example shows that crossprod(a) and crossprod(b) are both 1 (to within rounding), while crossprod(a, b) is about -0.054.
So you're using vectors that are already normalized, and you just have crossprod(a, b) to look at. In your case this is equivalent to
sum(a*b)
(for real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand).
As @Jack Maney's answer says, the dot product of two vectors (which is length(a)*length(b)*cos(a,b), where "length" means the Euclidean norm rather than R's length() function) can be negative ...
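As a quick check of the point above, here is a minimal sketch in base R (no lsa needed) that evaluates the numerator and denominator of the cosine formula separately for the vectors from the question. The variable names (len2_a, num, den, and so on) are illustrative only and do not come from lsa.

```r
a <- c(-0.012813841, -0.024518383, -0.002765056, 0.079496744, 0.063928973,
       0.476156960, 0.122111977, 0.322930189, 0.400701256, 0.454048860,
       0.525526219)
b <- c(0.64175768, 0.54625694, 0.40728261, 0.24819750, 0.09406221,
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

# Squared Euclidean length of each vector. Both come out at about 1,
# so the inputs are already normalized.
len2_a <- drop(crossprod(a))
len2_b <- drop(crossprod(b))

# Numerator and denominator of the cosine formula.
num <- drop(crossprod(a, b))
den <- sqrt(len2_a * len2_b)
cos_ab <- num / den

# For plain vectors, crossprod(a, b) is just the dot product.
dot_ab <- sum(a * b)

print(c(len2_a = len2_a, len2_b = len2_b, cos_ab = cos_ab, dot_ab = dot_ab))
```

Since the denominator is essentially 1, the sign of the result is decided entirely by the dot product, which is negative here because the large entries of a line up with the negative entries of b.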
For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)) ...
Edit: in comments on a now-deleted answer below, I suggested that the square of the cosine-distance measure might be appropriate if one wants a similarity measure on [0,1] (this would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r)), but that it might also be worth going back and thinking more carefully about the purpose/meaning of the similarity measures to be used ...
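The two suggestions above can be sketched in base R as follows. This is only an illustration, not lsa's implementation: it assumes the columns are first scaled to unit length (an assumption added here, since crossprod(x) alone only gives cosines for normalized columns), and the names x_unit, sim, and cos_sq are hypothetical.

```r
a <- c(-0.012813841, -0.024518383, -0.002765056, 0.079496744, 0.063928973,
       0.476156960, 0.122111977, 0.322930189, 0.400701256, 0.454048860,
       0.525526219)
b <- c(0.64175768, 0.54625694, 0.40728261, 0.24819750, 0.09406221,
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

# One vector per column, as in the matrix branch described above.
x <- cbind(a, b)

# Assumption: scale every column to unit Euclidean length first.
x_unit <- sweep(x, 2, sqrt(colSums(x^2)), "/")

# Pairwise cosines of all columns in one matrix product.
sim <- crossprod(x_unit)

# Keep only the lower triangle, as as.dist() does for distance matrices.
sim_lower <- as.dist(sim)

# Squared cosine: always in [0, 1], but the sign (direction) is lost.
cos_sq <- sim["a", "b"]^2

print(sim)
print(cos_sq)
```

Squaring maps a cosine of -0.054 and a cosine of +0.054 to the same value, so whether that is acceptable depends on what the similarity is meant to express.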
The cosine function returns
crossprod(a, b)/sqrt(crossprod(a) * crossprod(b))
In this case, both terms in the denominator are 1, but crossprod(a, b) is -0.05.
The cosine function can take on negative values.
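To make this concrete, the following small base R sketch converts the value from the question back into an angle and lists the cosine for a few reference angles. The variable names are illustrative.

```r
# Cosine similarity reported in the question.
cos_ab <- -0.05397935

# Angle between the two vectors, in degrees. It comes out slightly above 90,
# which is exactly where the cosine turns negative.
angle_deg <- acos(cos_ab) * 180 / pi

# Cosine at a few reference angles: positive below 90 degrees,
# (numerically almost) zero at 90, negative above it.
angles <- c(0, 60, 90, 120, 180)
cosines <- cos(angles * pi / 180)

print(angle_deg)
print(data.frame(angle = angles, cosine = round(cosines, 3)))
```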
While the cosine of two vectors can take any value between -1 and +1, cosine similarity in document retrieval normally takes values in the [0,1] interval. The reason is simple: the Word x Document matrix contains no negative values, so every dot product is nonnegative and the maximum angle between two vectors is 90 degrees, for which the cosine is 0.
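A minimal sketch of that argument, using made-up term counts (the documents, counts, and the helper name cos_sim_sketch are all hypothetical and not taken from lsa):

```r
# Plain cosine similarity for two numeric vectors.
cos_sim_sketch <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Hypothetical term counts for three documents over a four-word vocabulary.
doc1 <- c(2, 1, 0, 0)
doc2 <- c(1, 0, 3, 0)
doc3 <- c(0, 0, 0, 4)

# doc1 and doc2 share the first term, so the similarity is positive.
s12 <- cos_sim_sketch(doc1, doc2)

# doc1 and doc3 share no terms: the vectors are orthogonal, similarity 0.
s13 <- cos_sim_sketch(doc1, doc3)

print(c(s12 = s12, s13 = s13))
```

With count vectors every product x[i] * y[i] is nonnegative, so the sum can never drop below 0. The vectors in the question contain negative entries, which is why that guarantee does not apply to them.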