Here are two very short texts to compare:
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:
me Julie loves Linda than more likes Jane
Now we count the number of times each of these words appears in each text (the first count is for the first text, the second for the second):
me 2 2
Jane 0 1
Julie 1 1
Linda 1 0
likes 0 1
loves 2 1
more 1 1
than 1 1
We are not interested in the words themselves though. We are interested only in those two vertical vectors of counts. For instance, there are two instances of 'me' in each text. We are going to decide how close these two texts are to each other by calculating one function of those two vectors, namely the cosine of the angle between them.
The two vectors are, again (in the same word order as the table above):
[2, 0, 1, 1, 0, 2, 1, 1]
[2, 1, 1, 0, 1, 1, 1, 1]
The cosine of the angle between them is about 0.822.
These vectors are 8-dimensional. A virtue of using cosine similarity is clearly that it converts a question that is beyond human ability to visualise into one that can be. In this case you can think of the result as an angle of about 35 degrees, which is some 'distance' from zero, i.e. from perfect agreement.
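As a quick check of that number, here is a minimal Python sketch that computes the 0.822 directly from the two count vectors listed above:
import math

# Count vectors in the table's word order: me, Jane, Julie, Linda, likes, loves, more, than
v1 = [2, 0, 1, 1, 0, 2, 1, 1]  # Julie loves me more than Linda loves me
v2 = [2, 1, 1, 0, 1, 1, 1, 1]  # Jane likes me more than Julie loves me

dot = sum(a * b for a, b in zip(v1, v2))
cos = dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))
print(cos)                            # ~0.822
print(math.degrees(math.acos(cos)))   # ~34.7 degrees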
I'm guessing you are more interested in getting some insight into "why" the cosine similarity works (why it provides a good indication of similarity), rather than "how" it is calculated (the specific operations used for the calculation). If your interest is in the latter, see the reference indicated by Daniel in this post, as well as a related SO Question.
To explain both the how and even more so the why, it is useful, at first, to simplify the problem and to work only in two dimensions. Once you get this in 2D, it is easier to think of it in three dimensions, and of course harder to imagine in many more dimensions, but by then we can use linear algebra to do the numeric calculations and also to help us think in terms of lines / vectors / "planes" / "spheres" in n dimensions, even though we can't draw these.
So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would represent this document (with regard to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.
VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe, they could have the same ratio of these references. A Document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!.
Now, less similar documents may also include references to both cities, but in different proportions. Maybe Doc2 would only cite Paris once and London seven times.
Back to our x-y plane, if we draw these hypothetical documents, we see that when they are VERY similar, their vectors overlap (though some vectors may be longer), and as they start to have less in common, these vectors start to diverge, to have a wider angle between them.
By measuring the angle between the vectors, we can get a good idea of their similarity, and to make things even easier, by taking the Cosine of this angle, we have a nice value in the 0-to-1 (or -1-to-1) range that is indicative of this similarity, depending on what we count and how. The smaller the angle, the bigger (closer to 1) the cosine value, and the higher the similarity.
At the extreme, if Doc1 only cites Paris and Doc2 only cites London, the documents have absolutely nothing in common. Doc1 would have its vector on the x-axis, Doc2 on the y-axis, the angle 90 degrees, Cosine 0. In this case we'd say that these documents are orthogonal to one another.
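Here is a small sketch of those 2D cases, using the hypothetical Paris/London counts mentioned above (the exact numbers are just for illustration):
import math

def cos_2d(p, q):
    # Cosine of the angle between two 2D vectors
    return (p[0] * q[0] + p[1] * q[1]) / (math.hypot(*p) * math.hypot(*q))

doc1 = (1, 4)        # 1 reference to Paris, 4 to London
doc2_same = (2, 8)   # same proportions, just a longer text
doc2_less = (1, 7)   # different proportions
only_paris = (1, 0)
only_london = (0, 1)

print(cos_2d(doc1, doc2_same))          # 1.0   -- vectors overlap
print(cos_2d(doc1, doc2_less))          # ~0.99 -- starting to diverge
print(cos_2d(only_paris, only_london))  # 0.0   -- orthogonal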
Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.
I'll wrap up just by saying a few words about the formula itself. As I've said, other references provide good information about the calculations.
First, in two dimensions. The formula for the Cosine of the angle between two vectors is derived from the trigonometric difference (between angle a and angle b):
cos(a - b) = cos(a) * cos(b) + sin(a) * sin(b)
This formula looks very similar to the dot product formula:
Vect1 . Vect2 = (x1 * x2) + (y1 * y2)
where cos(a) corresponds to the x value and sin(a) to the y value, for the first vector, etc. The only problem is that x, y, etc. are not exactly the cos and sin values, for these values need to be read on the unit circle. That's where the denominator of the formula kicks in: by dividing by the product of the lengths of these vectors, the x and y coordinates become normalized:
cos(angle between Vect1 and Vect2) = (Vect1 . Vect2) / (|Vect1| * |Vect2|)
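To see that numerically, here is a small check: for points on the unit circle the dot product really is cos(a - b), and for general vectors dividing by the lengths recovers the same cosine (the angles and vectors below are arbitrary examples):
import math

a, b = math.radians(30), math.radians(75)
u = (math.cos(a), math.sin(a))   # unit vector at angle a
v = (math.cos(b), math.sin(b))   # unit vector at angle b

print(u[0] * v[0] + u[1] * v[1])  # dot product of the unit vectors
print(math.cos(b - a))            # same value, ~0.7071

# For vectors that are not on the unit circle, divide by their lengths
p, q = (3.0, 1.0), (1.0, 2.0)
cos_pq = (p[0] * q[0] + p[1] * q[1]) / (math.hypot(*p) * math.hypot(*q))
print(cos_pq)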
This Python code is my quick and dirty attempt to implement the algorithm:
import math
from collections import Counter

def build_vector(iterable1, iterable2):
    # Count the words in each text and build two vectors over the union of all words
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def cosim(v1, v2):
    # Cosine similarity: dot product divided by the product of the magnitudes
    dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2))
    magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
    magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
    return dot_product / (magnitude1 * magnitude2)

l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()

v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))
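Note that l2 above carries a trailing extra token, 'or', so the printed value (roughly 0.78) comes out a little lower than the 0.822 computed earlier for the original pair of sentences.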
Simple Java code to calculate cosine similarity:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 *
 * @author Xiao Ma
 * mail : [email protected]
 *
 */
public class SimilarityUtil {

    public static double cosineTextSimilarity(String[] left, String[] right) {
        Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
        Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
        Set<String> uniqueSet = new HashSet<String>();
        Integer temp = null;
        for (String leftWord : left) {
            temp = leftWordCountMap.get(leftWord);
            if (temp == null) {
                leftWordCountMap.put(leftWord, 1);
                uniqueSet.add(leftWord);
            } else {
                leftWordCountMap.put(leftWord, temp + 1);
            }
        }
        for (String rightWord : right) {
            temp = rightWordCountMap.get(rightWord);
            if (temp == null) {
                rightWordCountMap.put(rightWord, 1);
                uniqueSet.add(rightWord);
            } else {
                rightWordCountMap.put(rightWord, temp + 1);
            }
        }
        int[] leftVector = new int[uniqueSet.size()];
        int[] rightVector = new int[uniqueSet.size()];
        int index = 0;
        Integer tempCount = 0;
        for (String uniqueWord : uniqueSet) {
            tempCount = leftWordCountMap.get(uniqueWord);
            leftVector[index] = tempCount == null ? 0 : tempCount;
            tempCount = rightWordCountMap.get(uniqueWord);
            rightVector[index] = tempCount == null ? 0 : tempCount;
            index++;
        }
        return cosineVectorSimilarity(leftVector, rightVector);
    }

    /**
     * The resulting similarity ranges from −1 meaning exactly opposite, to 1
     * meaning exactly the same, with 0 usually indicating independence, and
     * in-between values indicating intermediate similarity or dissimilarity.
     *
     * For text matching, the attribute vectors A and B are usually the term
     * frequency vectors of the documents. The cosine similarity can be seen as
     * a method of normalizing document length during comparison.
     *
     * In the case of information retrieval, the cosine similarity of two
     * documents will range from 0 to 1, since the term frequencies (tf-idf
     * weights) cannot be negative. The angle between two term frequency vectors
     * cannot be greater than 90°.
     *
     * @param leftVector
     * @param rightVector
     * @return
     */
    private static double cosineVectorSimilarity(int[] leftVector,
            int[] rightVector) {
        if (leftVector.length != rightVector.length)
            return 1;
        double dotProduct = 0;
        double leftNorm = 0;
        double rightNorm = 0;
        for (int i = 0; i < leftVector.length; i++) {
            dotProduct += leftVector[i] * rightVector[i];
            leftNorm += leftVector[i] * leftVector[i];
            rightNorm += rightVector[i] * rightVector[i];
        }
        double result = dotProduct
                / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
        return result;
    }

    public static void main(String[] args) {
        String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
                "loves", "me" };
        String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
                "loves", "me" };
        System.out.println(cosineTextSimilarity(left, right));
    }
}
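For the two example sentences in main, this prints approximately 0.8215, matching the value from the first answer.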
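Let me try explaining this in terms of Python code. Suppose we have two very short texts, and we want to see how close the query text "I am a boy scout" is to them by computing cosine similarity scores. The first step is to build a tfidf matrix for the texts. The original snippet for that step isn't shown here; a minimal sketch, assuming scikit-learn's TfidfVectorizer and assuming example texts chosen to be consistent with the vocabulary and scores reported below, would be:
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example texts, consistent with the vocabulary {'am', 'boy', 'girl'} shown below
text = ["I am a boy", "I am a girl"]

vectorizer = TfidfVectorizer()   # single-letter tokens like 'I' and 'a' get dropped
tfidf_matrix = vectorizer.fit_transform(text)
print(tfidf_matrix.toarray())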
Here, we get a tfidf matrix of 2 x 3 tfidf values, i.e. 2 documents/texts x 3 terms. This is our tfidf document-term matrix. Let's see what the 3 terms are by calling vectorizer.vocabulary_:
print(vectorizer.vocabulary_)
# output
{'am': 0, 'boy': 1, 'girl': 2}
This tells us that the 3 terms in our tfidf matrix are 'am', 'boy' and 'girl'. 'am' is at column 0, 'boy' is at column 1, and 'girl' is at column 2. The terms 'I' and 'a' have been removed by the vectorizer because they are stopwords.
Now we have our tfidf matrix, we want to compare our query text with our texts and see how close our query is to our texts. To do that, we can compute the cosine similarity scores of the query vs the tfidf matrix of the texts. But first, we need to compute the tfidf of our query:
query = ["I am a boy scout"]
query_tfidf = vectorizer.transform(query)
print(query_tfidf.toarray())
#output
array([[0.57973867, 0.81480247, 0. ]])
Here, we computed the tfidf of our query. Our query_tfidf is a vector of tfidf values [0.57973867, 0.81480247, 0. ], which we will use to compute our cosine similarity scores. Note that vectorizer.transform(query) does not select a row from the tfidf_matrix: it weights the query's terms using the vocabulary and idf values learned from the texts and then L2-normalises the result. The query happens to produce the same vector as row 1 of the tfidf_matrix because its only in-vocabulary terms are "am" (0.57973867) and "boy" (0.81480247), just like document/text 1 ("scout" is not in the vocabulary and is ignored).
After computing our query_tfidf, we can now matrix multiply or dot product our query_tfidf vector with our text tfidf_matrix to obtain the cosine similarity scores.
Recall that cosine similarity score or formula is equal to the following:
cosine similarity score = (A . B) / ||A|| ||B||
Here, A = our query_tfidf vector, and B = each row of our tfidf_matrix
Note that: A . B = A * B^T, or A dot product B = A multiply by B Transpose.
Knowing the formula, let's manually compute our cosine similarity scores for query_tfidf, then compare our answer with the values provided by the sklearn.metrics cosine_similarity function. Let's manually compute:
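The manual-computation snippet itself isn't reproduced here; one way to do it, as a sketch assuming numpy, is to apply the formula directly to the query vector and each row of the matrix:
import numpy as np

A = query_tfidf.toarray()    # shape (1, 3)
B = tfidf_matrix.toarray()   # shape (2, 3)

# (A . B) / (||A|| ||B||), row by row; the rows are already L2-normalised,
# so the denominators are all 1 here
manual_cosine_similarities = (A @ B.T) / (np.linalg.norm(A) * np.linalg.norm(B, axis=1))
print(manual_cosine_similarities)
# [[1.         0.33609693]]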
Our manually computed cosine similarity scores give values of [1.0, 0.33609692727625745]. Let's check our manually computed cosine similarity score with the answer value provided by the sklearn.metrics cosine_similarity function:
from sklearn.metrics.pairwise import cosine_similarity
function_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)
print(function_cosine_similarities)
#output
array([[1.0 , 0.33609693]])
The output values are the same! The manually computed cosine similarity values are the same as the function-computed cosine similarity values!
Hence, this simple explanation shows how the cosine similarity values are computed. Hope you found this explanation helpful.
Two vectors A and B exist in a 2D or 3D space; the angle between those vectors is what cosine similarity measures.
If the angle is large (it can reach at most 180 degrees), cos 180 = -1; the minimum angle is 0 degrees, and cos 0 = 1, which means the vectors are aligned with each other and hence similar.
cos 90 = 0, which is sufficient to conclude that the vectors A and B are not similar at all; and since word counts (term frequencies) cannot be negative, for text vectors the cosine values lie between 0 and 1. Hence, a larger angle implies lower similarity (visualising it also makes sense).
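A tiny Python illustration of those three cases, using arbitrary example vectors:
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine([1, 0], [2, 0]))   # 1.0  -> 0 degrees, aligned, similar
print(cosine([1, 0], [0, 3]))   # 0.0  -> 90 degrees, not similar at all
print(cosine([1, 0], [-1, 0]))  # -1.0 -> 180 degrees, exactly opposite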
Here's a simple Python code to calculate cosine similarity:
import math

def dot_prod(v1, v2):
    # Sum of the pairwise products
    ret = 0
    for i in range(len(v1)):
        ret += v1[i] * v2[i]
    return ret

def magnitude(v):
    # Euclidean length of the vector
    ret = 0
    for i in v:
        ret += i ** 2
    return math.sqrt(ret)

def cos_sim(v1, v2):
    return dot_prod(v1, v2) / (magnitude(v1) * magnitude(v2))
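For example, applied to the two word-count vectors from the first answer (this usage line is my addition, not part of the original snippet):
v1 = [2, 0, 1, 1, 0, 2, 1, 1]  # "Julie loves me more than Linda loves me"
v2 = [2, 1, 1, 0, 1, 1, 1, 1]  # "Jane likes me more than Julie loves me"
print(cos_sim(v1, v2))          # ~0.822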
For simplicity, I am reducing the vectors a and b. Then the cosine similarity (theta) works out to 0.5, and the inverse cosine of 0.5 is 60 degrees.
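The reduced vectors themselves aren't shown in this extract; as a hypothetical stand-in, any pair of vectors with cosine similarity 0.5 gives the same 60-degree angle, for example:
import math

a = [1, 1, 0]
b = [0, 1, 1]

cos_theta = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(cos_theta)                           # 0.5
print(math.degrees(math.acos(cos_theta)))  # 60.0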
Using @Bill Bell's example, there are two ways to do this in [R]: directly, or by taking advantage of crossprod() for performance.