Here are two very short texts to compare:
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:
me Julie loves Linda than more likes Jane
Now we count the number of times each of these words appears in each text (the first count is for the first text, the second for the second):
me 2 2
Jane 0 1
Julie 1 1
Linda 1 0
likes 0 1
loves 2 1
more 1 1
than 1 1
We are not interested in the words themselves though. We are interested only in those two vertical vectors of counts. For instance, there are two instances of 'me' in each text. We are going to decide how close these two texts are to each other by calculating one function of those two vectors, namely the cosine of the angle between them.
The two vectors are, again (in the same word order as the table above):
[2, 0, 1, 1, 0, 2, 1, 1]
[2, 1, 1, 0, 1, 1, 1, 1]
The cosine of the angle between them is about 0.822.
These vectors are 8-dimensional. A virtue of using cosine similarity is clearly that it converts a question that is beyond human ability to visualise into one that can be. In this case you can think of the result as an angle of about 35 degrees, which is some 'distance' from zero, i.e. from perfect agreement.
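As a quick check of that number, here is a minimal Python sketch that computes the 0.822 directly from the two count vectors listed above:
import math

# Count vectors in the table's word order: me, Jane, Julie, Linda, likes, loves, more, than
v1 = [2, 0, 1, 1, 0, 2, 1, 1]  # Julie loves me more than Linda loves me
v2 = [2, 1, 1, 0, 1, 1, 1, 1]  # Jane likes me more than Julie loves me

dot = sum(a * b for a, b in zip(v1, v2))
cos = dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))
print(cos)                            # ~0.822
print(math.degrees(math.acos(cos)))   # ~34.7 degrees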
I'm guessing you are more interested in getting some insight into "why" the cosine similarity works (why it provides a good indication of similarity), rather than "how" it is calculated (the specific operations used for the calculation). If your interest is in the latter, see the reference indicated by Daniel in this post, as well as a related SO Question.
To explain both the how and even more so the why, it is useful, at first, to simplify the problem and to work only in two dimensions. Once you get this in 2D, it is easier to think of it in three dimensions, and of course harder to imagine in many more dimensions, but by then we can use linear algebra to do the numeric calculations and also to help us think in terms of lines / vectors / "planes" / "spheres" in n dimensions, even though we can't draw these.
So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would represent this document (with regard to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.
VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe, they could have the same ratio of these references. A Document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!.
Now, less similar documents may also include references to both cities, but in different proportions. Maybe Doc2 would only cite Paris once and London seven times.
Back to our x-y plane, if we draw these hypothetical documents, we see that when they are VERY similar, their vectors overlap (though some vectors may be longer), and as they start to have less in common, these vectors start to diverge, to have a wider angle between them.
By measuring the angle between the vectors, we can get a good idea of their similarity, and to make things even easier, by taking the Cosine of this angle, we have a nice value in the 0-to-1 (or -1-to-1) range that is indicative of this similarity, depending on what we count and how. The smaller the angle, the bigger (closer to 1) the cosine value, and the higher the similarity.
At the extreme, if Doc1 only cites Paris and Doc2 only cites London, the documents have absolutely nothing in common. Doc1 would have its vector on the x-axis, Doc2 on the y-axis, the angle 90 degrees, Cosine 0. In this case we'd say that these documents are orthogonal to one another.
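Here is a small sketch of those 2D cases, using the hypothetical Paris/London counts mentioned above (the exact numbers are just for illustration):
import math

def cos_2d(p, q):
    # Cosine of the angle between two 2D vectors
    return (p[0] * q[0] + p[1] * q[1]) / (math.hypot(*p) * math.hypot(*q))

doc1 = (1, 4)        # 1 reference to Paris, 4 to London
doc2_same = (2, 8)   # same proportions, just a longer text
doc2_less = (1, 7)   # different proportions
only_paris = (1, 0)
only_london = (0, 1)

print(cos_2d(doc1, doc2_same))          # 1.0   -- vectors overlap
print(cos_2d(doc1, doc2_less))          # ~0.99 -- starting to diverge
print(cos_2d(only_paris, only_london))  # 0.0   -- orthogonal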
Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.
I'll wrap up just by saying a few words about the formula itself. As I've said, other references provide good information about the calculations.
First, in two dimensions. The formula for the Cosine of the angle between two vectors is derived from the trigonometric difference (between angle a and angle b):
cos(a - b) = cos(a) * cos(b) + sin(a) * sin(b)
This formula looks very similar to the dot product formula:
Vect1 . Vect2 = (x1 * x2) + (y1 * y2)
where cos(a) corresponds to the x value and sin(a) to the y value, for the first vector, etc. The only problem is that x, y, etc. are not exactly the cos and sin values, for these values need to be read on the unit circle. That's where the denominator of the formula kicks in: by dividing by the product of the lengths of these vectors, the x and y coordinates become normalized:
cos(angle between Vect1 and Vect2) = (Vect1 . Vect2) / (|Vect1| * |Vect2|)
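To see that numerically, here is a small check: for points on the unit circle the dot product really is cos(a - b), and for general vectors dividing by the lengths recovers the same cosine (the angles and vectors below are arbitrary examples):
import math

a, b = math.radians(30), math.radians(75)
u = (math.cos(a), math.sin(a))   # unit vector at angle a
v = (math.cos(b), math.sin(b))   # unit vector at angle b

print(u[0] * v[0] + u[1] * v[1])  # dot product of the unit vectors
print(math.cos(b - a))            # same value, ~0.7071

# For vectors that are not on the unit circle, divide by their lengths
p, q = (3.0, 1.0), (1.0, 2.0)
cos_pq = (p[0] * q[0] + p[1] * q[1]) / (math.hypot(*p) * math.hypot(*q))
print(cos_pq)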
This Python code is my quick and dirty attempt to implement the algorithm:
import math
from collections import Counter

def build_vector(iterable1, iterable2):
    # Count the words in each text and build two vectors over the union of all words
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def cosim(v1, v2):
    # Cosine similarity: dot product divided by the product of the magnitudes
    dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2))
    magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
    magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
    return dot_product / (magnitude1 * magnitude2)

l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()

v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))
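Note that l2 above carries a trailing extra token, 'or', so the printed value (roughly 0.78) comes out a little lower than the 0.822 computed earlier for the original pair of sentences.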
Simple Java code to calculate cosine similarity:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 *
 * @author Xiao Ma
 * mail : [email protected]
 *
 */
public class SimilarityUtil {

    public static double cosineTextSimilarity(String[] left, String[] right) {
        Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
        Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
        Set<String> uniqueSet = new HashSet<String>();
        Integer temp = null;
        for (String leftWord : left) {
            temp = leftWordCountMap.get(leftWord);
            if (temp == null) {
                leftWordCountMap.put(leftWord, 1);
                uniqueSet.add(leftWord);
            } else {
                leftWordCountMap.put(leftWord, temp + 1);
            }
        }
        for (String rightWord : right) {
            temp = rightWordCountMap.get(rightWord);
            if (temp == null) {
                rightWordCountMap.put(rightWord, 1);
                uniqueSet.add(rightWord);
            } else {
                rightWordCountMap.put(rightWord, temp + 1);
            }
        }
        int[] leftVector = new int[uniqueSet.size()];
        int[] rightVector = new int[uniqueSet.size()];
        int index = 0;
        Integer tempCount = 0;
        for (String uniqueWord : uniqueSet) {
            tempCount = leftWordCountMap.get(uniqueWord);
            leftVector[index] = tempCount == null ? 0 : tempCount;
            tempCount = rightWordCountMap.get(uniqueWord);
            rightVector[index] = tempCount == null ? 0 : tempCount;
            index++;
        }
        return cosineVectorSimilarity(leftVector, rightVector);
    }

    /**
     * The resulting similarity ranges from −1 meaning exactly opposite, to 1
     * meaning exactly the same, with 0 usually indicating independence, and
     * in-between values indicating intermediate similarity or dissimilarity.
     *
     * For text matching, the attribute vectors A and B are usually the term
     * frequency vectors of the documents. The cosine similarity can be seen as
     * a method of normalizing document length during comparison.
     *
     * In the case of information retrieval, the cosine similarity of two
     * documents will range from 0 to 1, since the term frequencies (tf-idf
     * weights) cannot be negative. The angle between two term frequency vectors
     * cannot be greater than 90°.
     *
     * @param leftVector
     * @param rightVector
     * @return
     */
    private static double cosineVectorSimilarity(int[] leftVector,
            int[] rightVector) {
        if (leftVector.length != rightVector.length)
            return 1;
        double dotProduct = 0;
        double leftNorm = 0;
        double rightNorm = 0;
        for (int i = 0; i < leftVector.length; i++) {
            dotProduct += leftVector[i] * rightVector[i];
            leftNorm += leftVector[i] * leftVector[i];
            rightNorm += rightVector[i] * rightVector[i];
        }
        double result = dotProduct
                / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
        return result;
    }

    public static void main(String[] args) {
        String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
                "loves", "me" };
        String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
                "loves", "me" };
        System.out.println(cosineTextSimilarity(left, right));
    }
}
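For the two example sentences in main, this prints approximately 0.8215, matching the value from the first answer.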
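Let me try explaining this in terms of Python code. Suppose we have two very short texts, and we want to see how close the query text "I am a boy scout" is to them by computing cosine similarity scores. The first step is to build a tfidf matrix for the texts. The original snippet for that step isn't shown here; a minimal sketch, assuming scikit-learn's TfidfVectorizer and assuming example texts chosen to be consistent with the vocabulary and scores reported below, would be:
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example texts, consistent with the vocabulary {'am', 'boy', 'girl'} shown below
text = ["I am a boy", "I am a girl"]

vectorizer = TfidfVectorizer()   # single-letter tokens like 'I' and 'a' get dropped
tfidf_matrix = vectorizer.fit_transform(text)
print(tfidf_matrix.toarray())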
Here, we get a tfidf matrix of 2 x 3 tfidf values, i.e. 2 documents/texts x 3 terms. This is our tfidf document-term matrix. Let's see what the 3 terms are by calling vectorizer.vocabulary_:
print(vectorizer.vocabulary_)
# output
{'am': 0, 'boy': 1, 'girl': 2}
This tells us that the 3 terms in our tfidf matrix are 'am', 'boy' and 'girl'. 'am' is at column 0, 'boy' is at column 1, and 'girl' is at column 2. The terms 'I' and 'a' have been removed by the vectorizer because they are stopwords.
Now we have our tfidf matrix, we want to compare our query text with our texts and see how close our query is to our texts. To do that, we can compute the cosine similarity scores of the query vs the tfidf matrix of the texts. But first, we need to compute the tfidf of our query:
query = ["I am a boy scout"]
query_tfidf = vectorizer.transform(query)
print(query_tfidf.toarray())
#output
array([[0.57973867, 0.81480247, 0. ]])
Here, we computed the tfidf of our query. Our query_tfidf is a vector of tfidf values [0.57973867, 0.81480247, 0. ], which we will use to compute our cosine similarity scores. Note that vectorizer.transform(query) does not select a row from the tfidf_matrix: it weights the query's terms using the vocabulary and idf values learned from the texts and then L2-normalises the result. The query happens to produce the same vector as row 1 of the tfidf_matrix because its only in-vocabulary terms are "am" (0.57973867) and "boy" (0.81480247), just like document/text 1 ("scout" is not in the vocabulary and is ignored).
After computing our query_tfidf, we can now matrix multiply or dot product our query_tfidf vector with our text tfidf_matrix to obtain the cosine similarity scores.
Recall that cosine similarity score or formula is equal to the following:
cosine similarity score = (A . B) / ||A|| ||B||
Here, A = our query_tfidf vector, and B = each row of our tfidf_matrix
Note that: A . B = A * B^T, or A dot product B = A multiply by B Transpose.
Knowing the formula, let's manually compute our cosine similarity scores for query_tfidf, then compare our answer with the values provided by the sklearn.metrics cosine_similarity function. Let's manually compute:
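The manual-computation snippet itself isn't reproduced here; one way to do it, as a sketch assuming numpy, is to apply the formula directly to the query vector and each row of the matrix:
import numpy as np

A = query_tfidf.toarray()    # shape (1, 3)
B = tfidf_matrix.toarray()   # shape (2, 3)

# (A . B) / (||A|| ||B||), row by row; the rows are already L2-normalised,
# so the denominators are all 1 here
manual_cosine_similarities = (A @ B.T) / (np.linalg.norm(A) * np.linalg.norm(B, axis=1))
print(manual_cosine_similarities)
# [[1.         0.33609693]]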
Our manually computed cosine similarity scores give values of [1.0, 0.33609692727625745]. Let's check our manually computed cosine similarity score with the answer value provided by the sklearn.metrics cosine_similarity function:
from sklearn.metrics.pairwise import cosine_similarity
function_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)
print(function_cosine_similarities)
#output
array([[1.0 , 0.33609693]])
The output values are the same! The manually computed cosine similarity values are the same as the function-computed cosine similarity values!
Hence, this simple explanation shows how the cosine similarity values are computed. Hope you found this explanation helpful.
Two vectors A and B exist in a 2D or 3D space; the angle between those vectors is what cosine similarity measures.
If the angle is large (it can reach at most 180 degrees), cos 180 = -1; the minimum angle is 0 degrees, and cos 0 = 1, which means the vectors are aligned with each other and hence similar.
cos 90 = 0, which is sufficient to conclude that the vectors A and B are not similar at all; and since word counts (term frequencies) cannot be negative, for text vectors the cosine values lie between 0 and 1. Hence, a larger angle implies lower similarity (visualising it also makes sense).
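A tiny Python illustration of those three cases, using arbitrary example vectors:
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine([1, 0], [2, 0]))   # 1.0  -> 0 degrees, aligned, similar
print(cosine([1, 0], [0, 3]))   # 0.0  -> 90 degrees, not similar at all
print(cosine([1, 0], [-1, 0]))  # -1.0 -> 180 degrees, exactly opposite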
Here's a simple Python code to calculate cosine similarity:
import math

def dot_prod(v1, v2):
    # Sum of the pairwise products
    ret = 0
    for i in range(len(v1)):
        ret += v1[i] * v2[i]
    return ret

def magnitude(v):
    # Euclidean length of the vector
    ret = 0
    for i in v:
        ret += i ** 2
    return math.sqrt(ret)

def cos_sim(v1, v2):
    return dot_prod(v1, v2) / (magnitude(v1) * magnitude(v2))
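For example, applied to the two word-count vectors from the first answer (this usage line is my addition, not part of the original snippet):
v1 = [2, 0, 1, 1, 0, 2, 1, 1]  # "Julie loves me more than Linda loves me"
v2 = [2, 1, 1, 0, 1, 1, 1, 1]  # "Jane likes me more than Julie loves me"
print(cos_sim(v1, v2))          # ~0.822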
For simplicity, I am reducing the vectors a and b. Then the cosine similarity (theta) works out to 0.5, and the inverse cosine of 0.5 is 60 degrees.
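The reduced vectors themselves aren't shown in this extract; as a hypothetical stand-in, any pair of vectors with cosine similarity 0.5 gives the same 60-degree angle, for example:
import math

a = [1, 1, 0]
b = [0, 1, 1]

cos_theta = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(cos_theta)                           # 0.5
print(math.degrees(math.acos(cos_theta)))  # 60.0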
Using @Bill Bell's example, there are two ways to do this in [R]: directly, or by taking advantage of crossprod() for performance.