Tanimoto score and when it is used
I read a Wikipedia article that describes the Jaccard index and explains the Tanimoto score as an extended Jaccard index, but what exactly does it try to do?
How is it different from other similarity scores?
When is it used?
Thank you
Comments (2)
I just read the Wikipedia article too, so I can only interpret the content for you.
The Jaccard score is used for vectors that take discrete values, most often binary values (1 or 0). The Tanimoto score is used for vectors that can take on continuous values. It is designed so that if the vector only takes values of 1 and 0, it works the same as Jaccard's.
I would imagine you would use Tanimoto's when you have a 'mixed' vector that has some continuous-valued parts and some binary-valued parts.
The Tanimoto score assumes that each data object is a vector of attributes. The attributes may or may not be binary. If they are all binary, the Tanimoto method reduces to the Jaccard method.
T(A, B) = (A · B) / (||A||^2 + ||B||^2 - A · B)
In the equation, A and B are data objects represented by vectors. The similarity score is the dot product of A and B divided by the sum of the squared magnitudes of A and B minus the dot product.
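To make the formula concrete, here is a minimal sketch in plain Python. The function names `tanimoto` and `jaccard_binary` are made up for illustration, and returning 1.0 for two all-zero vectors is an assumed convention, not something the answer specifies. It also checks that the vector form agrees with Jaccard on binary input.

```python
def tanimoto(a, b):
    """Tanimoto score: (a . b) / (|a|^2 + |b|^2 - a . b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denom = sq_a + sq_b - dot
    if denom == 0:
        # Only happens when both vectors are all zeros.
        # Assumption: treat two empty objects as identical.
        return 1.0
    return dot / denom


def jaccard_binary(a, b):
    """Jaccard index for 0/1 vectors: shared 1s over positions with any 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: same convention for two all-zero vectors.
    return both / either if either else 1.0


binary_a = [1, 0, 1, 1, 0]
binary_b = [1, 1, 1, 0, 0]
print(tanimoto(binary_a, binary_b))        # 0.5
print(jaccard_binary(binary_a, binary_b))  # 0.5, same as Tanimoto

# Continuous-valued attributes also work with the vector form.
print(tanimoto([0.5, 2.0, 0.0], [1.0, 1.5, 0.5]))
```

For the binary pair, the dot product counts the 2 shared attributes and each squared magnitude counts the 3 attributes set in that vector, so the score is 2 / (3 + 3 - 2) = 0.5, which is exactly intersection over union.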
There are various similarity scores available, but let's compare it with the most frequently used ones.
The Tanimoto coefficient is determined by comparing the number of attributes that are common to both data objects (the intersection) with the number of attributes that are present in either (the union).
The Dice coefficient is the number of attributes common to both data objects relative to the average number of attributes in the two objects, i.e.
|A intersect B| / ( 0.5 ( |A| + |B| ) )
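As a rough illustration of how the Dice ratio differs from intersection over union, here is a small sketch on Python sets. The helper names `dice` and `jaccard` and the example attribute sets are hypothetical, and returning 1.0 for two empty sets is an assumed convention.

```python
def dice(a, b):
    """Dice coefficient: shared attributes over the average set size."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty objects count as identical
    return len(a & b) / (0.5 * (len(a) + len(b)))


def jaccard(a, b):
    """Jaccard/Tanimoto on sets: intersection over union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: same convention as above
    return len(a & b) / len(a | b)


x = {"red", "round", "sweet"}
y = {"red", "round", "sour", "small"}

print(dice(x, y))     # 2 / 3.5, about 0.571
print(jaccard(x, y))  # 2 / 5 = 0.4
```

For a partial overlap like this one, Dice comes out higher than Jaccard because its denominator is the average size of the two sets rather than the size of their union.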
Finding the cosine similarity between two data objects requires that both objects represent their attributes as vectors. Similarity is then measured as the cosine of the angle between the two vectors.
You can also refer to "When can two objects have identical Tanimoto and Cosine score".
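A short sketch of the cosine measure, again in plain Python. The names `cosine_similarity` and `angle_degrees` are illustrative only, and raising an error for a zero vector is one possible choice, not the only one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between the vectors, recovered from the cosine."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(c))


print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal vectors)
print(angle_degrees([1, 0], [0, 1]))            # 90.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # approximately 1.0
```

The last pair points in the same direction, so the cosine is about 1 even though one vector is twice as long. By the dot-product formula, the Tanimoto score for the same pair is 28 / (14 + 56 - 28), about 0.67, because Tanimoto also penalizes the difference in magnitude.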
The Pearson coefficient is a more sophisticated approach to finding similarity. The method fits a "best fit" line between the attributes of two data objects. The Pearson coefficient is found using the following equation:
r(A, B) = cov(A, B) / (σ_A · σ_B)
where,
cov(A, B) --> Covariance of A and B
σ_A --> Standard deviation of A
σ_B --> Standard deviation of B
The coefficient is found by dividing the covariance by the product of the standard deviations of the attributes of the two data objects. It is more robust against data that isn't normalized. For example, if one person rated movies "a", "b", and "c" with scores of 1, 2, and 3 respectively, that person would have a perfect correlation with someone who rated the same movies 4, 5, and 6, because the two sets of ratings differ only by a constant offset.
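The movie example can be checked with a small sketch. The function name `pearson` is illustrative, and it uses population (divide by n) covariance and standard deviations; the sample versions would give the same ratio.

```python
import math


def pearson(a, b):
    """Pearson coefficient: cov(a, b) / (std(a) * std(b))."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        raise ValueError("undefined when one sequence is constant")
    return cov / (std_a * std_b)


ratings_1 = [1, 2, 3]  # movies "a", "b", "c"
ratings_2 = [4, 5, 6]  # same movies, every rating shifted up by 3

print(pearson(ratings_1, ratings_2))  # approximately 1.0
print(pearson(ratings_1, [3, 2, 1]))  # approximately -1.0
```

Subtracting the means removes the constant offset between the two raters, which is why the shifted ratings still correlate perfectly.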
For more information on the Tanimoto score versus other similarity scores and coefficients, you can refer to:
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?
The Tanimoto score can be used in both of the following situations:
The following applications make extensive use of the Tanimoto score: