Tanimoto score and when to use it

Posted 2024-10-09 19:34:05

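I read a wiki article that describes the Jaccard index and explains the Tanimoto score as an extended Jaccard index, but what exactly does it try to do?

How is it different from other similarity scores?

When is it used?

Thanks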


伴我老 2024-10-16 19:34:05

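I just read the Wikipedia article too, so I can only interpret its content for you.

The Jaccard score is meant for vectors that take discrete values, most commonly binary values (1 or 0). The Tanimoto score is meant for vectors that can take continuous values. It is designed so that, if the vectors only take the values 1 and 0, it works the same as Jaccard.

I would imagine you would use Tanimoto when you have a "mixed" vector with some continuous-valued parts and some binary-valued parts.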
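As a rough check of the claim that the two scores agree on 0/1 vectors, here is a minimal sketch in plain Python. The function names `jaccard` and `tanimoto` and the sample vectors are illustrative choices, and returning 1.0 for two empty or all-zero inputs is an assumption rather than part of either definition.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def tanimoto(x, y) -> float:
    """Tanimoto score: A·B / (||A||^2 + ||B||^2 - A·B)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = sum(p * p for p in x)
    norm_y = sum(q * q for q in y)
    denom = norm_x + norm_y - dot
    if denom == 0:
        return 1.0  # assumption: two all-zero vectors count as identical
    return dot / denom


# Two binary vectors and the sets of positions where each one holds a 1.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
set_x = {i for i, v in enumerate(x) if v}
set_y = {i for i, v in enumerate(y) if v}

print(jaccard(set_x, set_y))  # 0.5 (2 shared positions out of 4 in the union)
print(tanimoto(x, y))         # 0.5 (dot = 2, squared norms = 3 and 3)
```

For 0/1 vectors the dot product counts the shared 1s and each squared norm counts the 1s in one vector, which is why the denominator becomes the size of the union.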


西瑶 2024-10-16 19:34:05

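What exactly does it try to do?

The Tanimoto score assumes that each data object is a vector of attributes. The attributes may or may not be binary. If they are all binary, the Tanimoto method reduces to the Jaccard method.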

T(A,B) = A·B / (||A||^2 + ||B||^2 - A·B)

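In the equation, A and B are data objects represented as vectors. The similarity score is the dot product of A and B divided by the sum of the squared magnitudes of A and B minus the dot product.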
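The formula can be tried on continuous-valued vectors with a short sketch like the one below. The name `tanimoto` and the sample vectors are illustrative, and raising an error when both vectors are all zeros is an assumption, since the formula is undefined there.

```python
def tanimoto(a, b) -> float:
    """Tanimoto score: A·B / (||A||^2 + ||B||^2 - A·B)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    sq_a = sum(p * p for p in a)  # squared magnitude of A
    sq_b = sum(q * q for q in b)  # squared magnitude of B
    denom = sq_a + sq_b - dot
    if denom == 0:
        # Only happens when both vectors are all zeros (assumed to be an error).
        raise ValueError("Tanimoto score is undefined for two zero vectors")
    return dot / denom


same = tanimoto([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical vectors
orthogonal = tanimoto([1.0, 0.0], [0.0, 1.0])       # nothing in common
mixed = tanimoto([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # 10 / (14 + 14 - 10)

print(same)        # 1.0
print(orthogonal)  # 0.0
print(mixed)       # 10/18, roughly 0.556
```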

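How is it different from other similarity scores?

Tanimoto vs. Jaccard: if the attributes are binary, Tanimoto reduces to the Jaccard index.

Many similarity scores are available, but let's compare it with the most frequently used ones.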

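Tanimoto vs. Dice:

The Tanimoto coefficient is determined by comparing the number of attributes common to both data objects (the intersection) with the number of attributes present in either one (the union).

The Dice coefficient is the number of attributes common to both data objects relative to the average number of attributes in the two objects, i.e.
|A ∩ B| / (0.5 (|A| + |B|))

D(A,B) = A·B / (0.5 (||A||^2 + ||B||^2))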
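To make the comparison concrete, the sketch below computes both coefficients for one pair of binary vectors. The function names and sample data are illustrative. The identity D = 2T / (1 + T) used in the last line is not stated in the answer; it follows from substituting one formula into the other.

```python
def dot(a, b) -> float:
    return sum(p * q for p, q in zip(a, b))


def tanimoto(a, b) -> float:
    """A·B / (||A||^2 + ||B||^2 - A·B); assumes A and B are not both zero."""
    d = dot(a, b)
    return d / (dot(a, a) + dot(b, b) - d)


def dice(a, b) -> float:
    """A·B / (0.5 (||A||^2 + ||B||^2)); assumes A and B are not both zero."""
    return dot(a, b) / (0.5 * (dot(a, a) + dot(b, b)))


a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]

t = tanimoto(a, b)  # 2 / (3 + 3 - 2)
d = dice(a, b)      # 2 / (0.5 * (3 + 3))

print(t)                # 0.5
print(d)                # 2/3, roughly 0.667
print(2 * t / (1 + t))  # same value as d, up to rounding
```

Because one score is a monotonic function of the other, both rank pairs of objects in the same order; Dice simply gives the larger number for any partial overlap.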

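Tanimoto vs. Cosine:

Cosine similarity requires both data objects to represent their attributes as vectors. Similarity is then measured as the cosine of the angle θ between the two vectors.

Cos(θ) = A·B / (||A|| ||B||)

You can also refer to: When can two objects have identical Tanimoto and Cosine score.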
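One practical difference is that cosine similarity looks only at direction, while the Tanimoto score also reacts to magnitude. The sketch below illustrates this on a vector and a scaled copy of it. The function names and the sample vectors are illustrative, and rejecting zero vectors is an assumption.

```python
import math


def dot(a, b) -> float:
    return sum(p * q for p, q in zip(a, b))


def cosine(a, b) -> float:
    """A·B / (||A|| ||B||)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


def tanimoto(a, b) -> float:
    """A·B / (||A||^2 + ||B||^2 - A·B); assumes A and B are not both zero."""
    d = dot(a, b)
    return d / (dot(a, a) + dot(b, b) - d)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

cos_ab = cosine(a, b)    # angle is zero, so this is 1 up to rounding
tan_ab = tanimoto(a, b)  # 28 / (14 + 56 - 28) = 2/3

print(cos_ab)
print(tan_ab)
```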

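Tanimoto vs. Pearson:

The Pearson coefficient is a more involved approach to finding similarity. It measures how well a "best fit" line describes the relationship between the attributes of two data objects. It is computed with the following equation:

p(A,B) = cov(A,B) / (σA σB)

where
cov(A,B) --> covariance of A and B

σA --> standard deviation of A

σB --> standard deviation of B

The coefficient is obtained by dividing the covariance by the product of the standard deviations of the attributes of the two data objects. It is more robust against data that isn't normalized. For example, if one person rated movies "a", "b", and "c" with scores of 1, 2, and 3 respectively, he would have a perfect correlation with someone who rated the same movies 4, 5, and 6.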
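The movie-rating example can be reproduced with a small sketch. The `pearson` helper below is an illustrative hand-written version, not taken from the answer; it skips the 1/n factors in the covariance and standard deviations because they cancel in the ratio. The Tanimoto score of the same two rating vectors is computed for contrast.

```python
import math


def pearson(a, b) -> float:
    """Pearson coefficient: cov(A,B) / (σA σB)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson coefficient is undefined for constant data")
    return cov / math.sqrt(var_a * var_b)


def tanimoto(a, b) -> float:
    """A·B / (||A||^2 + ||B||^2 - A·B); assumes A and B are not both zero."""
    d = sum(p * q for p, q in zip(a, b))
    return d / (sum(p * p for p in a) + sum(q * q for q in b) - d)


ratings_1 = [1, 2, 3]  # scores for movies "a", "b", "c"
ratings_2 = [4, 5, 6]  # same pattern, shifted up by 3

r = pearson(ratings_1, ratings_2)
t = tanimoto(ratings_1, ratings_2)  # 32 / (14 + 77 - 32)

print(r)  # 1.0
print(t)  # 32/59, roughly 0.542
```

The constant offset between the two raters disappears once the means are subtracted, which is why Pearson reports a perfect match while Tanimoto does not.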

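For more information on the Tanimoto score versus other similarity scores/coefficients, you can refer to:
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?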

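When is it used?

The Tanimoto score can be used in both situations:

When the attributes are binary
When the attributes are non-binary

The following applications make extensive use of the Tanimoto score:

Chemoinformatics
Clustering
Plagiarism detection
Automatic thesaurus extraction
Visualizing high-dimensional datasets
Analyzing market-basket transactional data
Detecting anomalies in spatio-temporal data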
