How do you convert between a measure of similarity and a measure of dissimilarity (distance)?
Is there a general way to convert between a measure of similarity and a measure of distance?
Consider a similarity measure like the number of 2-grams that two strings have in common.
2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?
This is just an example... I'm looking for a general solution, if one exists. For example, how do you go from Levenshtein distance to a measure of similarity?
I appreciate any guidance you may offer.
Let d denote distance and s denote similarity. To convert a distance measure to a similarity measure, first normalize d to [0, 1] using d_norm = d / max(d). The similarity measure is then given by:
s = 1 - d_norm
where s is in the range [0, 1], with 1 denoting the highest similarity (the items being compared are identical) and 0 denoting the lowest similarity (the largest distance).
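A minimal sketch of this normalize-and-flip conversion, applied to a list of pairwise distances:

```python
def distance_to_similarity(distances):
    """Normalize distances to [0, 1] and flip them into similarities."""
    d_max = max(distances)
    return [1 - d / d_max for d in distances]

# Distance 0 maps to similarity 1; the largest distance maps to similarity 0.
print(distance_to_similarity([0, 2, 4]))  # [1.0, 0.5, 0.0]
```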
If your similarity measure (s) is between 0 and 1, you can use one of these:
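A few standard conversions of this kind for a similarity s in (0, 1] (these are illustrative examples, not necessarily the exact formulas the answer originally listed):

```python
import math

def one_minus(s):
    """d = 1 - s: bounded in [0, 1)."""
    return 1 - s

def ratio(s):
    """d = (1 - s) / s: grows without bound as s approaches 0."""
    return (1 - s) / s

def neg_log(s):
    """d = -ln(s): grows without bound as s approaches 0."""
    return -math.log(s)

for f in (one_minus, ratio, neg_log):
    print(f.__name__, f(0.5))
```

All three are strictly decreasing in s and map a similarity of 1 (identical items) to a distance of 0.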
Doing 1/similarity is not going to keep the properties of the distribution.
The best way is:
distance(a→b) = highest_similarity - similarity(a→b)
with highest_similarity being the similarity with the biggest value. You hence flip your distribution: the highest similarity becomes a distance of 0, and so on.
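A sketch of this flip: subtract each similarity from the maximum, so the most similar pair ends up at distance 0.

```python
def similarities_to_distances(sims):
    """Flip a similarity distribution into distances by subtracting from the max."""
    s_max = max(sims)
    return [s_max - s for s in sims]

print(similarities_to_distances([1, 4, 3]))  # [3, 0, 1]
```

Unlike 1/similarity, this is a linear shift, so the shape (spacing) of the distribution is preserved.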
Yes, there is a most general way to convert between similarity and distance: a strictly monotone decreasing function f(x). That is, with f(x) you can make similarity = f(distance) or distance = f(similarity). It works in both directions. Such a function works because of the relation between similarity and distance: one decreases when the other increases.
Examples: these are some well-known strictly monotone decreasing candidates that work for non-negative similarities or distances:
f(x) = 1 / (a + x)
f(x) = exp(-x^a)
f(x) = arccot(ax)
You can choose the parameter a > 0 (e.g., a = 1).
A very practical approach is to use the function sim2diss belonging to the statistical software R. This function provides up to 13 methods to compute dissimilarities from similarities. Sadly, the methods are not explained at all: you have to look into the code :-\
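The three candidate functions can be sketched directly; each is strictly decreasing for non-negative x, so it can convert in either direction (a is fixed at 1 here):

```python
import math

def inverse(x, a=1.0):
    """f(x) = 1 / (a + x): maps x = 0 to 1, decreasing toward 0."""
    return 1.0 / (a + x)

def exp_decay(x, a=1.0):
    """f(x) = exp(-x^a): maps x = 0 to 1, decreasing toward 0."""
    return math.exp(-x ** a)

def arccot(x, a=1.0):
    """f(x) = arccot(a*x), via atan2 so that x = 0 safely gives pi/2."""
    return math.atan2(1.0, a * x)

for f in (inverse, exp_decay, arccot):
    print(f.__name__, f(0.0), f(2.0))
```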
And watch out for difference = 0.
According to scikit learn:
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance and S be the kernel; the scikit-learn user guide suggests, for example:
S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
S = 1. / (D / np.max(D))
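A sketch of the exponential variant in plain Python (gamma = 1 here; for feature vectors, 1 / num_features is one heuristic):

```python
import math

def kernel_from_distances(D, gamma=1.0):
    """Apply S = exp(-d * gamma) elementwise to a distance matrix D."""
    return [[math.exp(-d * gamma) for d in row] for row in D]

D = [[0.0, 2.0], [2.0, 0.0]]
for row in kernel_from_distances(D):
    print([round(v, 4) for v in row])
```

Zero distance maps to a kernel value of 1 (maximal similarity), and the kernel decays toward 0 as the distance grows.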
In the case of Levenshtein distance, you could increase the similarity score by 1 every time the sequences match; that is, 1 for every position where you didn't need a deletion, insertion, or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
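One way to realize this idea, as a sketch: compute the edit distance, then take the alignment length (the longer string) minus the edits as a rough count of matched characters.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Roughly the number of matched characters: alignment length minus edits."""
    return max(len(a), len(b)) - levenshtein(a, b)

print(levenshtein("apple", "dappled"))             # 2
print(levenshtein_similarity("apple", "dappled"))  # 5
```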
In one of my projects (based on collaborative filtering) I had to convert correlation (the cosine between vectors), which ranges from -1 to 1 (closer to 1 means more similar, closer to -1 means more diverse), to a normalized distance (close to 0 means the distance is smaller, close to 1 means it is bigger).
In this case: distance ~ diversity
My formula was:
dist = 1 - (cor + 1)/2
If you are converting between similarity and diversity and the domain is [0,1] in both cases, the simplest way is:
dist = 1 - sim
sim = 1 - dist
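The correlation formula above in code: rescale [-1, 1] to [0, 1], then flip it into a distance.

```python
def correlation_to_distance(cor):
    """Map a correlation in [-1, 1] to a normalized distance in [0, 1]."""
    return 1 - (cor + 1) / 2

print(correlation_to_distance(1.0))   # 0.0 (most similar)
print(correlation_to_distance(-1.0))  # 1.0 (most diverse)
print(correlation_to_distance(0.0))   # 0.5
```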
Cosine similarity is widely used for n-gram count or TFIDF vectors.
Cosine similarity can be used to compute a formal distance metric, according to Wikipedia. It obeys all the properties of a distance that you would expect (symmetry, non-negativity, etc.).
Both of these metrics range between 0 and 1.
If you have a tokenizer that produces n-grams from a string, you could use these metrics like this:
I found the elegant inner product of Counter in this SO answer.
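A sketch of that idea using collections.Counter over character bigrams; cosine distance here is taken as 1 - cosine similarity (the exact metric variants shown in the original answer are not preserved):

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Count the n-grams in a string as a Counter (a sparse count vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a)  # inner product: Counters return 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# 'apple' and 'dappled' share 4 bigrams (ap, pp, pl, le), as in the question.
x, y = ngrams("apple"), ngrams("dappled")
print(round(cosine_similarity(x, y), 3))  # 0.816
print(round(cosine_distance(x, y), 3))    # 0.184
```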