How do you convert between a measure of similarity and a measure of dissimilarity (distance)?
Is there a general way to convert between a measure of similarity and a measure of distance?
Consider a similarity measure like the number of 2-grams that two strings have in common.
2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?
This is just an example... I'm looking for a general solution, if one exists. For example, how do you go from Levenshtein distance to a measure of similarity?
I appreciate any guidance you may offer.
Let d denote distance and s denote similarity. To convert a distance measure to a similarity measure, first normalize d to [0, 1] using d_norm = d / max(d). The similarity measure is then given by:
s = 1 - d_norm
where s is in the range [0, 1], with 1 denoting the highest similarity (the items being compared are identical) and 0 denoting the lowest similarity (the largest distance).
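A minimal sketch of this normalize-and-flip conversion, applied to a list of pairwise distances:

```python
def distance_to_similarity(distances):
    """Normalize distances to [0, 1] and flip them into similarities."""
    d_max = max(distances)
    return [1 - d / d_max for d in distances]

# Distance 0 maps to similarity 1; the largest distance maps to similarity 0.
print(distance_to_similarity([0, 2, 4]))  # [1.0, 0.5, 0.0]
```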
If your similarity measure (s) is between 0 and 1, you can use one of these:
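A few standard conversions of this kind for a similarity s in (0, 1] (these are illustrative examples, not necessarily the exact formulas the answer originally listed):

```python
import math

def one_minus(s):
    """d = 1 - s: bounded in [0, 1)."""
    return 1 - s

def ratio(s):
    """d = (1 - s) / s: grows without bound as s approaches 0."""
    return (1 - s) / s

def neg_log(s):
    """d = -ln(s): grows without bound as s approaches 0."""
    return -math.log(s)

for f in (one_minus, ratio, neg_log):
    print(f.__name__, f(0.5))
```

All three are strictly decreasing in s and map a similarity of 1 (identical items) to a distance of 0.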
Doing 1/similarity is not going to keep the properties of the distribution.
The best way is:
distance(a→b) = highest_similarity - similarity(a→b)
with highest_similarity being the similarity with the biggest value. You hence flip your distribution: the highest similarity becomes a distance of 0, and so on.
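A sketch of this flip: subtract each similarity from the maximum, so the most similar pair ends up at distance 0.

```python
def similarities_to_distances(sims):
    """Flip a similarity distribution into distances by subtracting from the max."""
    s_max = max(sims)
    return [s_max - s for s in sims]

print(similarities_to_distances([1, 4, 3]))  # [3, 0, 1]
```

Unlike 1/similarity, this is a linear shift, so the shape (spacing) of the distribution is preserved.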
Yes, there is a most general way to convert between similarity and distance: a strictly monotone decreasing function f(x). That is, with f(x) you can make similarity = f(distance) or distance = f(similarity). It works in both directions. Such a function works because of the relation between similarity and distance: one decreases when the other increases.
Examples: these are some well-known strictly monotone decreasing candidates that work for non-negative similarities or distances:
f(x) = 1 / (a + x)
f(x) = exp(-x^a)
f(x) = arccot(ax)
You can choose the parameter a > 0 (e.g., a = 1).
A very practical approach is to use the function sim2diss belonging to the statistical software R. This function provides up to 13 methods to compute dissimilarities from similarities. Sadly, the methods are not explained at all: you have to look into the code :-\
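The three candidate functions can be sketched directly; each is strictly decreasing for non-negative x, so it can convert in either direction (a is fixed at 1 here):

```python
import math

def inverse(x, a=1.0):
    """f(x) = 1 / (a + x): maps x = 0 to 1, decreasing toward 0."""
    return 1.0 / (a + x)

def exp_decay(x, a=1.0):
    """f(x) = exp(-x^a): maps x = 0 to 1, decreasing toward 0."""
    return math.exp(-x ** a)

def arccot(x, a=1.0):
    """f(x) = arccot(a*x), via atan2 so that x = 0 safely gives pi/2."""
    return math.atan2(1.0, a * x)

for f in (inverse, exp_decay, arccot):
    print(f.__name__, f(0.0), f(2.0))
```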
And watch out for difference = 0.
According to scikit learn:
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance and S be the kernel; the scikit-learn user guide suggests, for example:
S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
S = 1. / (D / np.max(D))
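A sketch of the exponential variant in plain Python (gamma = 1 here; for feature vectors, 1 / num_features is one heuristic):

```python
import math

def kernel_from_distances(D, gamma=1.0):
    """Apply S = exp(-d * gamma) elementwise to a distance matrix D."""
    return [[math.exp(-d * gamma) for d in row] for row in D]

D = [[0.0, 2.0], [2.0, 0.0]]
for row in kernel_from_distances(D):
    print([round(v, 4) for v in row])
```

Zero distance maps to a kernel value of 1 (maximal similarity), and the kernel decays toward 0 as the distance grows.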
In the case of Levenshtein distance, you could increase the similarity score by 1 every time the sequences match; that is, 1 for every position where you didn't need a deletion, insertion, or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
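One way to realize this idea, as a sketch: compute the edit distance, then take the alignment length (the longer string) minus the edits as a rough count of matched characters.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Roughly the number of matched characters: alignment length minus edits."""
    return max(len(a), len(b)) - levenshtein(a, b)

print(levenshtein("apple", "dappled"))             # 2
print(levenshtein_similarity("apple", "dappled"))  # 5
```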
In one of my projects (based on collaborative filtering) I had to convert correlation (the cosine between vectors), which ranges from -1 to 1 (closer to 1 means more similar, closer to -1 means more diverse), to a normalized distance (close to 0 means the distance is smaller, close to 1 means it is bigger).
In this case: distance ~ diversity
My formula was:
dist = 1 - (cor + 1)/2
If you are converting between similarity and diversity and the domain is [0,1] in both cases, the simplest way is:
dist = 1 - sim
sim = 1 - dist
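The correlation formula above in code: rescale [-1, 1] to [0, 1], then flip it into a distance.

```python
def correlation_to_distance(cor):
    """Map a correlation in [-1, 1] to a normalized distance in [0, 1]."""
    return 1 - (cor + 1) / 2

print(correlation_to_distance(1.0))   # 0.0 (most similar)
print(correlation_to_distance(-1.0))  # 1.0 (most diverse)
print(correlation_to_distance(0.0))   # 0.5
```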
Cosine similarity is widely used for n-gram count or TFIDF vectors.
Cosine similarity can be used to compute a formal distance metric, according to Wikipedia. It obeys all the properties of a distance that you would expect (symmetry, non-negativity, etc.).
Both of these metrics range between 0 and 1.
If you have a tokenizer that produces n-grams from a string, you could use these metrics like this:
I found the elegant inner product of Counter in this SO answer.
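A sketch of that idea using collections.Counter over character bigrams; cosine distance here is taken as 1 - cosine similarity (the exact metric variants shown in the original answer are not preserved):

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Count the n-grams in a string as a Counter (a sparse count vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a)  # inner product: Counters return 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# 'apple' and 'dappled' share 4 bigrams (ap, pp, pl, le), as in the question.
x, y = ngrams("apple"), ngrams("dappled")
print(round(cosine_similarity(x, y), 3))  # 0.816
print(round(cosine_distance(x, y), 3))    # 0.184
```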