Calculating a relative Levenshtein distance - does it make sense?

Posted 2024-09-26 05:57:19

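I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Is Levenshtein distance supposed to be used as an absolute value? If I have a 20-letter word, a distance of 4 is not so bad. If the word has 4 letters...

What I am now doing is calculating distance / length to get a distance that better reflects what percentage of the word has been changed.

Is that a valid/proven approach? Or is it plain stupid?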

苄①跕圉湢 2024-10-03 05:57:19

Is Levenshtein distance supposed to be used as an absolute value?

It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than in distance per se.)
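To make the length point concrete, here is a minimal Python sketch, not the OP's code. The names `levenshtein` and `relative_distance` are chosen for illustration, and dividing by the longer of the two lengths is an assumption: the question only says "distance / length" without saying which length. Plain Levenshtein is used for brevity; a Damerau-Levenshtein implementation could be normalised the same way.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance
    (insertions, deletions, substitutions, each costing 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Distance divided by the longer length (an assumed normalisation).

    The raw distance never exceeds the longer length,
    so the result lies in [0, 1].
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

With this normalisation, 4 substitutions in a 20-letter word give 0.2, while a 4-letter word with every letter changed gives 1.0, which matches the intuition in the question.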

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?

Are you doing spell-checking, or conforming invalid input to a known set of values?
What are your priorities?

Minimize false positives (try to make sure all suggested words are very "similar", and the list of suggestions is short)
Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
Maximize average matching accuracy

You might end up using the Levenshtein distance in one way to decide whether a word should be offered in a suggestion list, and in another way to decide how to order that list.
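As a rough sketch of that split, the Python below filters candidates by a length-relative threshold and then orders the survivors by raw distance. Everything here is an assumption for illustration: the function name `suggest`, the `max_relative` parameter and its default of 0.34, and the sample names. Raising the threshold trades false negatives for false positives, so the right value depends on the priorities listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(entry: str, known_values: list[str],
            max_relative: float = 0.34) -> list[str]:
    """Return known values close to `entry`, nearest first.

    Filtering uses the relative distance (distance / longer length);
    ordering uses the raw distance, with ties broken alphabetically.
    """
    scored = []
    for value in known_values:
        distance = levenshtein(entry, value)
        longest = max(len(entry), len(value), 1)  # avoid dividing by zero
        if distance / longest <= max_relative:
            scored.append((distance, value))
    scored.sort()
    return [value for _, value in scored]


names = ["Johnson", "Jonsson", "Jackson", "Smith"]
print(suggest("Jonson", names))                    # strict threshold
print(suggest("Jonson", names, max_relative=0.5))  # looser threshold
```

With the default threshold only "Johnson" and "Jonsson" (one edit each) are offered; loosening it to 0.5 also admits "Jackson" (three edits), placed last.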

It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, both of which take into account the length of the strings and the number of characters in common:

The Jaro distance dj of two given strings s1 and s2 is
dj = (m / |s1| + m / |s2| + (m - t) / m) / 3
where:
m is the number of matching characters
t is the number of transpositions
Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
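For reference, here is a hedged Python sketch of the formula quoted above. It follows the usual textbook definitions, which the quote leaves implicit: two characters "match" only if they are equal and no further apart than floor(max(|s1|, |s2|) / 2) - 1 positions, and t is half the number of matched characters that appear in a different order. The defaults p = 0.1 and a prefix cap of 4 are the conventional values, not something stated in this thread. Note that despite the name "distance", the result is a similarity: 1.0 means identical, 0.0 means nothing in common.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Equal characters only count as matching inside this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of length l <= max_prefix."""
    dj = jaro(s1, s2)
    prefix_len = 0  # this is l in the description above
    for a, b in zip(s1, s2):
        if a != b or prefix_len == max_prefix:
            break
        prefix_len += 1
    return dj + prefix_len * p * (1 - dj)
```

On the commonly cited pair "MARTHA" / "MARHTA" this gives a Jaro value of about 0.944 (m = 6, t = 1) and a Jaro-Winkler value of about 0.961 thanks to the shared prefix "MAR".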