Calculating relative Levenshtein distance - does it make sense?

Posted on 2024-09-26 05:57:19

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Is Levenshtein distance supposed to be used as an absolute value? If I have a 20-letter word, a distance of 4 is not so bad. If the word has 4 letters...

What I am now doing is taking distance / length to get a value that better reflects what percentage of the word has been changed.

Is that a valid/proven approach? Or is it plain stupid?
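
The distance / length idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are made up for the example, the distance is the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions cost 1), and the divisor is the length of the longer string, which keeps the result between 0 and 1. Dividing by the entry length or the stored value's length would be an equally plausible reading of "length".

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (a restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ab" <-> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def relative_distance(a: str, b: str) -> float:
    """Distance divided by the longer length (assumed normalisation), in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return osa_distance(a, b) / longest


if __name__ == "__main__":
    print(relative_distance("cat", "scat"))  # 0.25
    print(relative_distance("abcd", "abdc"))  # 0.25 (one transposition)
```

With this normalisation a distance of 4 on a 20-letter word gives 0.2, while a distance of 4 on a 4-letter word gives 1.0, which is the contrast the question is about.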

Comments (2)

苄①跕圉湢 2024-10-03 05:57:19

> Is Levenshtein distance supposed to be used as an absolute value?

It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than in distance per se.)

> I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?

Are you doing spell-checking, or conforming invalid input to a known set of values?
What are your priorities?

  • Minimize false positives (try to make sure all suggested words are very "similar", and the list of suggestions is short)
  • Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
  • Maximize average matching accuracy

You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.
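
As a rough sketch of that split, the snippet below uses a length-relative cutoff to decide which words are offered at all, and the absolute distance to order them. The `suggest` function, the 0.34 threshold and the sample vocabulary are all hypothetical choices for illustration, and plain Levenshtein is used where Damerau-Levenshtein could be swapped in.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def suggest(entry: str, vocabulary: Iterable[str],
            max_relative: float = 0.34) -> List[str]:
    """Filter by relative distance, then order by absolute distance.

    max_relative is an assumed cutoff; tune it to trade false positives
    against false negatives.
    """
    kept = []
    for word in vocabulary:
        dist = levenshtein(entry, word)
        longest = max(len(entry), len(word)) or 1  # avoid dividing by zero
        if dist / longest <= max_relative:
            kept.append((dist, word))
    # absolute distance first, alphabetical order as a stable tie-break
    return [word for dist, word in sorted(kept)]


if __name__ == "__main__":
    print(suggest("kat", ["scat", "fat", "category", "cat"]))  # ['cat', 'fat']
```

Lowering `max_relative` shortens the list (fewer false positives); raising it makes it more likely that the intended word is present (fewer false negatives).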

It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which take into account the length of the strings and the number of characters in common:

> The Jaro distance dj of two given strings s1 and s2 is
>
> dj = (m / |s1| + m / |s2| + (m - t) / m) / 3
>
> where:
>
>   • m is the number of matching characters
>   • t is the number of transpositions
>
> Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
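
A minimal sketch of the formula quoted above, under the usual definitions: two characters count as matching only if they are equal and no further apart than floor(max(|s1|, |s2|) / 2) - 1, and t is half the number of matched characters that appear in a different order. The defaults p = 0.1 and a prefix cap of 4 are the commonly cited values, taken here as assumptions rather than something the quote specifies.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count
    # the positions where they disagree; t is half of that count.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of length l <= max_prefix."""
    dj = jaro(s1, s2)
    l = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l == max_prefix:
            break
        l += 1
    return dj + l * p * (1 - dj)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Note that both functions return a similarity (higher means more alike) even though the measure is often called a distance; 1 minus the value can be used where a distance is needed.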

烟若柳尘 2024-10-03 05:57:19

The Levenshtein distance is a relative value between two words. Comparing the LD to the length is not relevant, e.g.

cat -> scat = 1 (75% similar??)

difference -> differences = 1 (90% similar??)

Both pairs have a lev distance of 1, i.e. they differ by one character, but when compared to their lengths the second pair would appear to be 'more' similar.

I use soundexing to rank words that have the same lev distance, e.g.

cat and fat both have an LD of 1 relative to kat, but when using soundex cat is a more likely match for kat than fat is (assuming the word is incorrectly spelt, not incorrectly typed!)
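
One way to picture this tie-break is sketched below, treating kat as the entry and cat and fat as the candidates. The `toy_phonetic_key` function is a deliberately tiny, hypothetical stand-in that only folds c, k and q together; it is not Soundex or Daitch-Mokotoff, and a real phonetic encoder would be passed in through the `phonetic_key` parameter instead.

```python
from typing import Callable, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def toy_phonetic_key(word: str) -> str:
    """HYPOTHETICAL placeholder: treats c, k and q as the same sound."""
    return word.lower().translate(str.maketrans("ckq", "kkk"))


def rank_candidates(entry: str, candidates: Iterable[str],
                    phonetic_key: Callable[[str], str] = toy_phonetic_key) -> List[str]:
    """Sort by edit distance; among equal distances, phonetic matches first."""
    entry_key = phonetic_key(entry)
    return sorted(
        candidates,
        key=lambda w: (levenshtein(entry, w), phonetic_key(w) != entry_key),
    )


if __name__ == "__main__":
    print(rank_candidates("kat", ["fat", "cat"]))  # ['cat', 'fat']
```

Because the phonetic comparison is only the second sort key, it never overrides the edit distance; it just orders words that are otherwise tied.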

So the short answer is: just use the lev distance to determine the similarity.
