Calculating relative Levenshtein distance - does it make sense?
I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Is Levenshtein distance supposed to be used as an absolute value? If I have a 20-letter word, a distance of 4 is not so bad. If the word has 4 letters...
What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.
Is that a valid/proven approach? Or is it plain stupid?
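To make the distance / length idea concrete, here is a minimal sketch. It assumes the restricted (optimal string alignment) variant of Damerau-Levenshtein and divides by the length of the longer string; the question does not say which length is used, so that choice and the function names are assumptions for illustration only.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def normalized_distance(a: str, b: str) -> float:
    """Distance divided by the longer length (assumed normalizer), in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return osa_distance(a, b) / longest


print(normalized_distance("cat", "scat"))
print(normalized_distance("difference", "differences"))
```

Both pairs have an absolute distance of 1, but the normalized value is 1/4 for the short pair and 1/11 for the long one, which is the effect described above.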
Comments (2)
It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than distance per se.)
Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?
Are you doing spell-checking? Or conforming invalid input to a known set of values?
What are your priorities?
You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.
It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which take into account the length of the strings and the number of characters in common:
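For reference, a minimal sketch of Jaro and Jaro-Winkler similarity as they are commonly defined. The function names, the prefix scale of 0.1, and the prefix cap of 4 characters are the usual textbook choices, assumed here for illustration rather than taken from any particular library.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters count as matching only within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Unlike a raw edit distance, both scores are already normalized by the string lengths, so no separate division step is needed.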
The Levenshtein distance (LD) is a relative value between two words. Comparing the LD to the length is not relevant, e.g.
cat -> scat = 1 (75% similar??)
difference -> differences = 1 (90% similar??)
Both of these pairs have a Levenshtein distance of 1, i.e. they differ by one character, but when compared to their lengths the second pair would appear to be 'more' similar.
I use soundexing to rank words that have the same Levenshtein distance, e.g. cat and fat both have an LD of 1 relative to kat, but the word is more likely to be cat than fat when using soundex (assuming the word is incorrectly spelt, not incorrectly typed!).
So the short answer is: just use the Levenshtein distance to determine the similarity.
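A rough sketch of the tie-breaking idea follows. The sound-class table is a toy stand-in invented for this example; it is neither Soundex nor Daitch-Mokotoff, and a real phonetic encoder would take its place. `toy_phonetic_key` and `rank_tied` are hypothetical names, and the candidates passed in are assumed to have already been found to share the same edit distance to the entry.

```python
# Toy sound classes (an assumption for illustration only, not a real
# Soundex or Daitch-Mokotoff table): letters that often sound alike
# are mapped to the same symbol.
SOUND_CLASS = {
    "c": "K", "k": "K", "q": "K", "g": "K",
    "f": "F", "v": "F",
    "b": "P", "p": "P",
    "d": "T", "t": "T",
    "s": "S", "z": "S",
}


def toy_phonetic_key(word: str) -> str:
    """Replace each letter by its sound class, leaving other letters as-is."""
    return "".join(SOUND_CLASS.get(ch, ch) for ch in word.lower())


def rank_tied(entry: str, tied_candidates: list[str]) -> list[str]:
    """Order equal-distance candidates so phonetic matches come first.

    sorted() is stable, so candidates within each group keep their order.
    """
    entry_key = toy_phonetic_key(entry)
    return sorted(tied_candidates, key=lambda w: toy_phonetic_key(w) != entry_key)


print(rank_tied("kat", ["fat", "cat"]))  # ['cat', 'fat']
```

Here the edit distance decides which words are candidates at all, and the phonetic key only orders candidates that the distance cannot tell apart.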