Levenshtein distance based methods vs. Soundex

Posted on 2024-07-04 21:18:30

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.

Comments (4)

任性一次 2024-07-11 21:18:30

Soundex is rather primitive: it was originally designed to be calculated by hand. It reduces a name to a short key (the first letter plus three digits), and two names are treated as a match when their keys are equal.

Soundex works well with Western names, as it was originally developed for US census data. It's intended for phonetic comparison.

Levenshtein distance looks at two strings and counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. In effect, it's looking for missing, extra, or substituted letters.

Basically, Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.

Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
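To make the typo case concrete, here is a minimal sketch of Levenshtein distance using the usual row-by-row dynamic programming table. The function name `levenshtein` is my own choice for illustration, not something from the thread, and a production system would more likely use an existing library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into the empty prefix costs i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levnshtein", "Levenshtein"))  # one missing letter
print(levenshtein("Schmidt", "Smith"))           # a much larger distance
```

The mistyped "Levnshtein" is a single edit away from the correct spelling, so it is easy to suggest a correction. "Schmidt" and "Smith" sound alike but are several edits apart, which is why a pure edit-distance check tends to miss that pair.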

腹黑女流氓 2024-07-11 21:18:30

I would suggest using Metaphone, not Soundex. As noted, Soundex is old: it dates from the early 20th century (it was patented in 1918) and was designed around American names. Metaphone gives more useful results when checking the work of poor spellers who are "sounding it out" and spelling phonetically.

Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key. (Plain Levenshtein counts a transposition as two edits; the Damerau-Levenshtein variant counts it as one.)

Consider the application to decide which will fit your users best, or use both together, with Metaphone complementing the suggestions produced by Levenshtein.

With regard to the original question, I've used n-grams successfully in information retrieval applications.
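The comment doesn't say how the n-grams were used, so the following is only one common approach and an assumption on my part: split each string into character bigrams and score the overlap with the Dice coefficient. The names `char_ngrams` and `dice_similarity` are hypothetical helpers made up for this sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams of `text` (case-insensitive)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams: 2 * shared / total, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # both strings are shorter than n, so nothing to compare
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * shared / total


print(dice_similarity("Levnshtein", "Levenshtein"))  # high: most bigrams shared
print(dice_similarity("night", "nacht"))             # low: only "ht" is shared
```

Unlike Soundex, this gives a graded score rather than a match/no-match key, and unlike Levenshtein it can be served from an inverted index of n-grams, which is part of why it shows up in retrieval systems.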

萌逼全场 2024-07-11 21:18:30

I agree with you on Daitch-Mokotoff. Soundex is biased because the original US census takers wanted 'Americanized' names.

Maybe an example of the difference would help:

Soundex puts additional weight on the start of a word: the key is the first letter, kept as-is, followed by at most three digits for the consonant sounds that come after it. So while "Schmidt" and "Smith" will match, "Smith" and "Wmith" won't.

Levenshtein's algorithm would be better for finding typos: one or two missing or replaced letters still give a small distance (a close match), while the phonetic impact of those letters matters much less.

I don't think either is better, and I'd consider using both a distance algorithm and a phonetic one for helping users correct typed input.
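The first-letter effect is easy to see with a small sketch of American Soundex. This is a simplified illustration written for this thread, not a reference implementation: it assumes plain ASCII names, and the `soundex` function and `CODES` table are names I made up.

```python
# Consonant groups that share a digit; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex key: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()                 # the first letter is kept verbatim
    previous_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                         # h and w do not separate equal codes
        code = CODES.get(ch, "")             # vowels map to "" and reset the run
        if code and code != previous_code:
            key += code
            if len(key) == 4:
                break
        previous_code = code
    return key.ljust(4, "0")                 # pad short keys with zeros


for name in ("Schmidt", "Smith", "Wmith"):
    print(name, soundex(name))
```

"Schmidt" and "Smith" both come out as S530, while "Wmith" becomes W530. A single wrong first letter breaks the match entirely, even though it is only one edit away under Levenshtein.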

洒一地阳光 2024-07-11 21:18:30

@Keith:

As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and, I'd argue, for the US too).

I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.
