Algorithm for comparing words (not alphabetically)

Posted 2024-07-20 13:59:41

I need to code a solution for a certain requirement, and I wanted to know if anyone is either familiar with an off-the-shelf library that can achieve it, or can direct me to the best practice. Description:

The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input must be a member of the list, but since it is user input, he/she may have made a mistake. I'm looking for an algorithm that will tell me the most probable word the user meant. I don't have any context and I can't force the user to choose from a list (i.e. he must be able to input the word freely and manually).

For example, say the list contains the words "water", "quarter", "beer", "beet", "hell", "hello" and "aardvark".

The solution must account for different types of "normal" errors:

  • Speed typos (e.g. doubled characters, dropped characters, etc.)
  • Keyboard adjacent-character typos (e.g. "qater" for "water")
  • Non-native English typos (e.g. "quater" for "quarter")
  • And so on...

The obvious solution is to compare letter-by-letter and give "penalty weights" to each different letter, extra letter and missing letter. But this solution ignores thousands of "standard" errors I'm sure are listed somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I’m open to data-heavy solutions).
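
For reference, the letter-by-letter comparison with penalty weights described above can be sketched as a weighted edit distance. The weights and the tiny QWERTY adjacency table below are purely illustrative (a real error model would cover the whole keyboard and be tuned on observed mistakes):

```python
# Hypothetical penalty weights -- tune these for your error model.
INSERT_COST = 1.0
DELETE_COST = 1.0
SUBSTITUTE_COST = 1.0
ADJACENT_COST = 0.5  # discounted substitution for keyboard neighbours

# A tiny sample of QWERTY adjacency; a real table would cover every key.
ADJACENT = {
    "q": "wa", "w": "qase", "a": "qwsz", "s": "awedxz", "e": "wsdr",
}

def sub_cost(a, b):
    """Substitution penalty, discounted for adjacent keys."""
    if a == b:
        return 0.0
    if b in ADJACENT.get(a, "") or a in ADJACENT.get(b, ""):
        return ADJACENT_COST
    return SUBSTITUTE_COST

def weighted_distance(source, target):
    """Dynamic-programming edit distance using the weights above."""
    m, n = len(source), len(target)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * DELETE_COST
    for j in range(1, n + 1):
        dist[0][j] = j * INSERT_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + DELETE_COST,
                dist[i][j - 1] + INSERT_COST,
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[m][n]

def best_match(word, options):
    """Return the option with the lowest weighted distance to `word`."""
    return min(options, key=lambda option: weighted_distance(word, option))

words = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]
print(best_match("qater", words))  # -> water (q/w are neighbours, cost 0.5)
```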

I'm coding in Python but I consider this question language-agnostic.

Any recommendations/thoughts?

瑕疵 2024-07-27 13:59:41

You want to read how Google does this: http://norvig.com/spell-correct.html

Edit: Some people have mentioned algorithms that define a metric between a user-given word and a candidate word (Levenshtein, Soundex). This is however not a complete solution to the problem, since one would also need a data structure to efficiently perform a non-Euclidean nearest-neighbour search. This can be done e.g. with the Cover Tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
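
A minimal sketch of the candidate-generation idea from Norvig's article, adapted to a fixed option list. Norvig ranks candidates by corpus frequency; with no context and a fixed list, this sketch simply takes any option reachable within one edit (falling back to two):

```python
import string

def edits1(word):
    """All strings one simple edit away from `word` (Norvig-style)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, options):
    """Pick an option within one edit of `word`, then two, else None."""
    options = set(options)
    if word in options:
        return word
    hits = edits1(word) & options
    if hits:
        return min(hits)  # deterministic tie-break, no frequency model
    # Two edits away: apply edits1 twice (can be slow for long words).
    for e1 in edits1(word):
        hits |= edits1(e1) & options
    return min(hits) if hits else None

words = {"water", "quarter", "beer", "beet", "hell", "hello", "aardvark"}
print(correct("watr", words))  # -> water (one inserted character away)
```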

不即不离 2024-07-27 13:59:41

A common solution is to calculate the Levenshtein distance between the input and your fixed texts. The Levenshtein distance of two strings is just the number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one of the strings into the other.
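
A compact two-row implementation plus an argmin over the option list might look like this (a sketch; C-backed libraries such as python-Levenshtein are much faster for large lists):

```python
def levenshtein(a, b):
    """Classic edit distance, keeping only two DP rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

options = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]
print(min(options, key=lambda w: levenshtein("beeet", w)))  # -> beet
```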

百变从容 2024-07-27 13:59:41

Have you considered algorithms that compare by phonetic sound, such as Soundex? It shouldn't be too hard to produce Soundex representations of your list of words, store them, and then take the Soundex of the user input and find the closest match there.
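
A simplified Soundex sketch for illustration (it skips the h/w separator rule of the full American Soundex, so codes can differ from canonical implementations in edge cases):

```python
def soundex(word):
    """Simplified American Soundex: first letter + three digits."""
    codes = {}
    for digit, letters in enumerate(
            ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
        for c in letters:
            codes[c] = str(digit)
    word = word.lower()
    encoded = [codes.get(c, "0") for c in word]  # "0" = vowel/h/w/y
    # Collapse runs of the same digit, then drop the zeros.
    collapsed = []
    for d in encoded:
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    result = word[0].upper() + "".join(d for d in collapsed[1:] if d != "0")
    return (result + "000")[:4]  # pad/truncate to four characters

print(soundex("water"), soundex("watter"))  # -> W360 W360 (same code)
```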

清眉祭 2024-07-27 13:59:41

Look for the Bitap algorithm. It is well suited to what you want to do, and the Wikipedia article even includes a source code example.
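
The fuzzy version in the Wikipedia article is longer; this sketch shows only the exact-match core of bitap (the Shift-And formulation with one bitmask per character):

```python
def bitap_search(text, pattern):
    """Exact-match bitap (Shift-And): index of the first occurrence of
    `pattern` in `text`, or -1. The fuzzy variant in the Wikipedia
    article adds one extra bit-vector per allowed error."""
    m = len(pattern)
    if m == 0:
        return 0
    # Bit i of masks[c] is set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):  # highest bit set: full match ends here
            return pos - m + 1
    return -1

print(bitap_search("aardvark", "dva"))  # -> 3
```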

留蓝 2024-07-27 13:59:41

If your data set is really small, simply comparing the Levenshtein distance on all items independently ought to suffice. If it's larger, though, you'll need to use a BK-Tree or similar indexing system. The article I linked to describes how to find matches within a given Levenshtein distance, but it's fairly straightforward to adapt to do nearest-neighbor searches (and left as an exercise to the reader ;).
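
A minimal BK-tree sketch: each child is keyed by its Levenshtein distance to the parent, and the triangle inequality prunes whole subtrees during a radius query (the `within` method below does the range search described; adapting it to pure nearest-neighbour is the exercise mentioned):

```python
def levenshtein(a, b):
    """Plain edit distance -- the metric the tree is built on."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Minimal BK-tree: children keyed by distance to their parent."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            current, children = node
            d = levenshtein(word, current)
            if d == 0:
                return  # duplicate word
            if d in children:
                node = children[d]
            else:
                children[d] = (word, {})
                return

    def within(self, word, radius):
        """All stored words within `radius` edits of `word`."""
        results, stack = [], [self.root]
        while stack:
            current, children = stack.pop()
            d = levenshtein(word, current)
            if d <= radius:
                results.append((d, current))
            # Triangle inequality: only children keyed in
            # [d - radius, d + radius] can contain matches.
            for dist, child in children.items():
                if d - radius <= dist <= d + radius:
                    stack.append(child)
        return sorted(results)

tree = BKTree(["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"])
print(tree.within("qater", 2))  # -> [(1, 'water'), (2, 'quarter')]
```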

风追烟花雨 2024-07-27 13:59:41

Though it may not solve the entire problem, you may want to consider using the Soundex algorithm as part of the solution. A quick Google search for "soundex" and "python" turns up several Python implementations of the algorithm.

那支青花 2024-07-27 13:59:41

Try searching for "Levenshtein distance" or "edit distance". It counts the number of edit operations (delete, insert, change letter) you need to transform one word into another. It's a common algorithm, but depending on the problem you might need something special with different weights for the different types of typos.
