Edit distance in Python
I'm writing a spell checker in Python. I have a list of valid words (the dictionary), and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.
I know I need to start by generating a list of words with an edit distance of 1 from the invalid word (and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...), that should output a list of words with an edit distance of 1: inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one fewer letter, and changes outputs all valid words with one different letter.
I've checked a bunch of places, but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time-consuming. If anyone could offer some insight, I'd be extremely grateful.
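One way to avoid scanning the dictionary repeatedly is to generate every candidate string one edit away from the misspelled word, repeat that once more, and then intersect the result with a set built from the dictionary. The sketch below is only an illustration of that idea: the names edits1 and within_two_edits are hypothetical, and it assumes lowercase ASCII words. It returns words reachable in one or two edits, so filter further if only distance exactly 2 is wanted.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string one insert, deletion, or change away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one character.
    deletions = {left + right[1:] for left, right in splits if right}
    # Replace one character with a different one.
    changes = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    # Add one character at any position.
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletions | changes | inserts


def within_two_edits(word, dictionary):
    """Return the dictionary words reachable from word in one or two edits."""
    valid = set(dictionary)  # set membership is O(1) on average
    one = edits1(word)
    two = {e2 for e1 in one for e2 in edits1(e1)}
    return sorted(((one | two) & valid) - {word})


if __name__ == "__main__":
    words = ["cat", "cart", "chart", "dog", "at", "cast", "scatter"]
    print(within_two_edits("caat", words))
```

The dictionary is only touched through set lookups, so the cost depends on the length of the word and the alphabet size rather than on the number of dictionary entries.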
Comments (11)
You need the minimum edit distance for this task.
The following is my version of MED, a.k.a. Levenshtein distance.
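As an illustration (not necessarily the code this answer originally contained), a minimal sketch of the standard dynamic-programming Levenshtein distance might look like this. The function name levenshtein is an assumption chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletions, and
    substitutions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so both time and memory grow with the product of the two lengths.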
Fine-tuned code based on the version from @Santosh, which should address the issue brought up by @Artur Krajewski. The biggest difference is replacing an effective 2d matrix
Following up on @krassowski's answer
What you are looking for is called an edit distance, and here is a nice explanation on the wiki. There are many ways to define a distance between two words; the one you want is called Levenshtein distance, and here is a DP (dynamic programming) implementation in Python.
A couple more implementations are here.
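Applied to the original question, a Levenshtein function can filter the dictionary in a single pass. The sketch below is one possible arrangement, not the linked implementation: it keeps only two rows of the DP table at a time and skips words whose length alone rules them out. The names levenshtein_two_rows and words_within are hypothetical.

```python
def levenshtein_two_rows(a, b):
    """Levenshtein distance keeping only the previous and current DP rows."""
    previous = list(range(len(b) + 1))  # distances from "" to each b[:j]
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def words_within(word, dictionary, max_distance=2):
    """Return dictionary words at most max_distance edits from word."""
    return [
        w for w in dictionary
        # A length gap larger than max_distance can never be bridged.
        if abs(len(w) - len(word)) <= max_distance
        and levenshtein_two_rows(word, w) <= max_distance
    ]


if __name__ == "__main__":
    words = ["cat", "cart", "chart", "dog", "at", "scatter"]
    print(words_within("caat", words))
```

This still visits every dictionary entry once, so it suits small or medium word lists; the length check is a cheap way to discard many entries before running the DP.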
difflib in the standard library has various utilities for sequence matching, including the get_close_matches function that you could use. It uses an algorithm adapted from Ratcliff and Obershelp. From the docs
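A short usage sketch, with a word list modeled on the example in the difflib documentation. The n and cutoff arguments are shown with their default values; note that cutoff is a similarity ratio between 0 and 1, not an edit distance.

```python
from difflib import get_close_matches

dictionary = ["ape", "apple", "peach", "puppy"]

# n limits the number of suggestions; cutoff is the minimum similarity
# ratio (0.0 to 1.0) a word needs in order to be returned.
suggestions = get_close_matches("appel", dictionary, n=3, cutoff=0.6)

if __name__ == "__main__":
    print(suggestions)
```

The results come back sorted by similarity, best match first.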
I would recommend not writing this kind of code on your own. There are libraries for that.
For instance, the Levenshtein library.
Here is my version of Levenshtein distance
Using SequenceMatcher from Python's built-in difflib is another way of doing it, but (as correctly pointed out in the comments) the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).
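A minimal sketch of that approach follows. The helper name similarity and its ignore_spaces flag are hypothetical; SequenceMatcher and its ratio() method are part of difflib. The value returned is a similarity ratio between 0 and 1, not a count of edits.

```python
from difflib import SequenceMatcher


def similarity(a, b, ignore_spaces=False):
    """Return SequenceMatcher's similarity ratio for a and b (0.0 to 1.0)."""
    # isjunk is a one-argument predicate; characters it flags are treated
    # as junk when the matcher looks for matching blocks.
    isjunk = (lambda ch: ch == " ") if ignore_spaces else None
    return SequenceMatcher(isjunk, a, b).ratio()


if __name__ == "__main__":
    print(similarity("abcd", "bcde"))
    print(similarity("spell check", "spellcheck", ignore_spaces=True))
```

A ratio of 1.0 means the strings are identical, so a threshold on this value can stand in for a maximum edit distance when an exact count is not required.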
Similar to Santoshi's solution above but I made three changes:
Instead of running the Levenshtein distance algorithm against every dictionary word, use a BK-tree or a trie; these data structures typically let you skip most of the dictionary during a lookup. Reading up on these topics will give a detailed description.
This link will help you learn more about spell checking.
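To make the BK-tree suggestion concrete, here is a small sketch under stated assumptions: the class name BKTree and its methods are hypothetical, and the tree uses a compact Levenshtein function as its metric. Each node stores its children keyed by their distance to the node, and the triangle inequality lets a search skip whole subtrees.

```python
def levenshtein(a, b):
    """Levenshtein distance computed with two rolling DP rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is a (word, children) pair; children maps distance -> node."""

    def __init__(self, words, distance=levenshtein):
        self.distance = distance
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: matches can only live under edges whose
            # label lies in [d - max_distance, d + max_distance].
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


if __name__ == "__main__":
    tree = BKTree(["cat", "cart", "chart", "dog", "at", "scatter"])
    print(tree.search("caat", 2))
```

The tree is built once from the dictionary and reused for every lookup; how much of it a search can skip depends on the word list and on max_distance.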