Getting the most relevant word from enchant suggest() (spell check) in Python
I want to get the most relevant word from enchant suggest(). Is there a better way to do this? I feel my function is not efficient when it comes to checking a large set of words, in the range of 100k or more.

Problem with enchant suggest():
>>> import enchant
>>> d = enchant.Dict("en_US")  # or whichever language dictionary you use
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']
My function to get the appropriate word from a set of suggested words:
import enchant, difflib

d = enchant.Dict("en_US")  # or whichever language dictionary you use
word = "prfomnc"
dict, max = {}, 0

a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio()
    dict[tmp] = b
    if tmp > max:
        max = tmp
print dict[max]

Result: performance
Update: if I get multiple keys, i.e. several words with the same difflib ratio() value, I use a multi-value dictionary, as explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/
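For example, the multi-value-dictionary idea could look roughly like this with collections.defaultdict (the en_US dictionary is just an assumption for illustration):

import enchant, difflib
from collections import defaultdict

d = enchant.Dict("en_US")       # assumed language, for illustration
word = "prfomnc"

by_ratio = defaultdict(list)    # ratio -> every suggestion with that ratio
for suggestion in d.suggest(word):
    score = difflib.SequenceMatcher(None, word, suggestion).ratio()
    by_ratio[score].append(suggestion)

best = max(by_ratio)            # highest ratio seen
print(by_ratio[best])           # all suggestions tied at the best ratio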
2 Answers
No magic bullet, I'm afraid... a few suggestions however.
I'm guessing that most of the time in the logic is spent in difflib's SequenceMatcher().ratio() call. This wouldn't be surprising, since this method uses a variation of the Ratcliff-Obershelp algorithm which is relatively expensive, CPU-wise (but the metric it produces is rather "on the mark" for locating close matches, and that is probably why you like it).
To be sure, you should profile this logic and confirm that SequenceMatcher() is indeed the hot spot. Maybe Enchant.suggest() is also a bit slow, but there would be little we could do, code-wise, to improve that (configuration-wise, there may be a few options, e.g. doing away with the personal dictionary to save the double look-up and merge, etc.).
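One quick way to check is the standard cProfile module; the harness below is only an illustration (the word list and the en_US dictionary are made up, not taken from the question), and it should be run as a top-level script so cProfile.run() can see the names:

import cProfile
import difflib
import enchant

d = enchant.Dict("en_US")       # assumed language, for illustration

def best_match(word):
    # Same scoring approach as in the question: pick the suggestion
    # with the highest SequenceMatcher ratio.
    suggestions = d.suggest(word)
    if not suggestions:
        return word
    return max(suggestions,
               key=lambda s: difflib.SequenceMatcher(None, word, s).ratio())

# Profile a batch of lookups; the misspellings below are purely illustrative.
words = ["prfomnc", "speling", "korrect", "writting"] * 250
cProfile.run("for w in words: best_match(w)", sort="cumulative")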
Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following (a rough sketch combining these ideas follows the list):

- Only run SequenceMatcher against the top few Enchant suggestions. After all, Enchant.suggest() returns its suggestions in an ordered fashion, best guesses first; so although the Enchant order is based on different heuristics, there is value in it as well, and the chances of finding a high-ranking match probably diminish as we move down the list. Even though we may end up ignoring a few such high-ranking matches, by testing only the top few Enchant suggestions we somehow combine the "wisdom" found in Enchant's heuristics with that of the Ratcliff-Obershelp metric.
- Exit the loop early. The idea is similar to the preceding one: avoid calling SequenceMatcher once the odds of finding something better are getting smaller (and once we have a decent, if not best, choice in hand).
- Pre-qualify the candidates with a cheap test. The idea is to have a relatively quick/inexpensive test which may tell us that a given word is unlikely to score well on the SequenceMatcher ratio; for example, exclude words which do not have at least, say, the length of the user's string minus two characters in common with it. BTW, you can maybe use some of the SequenceMatcher object's [quicker] functions to get some data for the filtering heuristics, at least in some cases.
- Drop the dictionary and just keep the best match found so far. Apparently only the top choice matters, so except for test purposes you may not need the [relatively small] overhead of the dictionary.
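Here is a minimal sketch of how those ideas might fit together; the cut-off values, the function name, and the en_US dictionary are illustrative assumptions, not part of the original code:

import difflib
import enchant

d = enchant.Dict("en_US")       # assumed language, for illustration

def best_suggestion(word, top_n=5, good_enough=0.9):
    # top_n and good_enough are arbitrary illustrative cut-offs.
    best_word, best_score = None, 0.0

    # Only score the first few suggestions: Enchant already orders them
    # by its own heuristics, so the tail is less likely to win.
    for candidate in d.suggest(word)[:top_n]:
        matcher = difflib.SequenceMatcher(None, word, candidate)

        # Cheap pre-filter: real_quick_ratio() and quick_ratio() are upper
        # bounds on ratio(), so if they cannot beat the best score so far,
        # the full (expensive) ratio() cannot either.
        if matcher.real_quick_ratio() <= best_score or matcher.quick_ratio() <= best_score:
            continue

        score = matcher.ratio()
        if score > best_score:
            # Track only the best pair; no dictionary needed.
            best_word, best_score = candidate, score

        # Early exit once we have a "good enough" match in hand.
        if best_score >= good_enough:
            break

    return best_word

print(best_suggestion("prfomnc"))   # e.g. 'performance'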
HTH, good luck ;-)
You don't actually need to keep a dict if you are only interested in the best match.
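For instance, a minimal sketch of the same idea (the enchant setup and the test word are only for illustration):

import difflib
import enchant

d = enchant.Dict("en_US")       # assumed language, for illustration
word = "prfomnc"

best_word, best_score = None, 0.0
for candidate in d.suggest(word):
    score = difflib.SequenceMatcher(None, word, candidate).ratio()
    if score > best_score:      # keep only the best match seen so far
        best_word, best_score = candidate, score

print(best_word)                # 'performance'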