从“enchant suggest()”中获取最相关的单词(拼写检查)在Python中

发布于 2024-11-19 22:26:52 字数 882 浏览 2 评论 0原文

我想从 enchant suggest() 中获取最相关的单词。有没有更好的方法可以做到这一点。我觉得我的函数在检查 100k 或更多范围内的大量单词时效率不高。

enchant suggest() 的问题:

>>> import enchant
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']

我的函数从一组建议单词中获取适当的单词:

import enchant, difflib

word="prfomnc"
dict,max = {},0
a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio();
    dict[tmp] = b
    if tmp > max:
       max = tmp

print dict[max]

Result: performance

更新:

如果我获得多个键,则意味着相同的 difflibratio() 值,我用的是多键字典。如此处所述: http:// code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/

I want to get the most relevant word from enchant suggest(). Is there any better way to do that. I feel my function is not efficient when it comes to checking large set of words in the range of 100k or more.

Problem with enchant suggest():

>>> import enchant
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']

My function to get the appropriate word from a set of suggested words:

import enchant, difflib

word="prfomnc"
dict,max = {},0
a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio();
    dict[tmp] = b
    if tmp > max:
       max = tmp

print dict[max]

Result: performance

Updated:

if I get multiple keys, meaning same difflib ratio() values, I use multi-key dictionary. As explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

盛夏已如深秋| 2024-11-26 22:26:53

恐怕没有灵丹妙药......但是有一些建议。

我猜测逻辑中的大部分时间都花在 difflib 的 SequenceMatcher().ratio() 调用上。这并不奇怪,因为此方法使用 Rattcliff-Obershelp 算法的变体 从 CPU 角度来看,它相对昂贵(但它产生的指标相当“准确”,可以找到紧密的匹配,这可能就是您喜欢它的原因)。

可以肯定的是,您应该分析此逻辑并确认 SequenceMatcher() 确实是热点。也许 Enchant.suggest() 也有点慢,但我们在代码方面几乎无能为力来改进这一点(在配置方面,可能有一些选择,例如,取消个人字典来保存双重查找和合并等)。

假设 SequenceMatcher() 确实是罪魁祸首,并且假设您希望坚持使用 Ratcliff-Obershelp 相似性度量作为选择最佳匹配的方式,您可以执行以下操作:

  • 仅计算 SequenceMatcher来自 Enchant 的前 (?) 5 项的比率值。
    毕竟,Enchant.suggest() 以有序的方式返回其最佳建议先猜测;因此,虽然基于不同的启发法,附魔顺序也有价值,但当我们在列表中向下移动时,找到高排名匹配的机会可能会减少。此外,尽管如此,我们最终可能会忽略一些这样的高排名比赛,但通过仅测试前几个附魔建议,我们以某种方式将附魔启发式中发现的“智慧”与 Ratcliff-Obershelp 指标中的“智慧”结合起来。
  • 达到一定阈值后停止计算 SequenceMatcher 比率
    这个想法与前面的类似:一旦找到更好的可能性越来越小(并且一旦我们手头有一个不错的选择(即使不是最好的选择)),就避免调用 SequenceMatcher
  • 用你自己的逻辑过滤掉 Enchant 中的一些单词。
    我们的想法是进行一个相对快速/便宜的测试,这可能会告诉我们给定的单词不太可能在 SequenceMatcher 比率上得分很高。例如,排除至少不具有用户字符串长度减去两个共同字符的单词。
    顺便说一句,您也许可以使用 SequenceMatcher 对象的一些 [quicker] 函数来获取一些用于过滤启发式的数据。
  • 使用 SequenceMatcher *quick_ratio*() 函数代替
    至少在某些情况下。
  • 仅将最佳匹配保留在字符串中,而不是使用字典
    显然只有最佳选择才重要,因此除了测试目的之外,您可能不需要[相对较小的]字典开销。
  • 可能考虑编写自己的Ratcliff-Obershelp(或类似)方法,在满足当前最大比率的可能性很小时在其中引入各种早期退出。请注意,可能很难生成一种与 difflib 的 C 语言方法一样高效的方法,您对此感兴趣的是早期退出...

HTH,祝您好运;-)

No magic bullet, I'm afraid... a few suggestions however.

I'm guessing that most of the time in the logic is spent in the difflib's SequenceMatcher().ratio() call. This wouldn't be surprising since this method uses a variation on the Rattcliff-Obershelp algorithm which is relatively expensive, CPU-wise (but the metric it produces is rather "on the mark" to locate close matches, and that is probably why you like it).

To be sure, you should profile this logic and confirm that indeed SequenceMatcher() is the hot spot. Maybe Enchant.suggest() is also a bit slow, but there would be little we could do, code-wise, to improve this (configuration-wise, there may be a few options, for eg. doing away with personal dictionary to save the double look-upup and merge etc.).

Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:

  • only compute the SequenceMatcher ratio value for the top (?) 5 items from Enchant.
    After all, Enchant.suggest() returns its suggestions in an ordered fashion with its best guesses first; therefore while based on different heuristics, there's value in the Enchant order as well, the chances of finding a hi-ranking match probably diminish as we move down the list. Also, even though, we may end up ignoring a few such hi-ranking matches, by testing only the top few Enchant suggestions, we somehow combine the "wisdom" found in Enchant's heuristics with these from the Ratcliff-Obershelp metric.
  • stop computing the SequenceMatcher ratio after a certain threshold has been reached
    The idea is similar to the preceding: avoid calling SequenceMatcher once the odds of finding better are getting smaller (and once we have a decent if not best choice in hand)
  • filter out some of the words from Enchant with your own logic.
    The idea is to have a relatively quick/inexpensive test which may tell us that a given word is unlikely to score well on the SequenceMatcher ratio. For example exclude words which do not have at least, say, length of user string minus two characters in common.
    BTW, you can maybe use some of the SequenceMatcher object's [quicker] functions to get some data for the filtering heuristics.
  • use SequenceMatcher *quick_ratio*() function instead
    at least in some cases.
  • only keep the best match, in a string, rather than using a dictionary
    Apparently only the top choice matters, so except for test purposes you may not need the [relatively small] overhead of the dictionary.
  • you may consider writing your own Ratcliff-Obershelp (or similar) method, introducing therein various early exits when the prospect of meeting the current max ratio is small. BEWARE, it would likely be difficult to produce a method as efficient as the C-language one of difflib, your interest in doing this is with the early exits...

HTH, good luck ;-)

半窗疏影 2024-11-26 22:26:53

如果您只对最佳匹配感兴趣,则实际上不需要保留 dict

>>> word="prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
...   tmp = difflib.SequenceMatcher(None, word, b).ratio()
...   if tmp > best_ratio:
...     best_words = [b]
...     best_ratio = tmp
...   elif tmp == best_ratio:
...     best_words.append(b)
... 
>>> best_words
['performance']

You don't actuall need to keep a dict if you are only interested in the best matches

>>> word="prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
...   tmp = difflib.SequenceMatcher(None, word, b).ratio()
...   if tmp > best_ratio:
...     best_words = [b]
...     best_ratio = tmp
...   elif tmp == best_ratio:
...     best_words.append(b)
... 
>>> best_words
['performance']
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文