Spell checking at scale in Python
Surprisingly I've been unable to find anyone else really doing this, but surely someone has. I'm working on a python project currently that involves spell checking some 16 thousand words. That number of words is only going to grow unfortunately. Right now I'm pulling words from Mongo, iterating through them, and then spell checking them with pyenchant. I've removed mongo as the potential bottleneck by grabbing all my items from there first. That leaves me with around 20 minutes to process through 16k words, which is obviously longer than I want to spend. This leaves me with a couple ideas/questions:
Obviously I could leverage threading or some form of parallelism. Even if I chop this into 4 pieces, I'm still looking at roughly 5 minutes assuming peak performance.
Is there a way to tell what spelling library Enchant is using underneath pyenchant? Enchant's website seems to imply it'll use all available spelling libraries/dictionaries when spell checking. If so, then I'm potentially running each word through three-four spelling dicts. This could be my issue right here, but I'm having a hard time proving that's the case. Even if it is, is my option really to uninstall other libraries? Sounds unfortunate.
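One way to find out, assuming a reasonably recent pyenchant: the broker can list every installed provider, pin the ordering so a single backend is preferred, and report which provider actually backs a given dictionary:

```python
import enchant

broker = enchant.Broker()

# List every provider Enchant can see (aspell, myspell/hunspell, etc.).
for provider in broker.describe():
    print(provider.name, "-", provider.desc)

# Optionally pin the ordering so a single preferred backend is tried first.
broker.set_ordering("en_US", "aspell,myspell")

# A Dict is backed by exactly one provider; check which one serves en_US.
d = broker.request_dict("en_US")
print("en_US is served by:", d.provider.name)
```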
So, any ideas on how I can squeeze at least a bit more performance out of this? I'm fine with chopping this into parallel tasks, but I'd still like to get the core piece of it to be a bit faster before I do.
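If you do chop it into parallel tasks, here's a minimal multiprocessing sketch (the 4-process split and the en_US tag are assumptions; each worker builds its own Dict, since Enchant objects shouldn't be shared across processes):

```python
import multiprocessing

import enchant

_d = None

def _init_worker():
    # Give each worker process its own Dict rather than sharing one.
    global _d
    _d = enchant.Dict("en_US")

def check_word(word):
    # Return suggestions only for misspelled words; suggest() is the slow call.
    if _d.check(word):
        return word, None
    return word, _d.suggest(word)

if __name__ == "__main__":
    words = ["speling", "correct", "korrecter"]  # stand-in for the 16k words
    with multiprocessing.Pool(processes=4, initializer=_init_worker) as pool:
        for word, suggestions in pool.imap_unordered(check_word, words, chunksize=256):
            if suggestions is not None:
                print(word, "->", suggestions[:5])
```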
Edit: Sorry, posting before morning coffee... Enchant generates a list of suggestions for me if a word is incorrectly spelled. That would appear to be where I spend most of my time in this processing portion.
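A quick way to confirm where the time goes is to time check() and suggest() separately (the sample word is just a placeholder):

```python
import time

import enchant

d = enchant.Dict("en_US")
word = "korrecter"  # placeholder misspelling

start = time.perf_counter()
for _ in range(1000):
    d.check(word)
print("1000 x check():  %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
for _ in range(1000):
    d.suggest(word)
print("1000 x suggest(): %.3fs" % (time.perf_counter() - start))
```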
3 Answers
I think we agree that the performance bottleneck here is Enchant; for this size of dataset it's nearly instantaneous to do a boolean `isSpeltCorrectly`. So, why not:

1. Build a set in memory of correctly-spelt words, using the dictionaries Enchant does or fetching your own (e.g. OpenOffice's).
2. Optionally, uniquify the document's words, say by putting them in a `set`. This probably won't save you very much.
3. Check whether each word is in the set or not. This is fast, because it's just a set lookup: O(1) on average, since Python's `set` buckets by hash.
4. If a word isn't in the set, ask Enchant to recommend a word for it. This is necessarily slow.

This assumes that most of your words are spelt correctly; if they aren't, you'll have to be cleverer.
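A minimal sketch of that flow, assuming a plain one-word-per-line list at words.txt (the path and file format are placeholders):

```python
import enchant

# Load a plain word list into a set once; membership tests are then O(1) on average.
with open("words.txt") as f:
    known = {line.strip().lower() for line in f if line.strip()}

d = enchant.Dict("en_US")  # only consulted on the slow suggestion path

def check_words(words):
    suggestions = {}
    for word in set(words):              # uniquify: repeats cost nothing extra
        if word.lower() in known:        # fast path: plain set lookup
            continue
        suggestions[word] = d.suggest(word)  # slow path: only for unknown words
    return suggestions

print(check_words(["hello", "wrold", "hello"]))
```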
I would use a Peter Norvig-style spell checker. I've written a complete post on this:
http://blog.mattalcock.com/2012/12/5/python-spell-checker/
Here's a snippet of the code that generates the possible edits of the word being checked.
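The snippet itself didn't survive the page extraction; this is the edits1 function from Peter Norvig's well-known corrector, which posts in this style are based on. It generates every string one edit away from the input (deletes, transposes, replaces, inserts):

```python
import string

alphabet = string.ascii_lowercase

def edits1(word):
    # Every string exactly one edit away from `word`.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

print(len(edits1("spelling")))  # a few hundred candidate edits
```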
With this code you should be able to iterate through your growing data file of words extremely quickly. See the full post for more information:
http://blog.mattalcock.com/2012/12/5/python-spell-checker/
Perhaps a better way of doing this would be to compress the document, since compression already identifies the repeating instances of words, and you only need to spell check each unique word once. I only suggest this because it would probably perform faster than writing your own unique-word finder.

The compressed version should have references to the unique words somewhere within its file; you might have to look up how the format is structured.

You can then spell check all the unique words. I hope you're not checking them with individual SQL queries or something like that; you should load a dictionary into memory in the form of a tree and check words against that.
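If you do want the tree form, here's a minimal trie sketch (for plain membership checks a built-in set is simpler and at least as fast):

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    # Prefix tree supporting `word in trie` membership checks.

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

dictionary = Trie(["hello", "help", "world"])
print("help" in dictionary)   # True
print("wrold" in dictionary)  # False
```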
Once this is done, simply uncompress it and, hey presto, it's all spell checked. This should be a fairly fast solution.

Or perhaps you don't need to go through the whole zipping process at all, if spell checking really is as fast as the comments suggest; in that case your current slowness would point to a problem in the implementation.