Automatically correcting spelling errors in text input
I am writing a natural language processor in C# that extracts the sentiment (positive/negative) of a sentence. There is an issue, though, with discerning the sentiment of a misspelled word: if it's not in the dictionary, I can neither tag it nor rate it!
I know there has to be a way to handle this. Google gives accurate suggestions all the time; I simply need to take the top suggestion from a similar algorithm and hit the database with it. The problem is, I'm not sure where to start with algorithm names and so forth. I need help figuring that out.
I checked around on the site for similar questions and found some concepts that seemed useful, but the basic way of handling the distance between a misspelling and a real word relied on hitting every word in your data set, which seems horribly inefficient. Some help with ideas to make the algorithm run quickly would also be much appreciated; this analysis engine is supposed to be able to handle multiple thousands of items a day.
Thanks in advance.
Comments (3)
This problem is not that stupid. Norvig wrote an article about it. Generally speaking, the difficulty depends on the accuracy you want. The "easiest" way to do it is to use a prefix tree (trie) to avoid exploring all possibilities.
Basically, the dictionary is stored as a tree of letters, and as long as you can follow the path spelled out by the input word you stay on track. Once you reach a point where you are stuck, you should check how to move on based on the type of error you have.
You can read Norvig's article for a deeper analysis.
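To make the "follow the path until you are stuck" idea concrete, here is a minimal sketch in Python (the language Norvig's article uses; it should port to C# with nested dictionaries). The names `build_trie` and `follow`, the `"$"` end-of-word marker, and the tiny word list are all assumptions made for illustration, not anything taken from the answer above.

```python
def build_trie(words):
    """Store each word as a path of nested dicts; '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def follow(trie, word):
    """Walk the trie along `word`; return (matched_prefix, is_dictionary_word)."""
    node = trie
    for i, ch in enumerate(word):
        if ch not in node:
            # Stuck: no branch for this letter, so the error is at position i.
            return word[:i], False
        node = node[ch]
    return word, "$" in node


# Hypothetical mini-dictionary, for illustration only.
trie = build_trie(["good", "great", "bad", "terrible"])
print(follow(trie, "great"))  # ('great', True)
print(follow(trie, "graet"))  # ('gr', False)
```

The point where the walk stops tells you where to start trying corrections (a substituted, missing, extra, or swapped letter), so you never have to compare the input against words that do not share its prefix.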
The approach given by dierre, including Peter Norvig's article, is certainly worth considering further.
However, for a quick-and-dirty solution: if a possibly misspelled word is not found in your own dictionary, you can try to find a mapping for it in a list of common misspellings.
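As a rough sketch of that fallback, in Python for brevity: look the word up in your own lexicon first, and only then in a misspelling-to-correction table. The `SENTIMENT` and `COMMON_MISSPELLINGS` tables and their entries are hypothetical stand-ins; a real system would load its own lexicon and a published misspellings list.

```python
# Hypothetical data for illustration; load real tables from your own sources.
SENTIMENT = {"terrible": -1.0, "definitely": 0.0, "awesome": 1.0}
COMMON_MISSPELLINGS = {
    "terible": "terrible",
    "definately": "definitely",
    "awsome": "awesome",
}


def resolve(word):
    """Return the dictionary form of `word`, or None if it cannot be resolved."""
    word = word.lower()
    if word in SENTIMENT:
        return word
    return COMMON_MISSPELLINGS.get(word)


def score(word):
    """Return the sentiment score of `word`, or None if it is unknown."""
    return SENTIMENT.get(resolve(word))


print(score("Awsome"))  # 1.0
print(score("xyzzy"))   # None
```

Both lookups are hash-table hits, so this costs almost nothing per word, but it only catches misspellings that someone has already listed.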
The prefix trees mentioned by @dierre are extremely useful if you want to efficiently calculate the edit distance between a misspelling and a large set of dictionary words. Brill and Moore (2000) describe a spelling-correction method that uses prefix trees and follows the same general approach as Norvig and many other spell checkers. Their paper is available here: http://www.ldc.upenn.edu/acl/P/P00/P00-1037.pdf
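One common way to combine a prefix tree with edit distance is sketched below in Python. It is not Brill and Moore's error model: it uses plain Levenshtein distance, computing one row of the distance table per trie node so that words sharing a prefix share the work, and it abandons a branch as soon as no cell in the row is within the limit. The names `build_trie` and `search`, the `"$"` marker, and the word list are assumptions for illustration.

```python
def build_trie(words):
    """Nested-dict trie; the '$' key stores the complete word at its end node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(trie, target, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits of target."""
    results = []

    def walk(node, ch, prev_row):
        # Build the Levenshtein row for the trie path extended by `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(target) + 1):
            row.append(min(
                row[i - 1] + 1,                           # skip a letter of target
                prev_row[i] + 1,                          # skip a letter of the trie path
                prev_row[i - 1] + (target[i - 1] != ch),  # match or substitute
            ))
        if "$" in node and row[-1] <= max_dist:
            results.append((row[-1], node["$"]))
        if min(row) <= max_dist:  # otherwise no longer word can get back in range
            for nxt, child in node.items():
                if nxt != "$":
                    walk(child, nxt, row)

    first_row = list(range(len(target) + 1))  # distances from the empty prefix
    for ch, child in trie.items():
        if ch != "$":
            walk(child, ch, first_row)
    return sorted(results)


# Hypothetical mini-dictionary, for illustration only.
trie = build_trie(["good", "great", "greet", "bad", "terrible"])
print(search(trie, "graet", 2))  # [(1, 'greet'), (2, 'great')]
```

Note that plain Levenshtein distance counts a swap of two adjacent letters as two edits, which is why "great" ranks behind "greet" here; a variant that treats transpositions as a single edit, or a frequency-based ranking like Norvig's, would order the candidates differently.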