English lexicon for search query correction
I'm building a spelling corrector for search engine queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users".
The high-level approach is as follows: for a given query, generate possible correction candidates (words in the query log within a certain edit distance) for each unigram and bigram, then perform a modified Viterbi search to find the most likely sequence of candidates given bigram frequencies. Repeat this process until the sequence probability no longer improves.
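For concreteness, here is a minimal sketch (in Python) of how I currently picture the candidate-generation step; the `vocabulary` set standing in for the query-log words and the edit-distance threshold of 2 are my own illustrative choices, not values from the paper:

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def candidates(word: str, vocabulary: set[str], max_dist: int = 2) -> list[str]:
        """Query-log words within the edit-distance threshold; always keeps
        the original word so the search can leave it uncorrected."""
        found = [w for w in vocabulary if edit_distance(word, w) <= max_dist]
        return found if word in found else found + [word]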
The modification to the Viterbi search is that if two adjacent words are both found in a trusted lexicon, at most one of them can be corrected. This is especially important for avoiding the correction of properly spelled queries to higher-frequency words.
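To make that constraint concrete, here is a rough sketch of the modified Viterbi pass as I understand it; `bigram_freq` (a dict of bigram counts from the query log), `lexicon` (the trusted word set), and the add-one smoothing are my own placeholders rather than details from the paper:

    import math

    def viterbi_correct(query_words, candidate_lists, bigram_freq, lexicon):
        # best maps each candidate for the current position to the
        # (log-score, path) of the best sequence ending in that candidate.
        best = {c: (0.0, [c]) for c in candidate_lists[0]}
        for i in range(1, len(query_words)):
            nxt = {}
            for c2 in candidate_lists[i]:
                scored = []
                for c1, (lp, path) in best.items():
                    # The modification: when both adjacent original words are
                    # in the trusted lexicon, refuse to change both of them.
                    if (query_words[i - 1] in lexicon
                            and query_words[i] in lexicon
                            and c1 != query_words[i - 1]
                            and c2 != query_words[i]):
                        continue
                    freq = bigram_freq.get((c1, c2), 0) + 1  # add-one smoothing
                    scored.append((lp + math.log(freq), path + [c2]))
                if scored:
                    nxt[c2] = max(scored, key=lambda t: t[0])
            best = nxt
        return max(best.values(), key=lambda t: t[0])[1]

The outer iteration would then rerun this pass on its own output until the returned sequence stops changing.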
My question is where to find such a lexicon. It should be in English and contain the proper nouns (first/last names, places, brand names, etc.) likely to show up in search queries, as well as common and uncommon English words. Even a push in the right direction would be useful.
Also, if anyone is reading this and has any suggestions for improvement on the methodology supplied in the paper, I am open to those as well given that this is my first foray into NLP.
1 Answer
The best lexicon for this purpose is probably the Google Web 1T 5-gram data set.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
Unfortunately, it is not free unless your university is a member of LDC.
You could also try the corpora bundled with packages like Python's NLTK, but the Google data set seems best suited to your purpose, since it is already related to search queries.
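If you go the NLTK route, assembling a baseline lexicon from its corpora takes only a few lines. A rough sketch, assuming the `words` and `names` corpora are an acceptable starting point (note that NLTK ships no corpus of brand names, so those would still need another source):

    import nltk

    for corpus in ("words", "names"):
        nltk.download(corpus, quiet=True)

    from nltk.corpus import names, words

    lexicon = {w.lower() for w in words.words()}   # common and uncommon English words
    lexicon |= {n.lower() for n in names.words()}  # male and female first names
    print(len(lexicon), "entries")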