"Did you mean" feature on a dictionary database
I have a table of roughly 300,000 rows containing technical terms, which I query using PHP and MySQL with FULLTEXT indexes. But when I search for a mistyped term, for example "hyperpext", it naturally returns no results.
I need to compensate for small spelling errors and get the nearest record from the database. How can I accomplish such a feature? I know about the Levenshtein distance, Soundex and Metaphone algorithms, but I don't yet have a solid idea of how to apply them when querying a database.
Thanks
Comments (4)
See this article for how you might implement Levenshtein distance in a MySQL stored function.
For posterity, the author's suggestion is to do this:
He also supplies a LEVENSHTEIN_RATIO helper method which will evaluate the ratio of different/total characters, rather than a straight edit distance. For instance, if it's 60%, then three-fifths of the characters in the source word are different from the destination word.
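To make the idea concrete, here is a minimal sketch of edit distance and a ratio helper, written in Python rather than as a MySQL stored function. It is not the linked article's code: the names `levenshtein`, `levenshtein_ratio` and `nearest_terms`, the choice to divide by the longer word's length, and the `max_distance` cutoff are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Edit distance as a percentage of the longer word's length.

    Assumption: the article's exact definition is not shown in this thread,
    so dividing by the longer length is just one plausible reading.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(a, b) / longest


def nearest_terms(query, terms, max_distance=2):
    """Return the terms within max_distance edits of query, closest first."""
    scored = sorted((levenshtein(query, term), term) for term in terms)
    return [term for distance, term in scored if distance <= max_distance]
```

With a term list of `["hypertext", "hyperlink", "metadata"]`, `nearest_terms("hyperpext", ...)` keeps only `hypertext`, which is one substitution away. Comparing the query against all 300,000 rows this way is a full scan, so in practice you would probably narrow the candidates first (for example by word length or first letter). PHP also ships a built-in `levenshtein()` function if you prefer to do the comparison on the application side.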
From the comments of http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html
I suggest that you generate typo variations on the query input.
For example: hyperpext > { hyperpeext, hipertext, ... }
One of these is bound to be the correct spelling (especially for common misspellings).
To identify the most likely match, look each variation up in an index that tells you the document frequency of the term, and prefer the variation with the highest frequency. (Make sense?)
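A rough sketch of that approach in Python, assuming the variations are limited to a single edit (one deleted, swapped, replaced or inserted letter) and that document frequencies are available as a plain dictionary. The names `edits1`, `suggest` and `DOC_FREQ`, and the frequency numbers, are made up for the example; in the real setup the lookup would go against the MySQL table.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings that are one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(query, doc_freq):
    """Return query if it is a known term, else the most frequent
    known variation, or None when no variation is known."""
    if query in doc_freq:
        return query
    candidates = [w for w in edits1(query) if w in doc_freq]
    if not candidates:
        return None
    return max(candidates, key=lambda w: (doc_freq[w], w))


# Hypothetical term -> document frequency index.
DOC_FREQ = {"hypertext": 120, "hypertexts": 3, "hyperlink": 80}
```

Here `suggest("hyperpext", DOC_FREQ)` finds `hypertext`, because replacing the second `p` with `t` is one of the generated variations. A nine-letter word produces a few hundred variations, so checking them all with a single `WHERE term IN (...)` query is likely cheaper than scanning the whole table.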
Why not add a table column for storing the word in its alternate (e.g., Soundex) form? That way, if your first SELECT does not find an exact match, you can do a second search to look for matching alternate forms.
The trick is to encode each word so that misspelled variations end up converted into the same alternate form.
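Something along these lines, sketched in Python with an in-memory SQLite table standing in for the MySQL one. The table and column names (`terms`, `term`, `sound`), the `lookup` helper, and the hand-written four-character American Soundex are assumptions for the example; MySQL's own `SOUNDEX()` and PHP's `soundex()` could fill the column instead, though their output may differ in length from this version.

```python
import sqlite3

# Letter -> Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (word[0].upper() + "".join(digits) + "000")[:4]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (term TEXT PRIMARY KEY, sound TEXT)")
conn.execute("CREATE INDEX idx_terms_sound ON terms (sound)")
for term in ("hypertext", "hyperlink", "metadata"):
    conn.execute("INSERT INTO terms VALUES (?, ?)", (term, soundex(term)))


def lookup(conn, query):
    """Exact match first; fall back to terms sharing the Soundex code."""
    rows = conn.execute(
        "SELECT term FROM terms WHERE term = ?", (query,)).fetchall()
    if not rows:
        rows = conn.execute(
            "SELECT term FROM terms WHERE sound = ? ORDER BY term",
            (soundex(query),)).fetchall()
    return [row[0] for row in rows]
```

With this table, `lookup(conn, "hipertext")` falls back to the Soundex column and returns `hypertext`, since both encode to `H163`. Note that the question's own example, `hyperpext`, encodes to `H161` and would not match, so a phonetic column catches sound-alike misspellings but not every typo; it may be worth combining it with an edit-distance check.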