Fuzzy string search in Java, including word swaps
I am a Java beginner, trying to write a program that matches an input against a list of predefined strings. I have looked at Levenshtein distance, but I have run into problems such as this:
If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.
Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?
Thanks!
Comments (3)
You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).
The basic process is this:
[...] HashSets, one per term. Or, if you want to give different weights to each word, use a HashMap where the word is the key.
There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).
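The set-based idea above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name TermMatcher, the stop-word list, and the choice of Jaccard overlap as the score are all assumptions made for the example. It builds one HashSet of words per predefined term at initialization and then scores an input by how many words it shares with each term, so word order does not matter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: matches an input against predefined terms by shared words.
final class TermMatcher {
    // Assumed stop words that should not influence the score.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "the", "a", "an", "and"));

    // One token set per predefined term, built once at initialization.
    private final Map<String, Set<String>> termTokens = new LinkedHashMap<>();

    TermMatcher(List<String> searchTerms) {
        for (String term : searchTerms) {
            termTokens.put(term, tokenize(term));
        }
    }

    // Lower-cases, splits on whitespace, and drops stop words and empty tokens.
    static Set<String> tokenize(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> tokens = new HashSet<>(Arrays.asList(words));
        tokens.removeAll(STOP_WORDS);
        tokens.remove("");
        return tokens;
    }

    // Returns the term with the highest Jaccard overlap, or null if nothing overlaps.
    String bestMatch(String input) {
        Set<String> inputTokens = tokenize(input);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> entry : termTokens.entrySet()) {
            Set<String> shared = new HashSet<>(entry.getValue());
            shared.retainAll(inputTokens);
            Set<String> union = new HashSet<>(entry.getValue());
            union.addAll(inputTokens);
            double score = union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TermMatcher matcher = new TermMatcher(
                Arrays.asList("beef fillet", "fillet of tuna", "chicken breast"));
        System.out.println(matcher.bestMatch("fillet of beef"));
    }
}
```

With these assumptions, "fillet of beef" and "beef fillet" reduce to the same word set, so they score higher than "fillet of tuna". Per-word weights would replace the sets with a HashMap from word to weight, and misspelled variants could be added to each set during initialization.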
Lucene does support fuzzy search based on Levenshtein distance.
https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches
But Lucene is meant for searching a set of documents rather than for matching individual strings, so it might be overkill for you. Other Java implementations are available. Take a look at http://www.merriampark.com/ldjava.htm
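For a small in-memory list, character-level Levenshtein distance needs no library at all. The following is a minimal sketch of the standard dynamic-programming algorithm, not the code from the linked page; the class name Levenshtein is chosen here for illustration.

```java
// Minimal character-level Levenshtein distance (insert, delete, substitute each cost 1).
final class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("fillet of beef", "fillet of tuna"));
        System.out.println(distance("fillet of beef", "beef fillet"));
    }
}
```

Running it on the strings from the question reproduces the problem described there: "fillet of tuna" is only four substitutions away from "fillet of beef", while "beef fillet" is considerably further.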
It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".
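One way to read this suggestion is sketched below: the edit distance runs over lists of words, and two words count as equal when their character-level distance is at most a small threshold. The class name WordLevenshtein and the threshold of 1 are assumptions for the example, not part of the answer.

```java
import java.util.Locale;

// Hypothetical sketch: Levenshtein over words, with fuzzy word equality.
final class WordLevenshtein {
    // Assumed tolerance: words within this many character edits count as the same word.
    private static final int MAX_TYPOS = 1;

    // Generic character-level distance between two strings.
    static int charDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Word-level distance: each inserted, deleted, or replaced word costs 1.
    static int wordDistance(String left, String right) {
        String[] a = left.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] b = right.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                // Words are treated as equal if they differ by at most MAX_TYPOS characters.
                int cost = charDistance(a[i - 1], b[j - 1]) <= MAX_TYPOS ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        System.out.println(wordDistance("filet of beef", "fillet of beef"));
        System.out.println(wordDistance("filet of beef", "beef fillet"));
    }
}
```

Under these assumptions "filet" and "fillet" are treated as the same word, so "filet of beef" is at distance 0 from "fillet of beef". The word-level distance is still sensitive to word order, though, so a reordered phrase such as "beef fillet" continues to cost several word edits; combining the fuzzy word comparison with an order-independent score may work better for that case.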