Fuzzy string search in Java, including word swaps

Posted on 2024-10-31 03:38:18

I am a Java beginner, trying to write a program that matches an input against a list of predefined strings. I have looked at Levenshtein distance, but I have run into problems such as this:

If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.

Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?

Thanks!

Comments (3)

朦胧时间 2024-11-07 03:38:18

You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).

The basic process is this:

  • Initialization: tokenize your search terms and store them in a series of HashSets, one per term. Or, if you want to give a different weight to each word, use a HashMap keyed by word.
  • Processing: tokenize each input string, and probe each set of search terms to determine how closely it applies to the input. See the article mentioned above for a description of the algorithms.
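
The two steps above can be sketched roughly as follows. This is only an illustration, not what Lucene does internally: the class name `TokenOverlapMatcher`, the small stop-word list, and the choice of Jaccard similarity (shared words divided by all distinct words) as the relevance score are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: scores inputs against predefined terms by word overlap.
class TokenOverlapMatcher {
    // Assumed stop-word list; words such as "of" carry no meaning for matching.
    private static final Set<String> STOP_WORDS = Set.of("of", "the", "a", "an");

    // Initialization result: one token set per predefined search term.
    private final Map<String, Set<String>> termTokens = new LinkedHashMap<>();

    TokenOverlapMatcher(List<String> searchTerms) {
        for (String term : searchTerms) {
            termTokens.put(term, tokenize(term));
        }
    }

    // Lower-cases the text, splits on non-alphanumerics, drops stop words.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Jaccard similarity: size of the intersection over size of the union.
    static double score(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / union.size();
    }

    // Processing: returns the best-scoring term, or null if nothing overlaps.
    String bestMatch(String input) {
        Set<String> inputTokens = tokenize(input);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> e : termTokens.entrySet()) {
            double s = score(inputTokens, e.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TokenOverlapMatcher m =
                new TokenOverlapMatcher(List.of("beef fillet", "fillet of tuna"));
        System.out.println(m.bestMatch("fillet of beef")); // prints "beef fillet"
    }
}
```

Because the tokens live in sets, word order is ignored, which is why "fillet of beef" scores 1.0 against "beef fillet" and only 1/3 against "fillet of tuna".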

There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).
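
A possible Java rendering of that trick is sketched below. It follows the single-edit idea from Norvig's post (deletes, transposes, replaces, inserts), but the class name `MisspellingIndex`, the lower-case a-z alphabet, and the variant-to-word map are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical index mapping one-edit misspellings back to a known word.
class MisspellingIndex {
    // Assumption: search terms use lower-case ASCII letters only.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    private final Map<String, String> variantToWord = new HashMap<>();

    MisspellingIndex(List<String> words) {
        for (String w : words) {
            for (String v : edits1(w)) {
                variantToWord.putIfAbsent(v, w);
            }
        }
        // Correctly spelled words always map to themselves.
        for (String w : words) {
            variantToWord.put(w, w);
        }
    }

    // All strings that are one delete, transpose, replace, or insert away.
    static Set<String> edits1(String word) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty()) {
                out.add(left + right.substring(1)); // delete
            }
            if (right.length() > 1) {
                out.add(left + right.charAt(1) + right.charAt(0)
                        + right.substring(2)); // transpose
            }
            for (char c : ALPHABET.toCharArray()) {
                if (!right.isEmpty()) {
                    out.add(left + c + right.substring(1)); // replace
                }
                out.add(left + c + right); // insert
            }
        }
        return out;
    }

    // Returns the known word for a token, or null if it is not recognized.
    String canonical(String token) {
        return variantToWord.get(token);
    }

    public static void main(String[] args) {
        MisspellingIndex idx = new MisspellingIndex(List.of("fillet", "beef"));
        System.out.println(idx.canonical("filet")); // prints "fillet"
    }
}
```

Each input token can be passed through `canonical` before it is compared with the search-term sets, so the lookup at query time stays a plain hash probe.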

哥,最终变帅啦 2024-11-07 03:38:18

Lucene does support fuzzy search based on Levenshtein distance.

https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches

But Lucene is meant for searching a set of documents rather than matching individual strings, so it might be overkill for you. There are other Java implementations available. Take a look at http://www.merriampark.com/ldjava.htm
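
For reference, a standalone character-level Levenshtein distance takes only a few lines of plain Java. The sketch below is a generic two-row dynamic-programming version written for this thread, not the code from the linked page; the class name `Levenshtein` is made up.

```java
// Hypothetical utility class: classic Levenshtein distance with two DP rows.
class Levenshtein {
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,  // insertion
                                 prev[j] + 1),     // deletion
                        prev[j - 1] + cost);       // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("fillet of beef", "fillet of tuna")); // prints 4
    }
}
```

Run on the strings from the question, it reproduces the problem described there: "fillet of tuna" is only 4 edits from "fillet of beef", while "beef fillet" is further away.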

所有深爱都是秘密 2024-11-07 03:38:18

It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".
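
One way to read this suggestion is sketched below: the outer distance runs over words, and two words count as equal when their character-level distance is within a small threshold. The class name `WordLevenshtein` and the threshold `MAX_CHAR_EDITS = 1` are assumptions for the example, and a threshold this loose can also equate unrelated short words.

```java
import java.util.Locale;

// Hypothetical sketch: Levenshtein over words, with fuzzy word equality.
class WordLevenshtein {
    // Assumed tolerance: words within one character edit count as the same.
    static final int MAX_CHAR_EDITS = 1;

    // Character-level Levenshtein distance (full DP table for clarity).
    static int charDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    static boolean sameWord(String a, String b) {
        return charDistance(a, b) <= MAX_CHAR_EDITS;
    }

    // Splits on whitespace; an empty or blank string has no words.
    static String[] words(String s) {
        String t = s.toLowerCase(Locale.ROOT).trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Word-level Levenshtein: each insert, delete, or substitute is one word.
    static int wordDistance(String s, String t) {
        String[] a = words(s);
        String[] b = words(t);
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = sameWord(a[i - 1], b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        System.out.println(wordDistance("filet of beef", "fillet of beef")); // prints 0
    }
}
```

Note that this variant is still order-sensitive: "filet of beef" is 3 word edits from "beef fillet" but only 1 from "fillet of tuna", so it handles the "filet"/"fillet" spelling case rather than the word swap itself. Comparing the words as unordered sets is what removes the ordering effect.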
