String matching algorithms used by Lucene
I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores all the words occurring in the text as-is, along with their frequency of occurrence in each document.
But as far as I know, for efficient string matching it would need to preprocess the words occurring in the documents.
Example:
Search for "iamrohitbanga is a user of stackoverflow" (use fuzzy matching)
in some documents.
It is possible that there is a document containing the string "rohit banga". To find that the substrings "rohit" and "banga" are present in the search string, it would use some efficient substring matching.
I want to know which algorithm it is, and also, if it does some preprocessing, which function call in the Java API triggers it.
3 Answers
As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).
In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.
The way this works is that the Lucene term dictionary itself is really a form of a suffix tree. You can see this in a few places in the file formats that you mentioned:
The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
So at a low level it is a suffix tree, but at a higher level we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):
Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
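For illustration, here is a minimal sketch of the inexact query types this answer mentions (assuming Lucene 4+ constructors; the field name "body" and the terms are invented for the example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.WildcardQuery;

public class DfaQueriesSketch {
    public static void main(String[] args) {
        // Each of these queries is compiled into a DFA; matching terms are
        // found by intersecting that DFA with the sorted term dictionary
        // rather than by scanning every term.
        Query fuzzy = new FuzzyQuery(new Term("body", "rohit"), 1);   // Levenshtein DFA, max edit distance 1
        Query wildcard = new WildcardQuery(new Term("body", "roh*")); // wildcard DFA
        Query regexp = new RegexpQuery(new Term("body", "ro[a-z]+")); // regexp DFA
        System.out.println(fuzzy + " | " + wildcard + " | " + regexp);
    }
}
```

Note that the two-argument FuzzyQuery constructor taking a maximum edit distance is the Lucene 4+ form; older releases used a float similarity instead.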
The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens, and during this process it may collate equivalent strings (e.g., folding upper and lower case, stemming, removing diacritics, etc.).
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
However, FuzzyQuery and WildcardQuery use something similar: they first search for matching terms and then use them for the full match. Please see Robert Muir's blog post about AutomatonQuery for a new, efficient approach to this problem.
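As a rough sketch of this normalization step (assuming a recent Lucene release where StandardAnalyzer has a no-argument constructor; the field name "body" is arbitrary), the same analysis is applied to text at index time and to query terms at query time:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // StandardAnalyzer splits on word boundaries and lowercases,
        // so "Rohit Banga" and "rohit banga" index to the same terms.
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("body", "Rohit Banga");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints: rohit, banga
        }
        ts.end();
        ts.close();
    }
}
```

Because both sides of the comparison pass through the same analyzer, the query becomes an exact lookup in the term dictionary.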
As you pointed out, Lucene stores only the list of terms that occurred in documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks up words separated by spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.
Please look at the Lucene API docs for the TokenFilter class.
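As a starting point, here is a minimal, hypothetical TokenFilter skeleton (SubwordFilter is an invented name, not a Lucene class) showing where such splitting logic would live:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SubwordFilter extends TokenFilter {
    // Holds the text of the current token; a splitting filter would read
    // and rewrite this attribute.
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public SubwordFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false; // no more tokens from upstream
        }
        // Pass-through for now; real splitting logic would inspect
        // termAtt.buffer()/termAtt.length() here and emit the extra
        // sub-tokens (e.g. "rohit", "banga") on subsequent calls,
        // typically using captureState()/restoreState().
        return true;
    }
}
```

You would then wire this filter into a custom Analyzer after the tokenizer, so the extra tokens get indexed alongside the original word.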