String matching algorithms used by Lucene

Posted on 2024-08-20 05:44:31


I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores all words occurring in the text as-is, along with their frequency of occurrence in each document.
But as far as I know, for efficient string matching it would need to preprocess the words occurring in the documents.

Example: search for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching) in some documents.

It is possible that there is a document containing the string "rohit banga". To find that the substrings "rohit" and "banga" are present in the search string, it would use some efficient substring matching.

I want to know which algorithm it is. Also, if it does some preprocessing, which function call in the Java API triggers it?
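To make the question concrete, this is roughly how I would issue such a fuzzy query through the Java API (a minimal sketch assuming a recent Lucene release; the index path and field name are placeholders):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class FuzzySearchSketch {
    public static void main(String[] args) throws Exception {
        // Open an existing index (the path is a placeholder).
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Fuzzy match: terms within 2 edits of "rohit" in the "body" field.
            FuzzyQuery query = new FuzzyQuery(new Term("body", "rohit"), 2);
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}
```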


Comments (3)

苏辞 2024-08-27 05:44:31


As Yuval explained, in general Lucene is geared toward exact matches (by normalizing terms with analyzers at both index and query time).

In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.

The way this works is that the Lucene term dictionary itself is really a form of a suffix tree. You can see this in a few places in the file formats you mentioned:

Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
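As a sketch of that delta encoding (illustrative only, not Lucene's writer code):

```java
/** Sketch of the prefix-delta encoding described above. */
public class PrefixDeltaSketch {
    static void encode(String previous, String current) {
        // Length of the prefix shared with the previous term.
        int p = 0;
        int max = Math.min(previous.length(), current.length());
        while (p < max && previous.charAt(p) == current.charAt(p)) {
            p++;
        }
        System.out.println("PrefixLength=" + p + ", suffix=\"" + current.substring(p) + "\"");
    }

    public static void main(String[] args) {
        encode("bone", "boy"); // prints: PrefixLength=2, suffix="y"
    }
}
```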

The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
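A sketch of that interval-based indexing (a hypothetical helper, not Lucene's actual term index reader): binary-search the small in-memory index, then scan at most one interval of the full sorted term list.

```java
import java.util.Collections;
import java.util.List;

/** Sketch: locate a term via an index of every 128th term, then a short scan. */
public class TermIndexSketch {
    static final int INTERVAL = 128;

    /** Assumes indexTerms.get(k) == allTerms.get(k * INTERVAL); both lists sorted. */
    static int seek(List<String> allTerms, List<String> indexTerms, String target) {
        int block = Collections.binarySearch(indexTerms, target);
        if (block < 0) {
            block = -block - 2; // last indexed term <= target
        }
        int start = Math.max(block, 0) * INTERVAL;
        int end = Math.min(start + INTERVAL, allTerms.size());
        // Linear scan within a single block of at most INTERVAL terms.
        for (int i = start; i < end; i++) {
            if (allTerms.get(i).compareTo(target) >= 0) {
                return i; // position of the first term >= target
            }
        }
        return end;
    }
}
```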

So at a low level it is a suffix tree, but at a higher level we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite automaton (DFA):

Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.

Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
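As a toy illustration of "matching is simply DFA intersection", the sketch below hand-builds the automaton for the wildcard pattern "bo*" and runs it over a tiny sorted term dictionary. Note this naive version tests every term; Lucene instead uses the automaton to seek through the term dictionary.

```java
import java.util.List;

/** Naive sketch of intersecting a query DFA with the term dictionary. */
public class DfaIntersectSketch {
    // Hand-built DFA for "bo*": 0 -b-> 1 -o-> 2; state 2 accepts and loops on any char.
    static boolean accepts(String term) {
        int state = 0;
        for (char c : term.toCharArray()) {
            switch (state) {
                case 0: state = (c == 'b') ? 1 : -1; break;
                case 1: state = (c == 'o') ? 2 : -1; break;
                case 2: break; // stay in the accept state
            }
            if (state < 0) return false; // dead state: no match possible
        }
        return state == 2;
    }

    public static void main(String[] args) {
        List<String> termDict = List.of("bone", "book", "boy", "cat"); // sorted dictionary
        termDict.stream().filter(DfaIntersectSketch::accepts).forEach(System.out::println);
        // prints: bone, book, boy
    }
}
```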

野稚 2024-08-27 05:44:31


The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens, and during this process it may collate equivalent strings (e.g. fold upper and lower case, stem words, remove diacritics, etc.).
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
However, FuzzyQuery and WildcardQuery use something similar: first searching for matching terms, then using them for the full match. Please see Robert Muir's blog post about AutomatonQuery for a new, efficient approach to this problem.
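For example, a minimal sketch of running an analyzer by hand to see the tokens it produces (whether stop words such as "is" are removed depends on the Lucene version and the stop-word set configured):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Tokenize a sample string; StandardAnalyzer also lower-cases tokens.
        try (TokenStream ts = analyzer.tokenStream(
                "body", "iamrohitbanga is a User of StackOverflow")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}
```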

饭团 2024-08-27 05:44:31


As you pointed out, Lucene stores only the list of terms that occurred in documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks words at spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.

Please look at the Lucene API docs for the TokenFilter class.
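As an illustration, here is a minimal custom TokenFilter along those lines. The class name and the dictionary-based splitting are hypothetical, and offsets and other attributes are ignored for brevity:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Hypothetical filter: after emitting each original token, also emits any
 * dictionary words found inside it. For the dictionary {"i","am","rohit","banga"},
 * the token "iamrohitbanga" yields the 5 tokens from the example above.
 */
public final class DictionarySubwordFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
    private final Set<String> dictionary;
    private final Deque<String> pending = new ArrayDeque<>();

    public DictionarySubwordFilter(TokenStream input, Set<String> dictionary) {
        super(input);
        this.dictionary = dictionary;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(0); // stack at the original token's position
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        // Greedy left-to-right scan for dictionary words inside the token.
        String token = termAtt.toString();
        int i = 0;
        while (i < token.length()) {
            int j = token.length();
            while (j > i && !dictionary.contains(token.substring(i, j))) {
                j--;
            }
            if (j > i) {
                pending.add(token.substring(i, j));
                i = j;
            } else {
                i++;
            }
        }
        return true; // emit the original token first
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}
```

Sub-tokens are emitted with a position increment of 0, so they are stacked at the same position as the original token, which is the usual convention for synonym-like expansions.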
