How does a search index work when querying multiple words?
I'm trying to build my own search engine for experimenting.
I know about inverted indexes: when indexing, the key is a word, and its value is a list of the IDs of documents containing that word. So when you search for a single word, you get the matching documents right away.
How does it work for multiple words? Do you get all the documents for every word and then traverse those documents to see which ones contain both words? I feel that is not how it's done.
Does anyone know the real answer, without speculating?
5 Answers
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is iterate through all matches for B and search for each of them in the list for A. If the lists can indeed be searched in logarithmic time, this means you need O(occ(B) * log occ(A)) computational steps to identify all matches that contain both terms.
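To make this concrete, here is a minimal Python sketch of the idea, assuming plain uncompressed sorted posting lists and binary search via the standard bisect module (a real engine would keep the lists compressed and use skip structures):

```python
from bisect import bisect_left

def intersect_postings(a, b):
    """Intersect two sorted posting lists of document ids.

    Iterates over the shorter list and binary-searches the longer one,
    giving roughly O(occ(B) * log occ(A)) steps when occ(A) > occ(B).
    """
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer list
    result = []
    for doc_id in b:
        i = bisect_left(a, doc_id)        # binary search in the long list
        if i < len(a) and a[i] == doc_id:
            result.append(doc_id)
    return result

# AND query: documents containing both (hypothetical) terms
postings = {
    "apple":  [1, 3, 4, 7, 9, 12, 15],   # occ(apple) = 7
    "banana": [3, 9, 20],                # occ(banana) = 3
}
print(intersect_postings(postings["apple"], postings["banana"]))  # [3, 9]
```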
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too, although a QueryParser might be easier to use.
See this answer, for example, on how to split the string using the same analyzer you used for indexing:
No hits when searching for "mvc2" with lucene.net
An inverted index is very efficient for computing intersections, using a zig-zag algorithm:
Assume your terms are a list T. The algorithm assumes an efficient getFirstAfter() operation, which gives you the first document that matches the term and whose docId is greater than the specified parameter; it should return infinity if there is none. The algorithm is most efficient if the terms are sorted so that the rarest term comes first.
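A minimal Python sketch of the zig-zag idea follows. It assumes plain sorted in-memory lists and uses a "first doc id greater than or equal to the target" variant of getFirstAfter, implemented with binary search (real engines use skip pointers or similar on-disk structures):

```python
from bisect import bisect_left

INF = float("inf")

def get_first_after(postings, doc_id):
    """First doc id in the sorted list `postings` that is >= doc_id,
    or infinity if there is none (binary search, O(log n))."""
    i = bisect_left(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(lists):
    """Zig-zag intersection of several sorted posting lists.
    Most efficient when the rarest term's list comes first."""
    lists = sorted(lists, key=len)  # rarest term first
    results = []
    candidate = 0
    while candidate != INF:
        candidate = get_first_after(lists[0], candidate)
        if candidate == INF:
            break
        # check the remaining lists; if one skips past the candidate,
        # restart the scan from that larger doc id
        for postings in lists[1:]:
            nxt = get_first_after(postings, candidate)
            if nxt != candidate:
                candidate = nxt
                break
        else:
            results.append(candidate)  # every list contains candidate
            candidate += 1
    return results

print(zigzag_intersect([[1, 3, 4, 7, 9, 12, 15], [3, 9, 20]]))  # [3, 9]
```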
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer. Note: though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
You need to store the positions of each word's occurrences in a document in the index file.
Your index file structure should be like this:
word id - doc id - number of hits - positions of hits
Now suppose the query contains 4 words, "w1 w2 w3 w4". Choose the documents containing most of those words, then calculate the words' relative distance within each document. A document where most of the query words occur and their relative distance is minimal will get high priority in the search results.
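A sketch of this kind of proximity-based ranking in Python; the scoring rule here (using each matched word's first occurrence to approximate the words' relative distance) is an illustrative simplification, not the author's exact method:

```python
def score(query_words, doc_positions):
    """Score one document for a multi-word query.

    doc_positions maps each word to the sorted list of positions where it
    occurs in this document. Documents matching more query words score
    higher; ties are broken by a smaller span between the matched words'
    first occurrences (a crude stand-in for minimal relative distance).
    """
    matched = [w for w in query_words if doc_positions.get(w)]
    if not matched:
        return (0, 0)
    firsts = [doc_positions[w][0] for w in matched]
    span = max(firsts) - min(firsts)
    return (len(matched), -span)  # more words first, then smaller span

# Hypothetical documents: word -> positions of hits
query = ["w1", "w2", "w3", "w4"]
doc_a = {"w1": [2], "w2": [3], "w3": [5], "w4": [100]}   # 4 words matched
doc_b = {"w1": [10], "w2": [11], "w3": [12]}             # 3 words matched
print(score(query, doc_a) > score(query, doc_b))  # True
```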
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine
For more info, read this paper by the Google founders: click here
You find the intersection of the document sets, as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.