How does a search index work when querying multiple words?
I'm trying to build my own search engine for experimenting.
I know about inverted indexes: when indexing, the key is a word, and its value is a list of the IDs of documents containing that word. So when you search for a single word, you get the matching documents right away.
How does it work for multiple words? Do you get all the documents for every word and then traverse those documents to see which ones contain both words? I feel that is not how it's done.
Does anyone know the real answer, without speculating?
5 Answers
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is iterate through all matches for B and search for each of them in the list for A. If the lists can indeed be searched in logarithmic time, this means you need O(occ(B) * log occ(A)) computational steps to identify all matches that contain both terms.
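To make this concrete, here is a minimal Python sketch of the idea, assuming plain uncompressed sorted posting lists and binary search via the standard bisect module (a real engine would keep the lists compressed and use skip structures):

```python
from bisect import bisect_left

def intersect_postings(a, b):
    """Intersect two sorted posting lists of document ids.

    Iterates over the shorter list and binary-searches the longer one,
    giving roughly O(occ(B) * log occ(A)) steps when occ(A) > occ(B).
    """
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer list
    result = []
    for doc_id in b:
        i = bisect_left(a, doc_id)        # binary search in the long list
        if i < len(a) and a[i] == doc_id:
            result.append(doc_id)
    return result

# AND query: documents containing both (hypothetical) terms
postings = {
    "apple":  [1, 3, 4, 7, 9, 12, 15],   # occ(apple) = 7
    "banana": [3, 9, 20],                # occ(banana) = 3
}
print(intersect_postings(postings["apple"], postings["banana"]))  # [3, 9]
```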
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too, although a QueryParser might be easier to use.
See this answer, for example, on how to split the string using the same analyzer you used for indexing:
No hits when searching for "mvc2" with lucene.net
An inverted index is very efficient for computing intersections, using a zig-zag algorithm:
Assume your terms are a list T. The algorithm assumes an efficient getFirstAfter() operation, which gives you the first document that matches the term and whose docId is greater than the specified parameter; it should return infinity if there is none. The algorithm is most efficient if the terms are sorted so that the rarest term comes first.
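A minimal Python sketch of the zig-zag idea follows. It assumes plain sorted in-memory lists and uses a "first doc id greater than or equal to the target" variant of getFirstAfter, implemented with binary search (real engines use skip pointers or similar on-disk structures):

```python
from bisect import bisect_left

INF = float("inf")

def get_first_after(postings, doc_id):
    """First doc id in the sorted list `postings` that is >= doc_id,
    or infinity if there is none (binary search, O(log n))."""
    i = bisect_left(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(lists):
    """Zig-zag intersection of several sorted posting lists.
    Most efficient when the rarest term's list comes first."""
    lists = sorted(lists, key=len)  # rarest term first
    results = []
    candidate = 0
    while candidate != INF:
        candidate = get_first_after(lists[0], candidate)
        if candidate == INF:
            break
        # check the remaining lists; if one skips past the candidate,
        # restart the scan from that larger doc id
        for postings in lists[1:]:
            nxt = get_first_after(postings, candidate)
            if nxt != candidate:
                candidate = nxt
                break
        else:
            results.append(candidate)  # every list contains candidate
            candidate += 1
    return results

print(zigzag_intersect([[1, 3, 4, 7, 9, 12, 15], [3, 9, 20]]))  # [3, 9]
```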
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer. Note: though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
You need to store the positions of each word's occurrences in a document in the index file.
Your index file structure should be like this:
word id - doc id - number of hits - positions of hits
Now suppose the query contains 4 words, "w1 w2 w3 w4". Choose the documents containing most of those words, then calculate the words' relative distance within each document. A document where most of the query words occur and their relative distance is minimal will get high priority in the search results.
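A sketch of this kind of proximity-based ranking in Python; the scoring rule here (using each matched word's first occurrence to approximate the words' relative distance) is an illustrative simplification, not the author's exact method:

```python
def score(query_words, doc_positions):
    """Score one document for a multi-word query.

    doc_positions maps each word to the sorted list of positions where it
    occurs in this document. Documents matching more query words score
    higher; ties are broken by a smaller span between the matched words'
    first occurrences (a crude stand-in for minimal relative distance).
    """
    matched = [w for w in query_words if doc_positions.get(w)]
    if not matched:
        return (0, 0)
    firsts = [doc_positions[w][0] for w in matched]
    span = max(firsts) - min(firsts)
    return (len(matched), -span)  # more words first, then smaller span

# Hypothetical documents: word -> positions of hits
query = ["w1", "w2", "w3", "w4"]
doc_a = {"w1": [2], "w2": [3], "w3": [5], "w4": [100]}   # 4 words matched
doc_b = {"w1": [10], "w2": [11], "w3": [12]}             # 3 words matched
print(score(query, doc_a) > score(query, doc_b))  # True
```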
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine
For more info, read this paper by the Google founders: click here
You find the intersection of the document sets, as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.