Using an index for multi-word queries in full-text search (e.g. web search)
I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
- set A to be the set of documents which have word 1 (by searching the index).
- set B to be the set of documents which have word 2 (ditto).
- compute the intersection of A and B.
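A minimal sketch of that three-step approach, assuming a toy inverted index stored as a Python dict from word to a list of (doc_id, rank) postings (all names here are illustrative, not from any real engine):

```python
# Toy inverted index: word -> list of (doc_id, rank) postings.
# This literally follows the three steps above: look up both posting
# lists, intersect them, then rank the survivors.

def naive_two_word_query(index, word1, word2, k=10):
    ranks1 = dict(index.get(word1, []))   # step 1: docs containing word 1
    ranks2 = dict(index.get(word2, []))   # step 2: docs containing word 2

    both = ranks1.keys() & ranks2.keys()  # step 3: full intersection
    # Rank the intersection (here simply by summed per-word ranks) and keep the top k.
    return sorted(both, key=lambda d: ranks1[d] + ranks2[d], reverse=True)[:k]

index = {
    "full":   [(385, 0.9), (211, 0.7)],
    "search": [(211, 0.8), (39977, 0.6)],
}
print(naive_two_word_query(index, "full", "search"))   # -> [211]
```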
Now, step three probably takes O(n log n) time to perform. For very large A and B that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
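Under the same toy-index assumptions as above, that optimization might look roughly like this: walk the smaller posting list (already sorted by rank, descending) and stop as soon as k documents are also found in the other word's list.

```python
def first_k_matches(index, word1, word2, k=10):
    # Iterate over the shorter posting list, probe the longer one.
    postings1 = index.get(word1, [])
    postings2 = index.get(word2, [])
    small, large = sorted((postings1, postings2), key=len)

    large_ids = {doc_id for doc_id, _ in large}   # O(1) membership tests
    results = []
    for doc_id, _ in small:                       # best-ranked docs first
        if doc_id in large_ids:
            results.append(doc_id)
            if len(results) == k:                 # stop as soon as we have enough
                break
    return results
```

In the worst case described next, though, the loop still scans the whole smaller list, which is exactly the problem the question raises.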
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to go linearly through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
4 Answers
Regarding some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending): I don't think the search engine does it this way. The doc list should be sorted by doc ID, with each doc carrying a rank for that word.
When a query comes in, it contains several keywords. For each word, you look up its doc list. Across all keywords, you perform a merge operation and compute the relevance of each doc to the query. Finally, you return the top-ranked docs to the user.
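A minimal sketch of what this answer describes, assuming posting lists sorted by doc ID with a per-word rank attached (illustrative names only):

```python
def merge_intersect(postings1, postings2):
    """Linear merge of two posting lists sorted by doc ID.

    Each posting is (doc_id, score); docs present in both lists get their scores summed.
    """
    i = j = 0
    out = []
    while i < len(postings1) and j < len(postings2):
        d1, s1 = postings1[i]
        d2, s2 = postings2[j]
        if d1 == d2:
            out.append((d1, s1 + s2))   # combine per-word relevance
            i += 1
            j += 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return out

def query(index, words, k=10):
    lists = [sorted(index[w]) for w in words if w in index]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = merge_intersect(result, other)   # output stays sorted by doc ID
    # Finally rank by the accumulated relevance and return the top k.
    return sorted(result, key=lambda p: p[1], reverse=True)[:k]
```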
And the query process can be distributed to gain better performance.
Even without ranking, I wonder how the intersection of two sets is computed so fast by Google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in Chinese and 極端な ("extreme") in Japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" About 7,590,000 results (0.25 seconds)
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding the Russian word "игра" (game):
Search for игра: About 212,000,000 results (0.37 seconds)
Search for all of them: "игра concrete 位置 極端な" returns "About 12,600 results (0.33 seconds)"
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Given also that the results are fuzzy, it seems evident that Google uses some optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
大多数系统都以某种方式实现 TF-IDF。 TF-IDF 是函数词频和逆文档频率的乘积。
IDF 函数将文档频率与集合中的文档总数相关联。对于此函数的普遍直觉是,它应该为出现在少数文档中的术语提供较高的值,并为出现在所有文档中的术语提供较低的值,从而使它们不相关。
您提到了 Google,但 Google 通过 PageRank(链接输入/输出)以及术语频率和邻近度来优化搜索。 Google 分发数据并使用 Map/Reduce 并行操作 - 计算 PageRank+TF-IDF。
信息检索:实现搜索引擎第 2 章对此背后的理论有很好的解释。进一步研究的想法也是看看 Solr 是如何实现这一点的。
Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.
The IDF function relates a term's document frequency to the total number of documents in the collection. The common intuition for this function is that it should give a higher value for terms that appear in few documents and a lower value for terms that appear in all documents, making the latter irrelevant.
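For concreteness, a toy version of the usual textbook weighting, tf-idf(t, d) = tf(t, d) * log(N / df(t)) (one of several common variants; the names below are illustrative):

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)              # how often the term occurs in this doc
    df = sum(term in d for d in all_docs)    # how many docs contain the term at all
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(all_docs) / df)       # rare terms -> large idf, ubiquitous terms -> near 0
    return tf * idf

docs = [["full", "text", "search"], ["search", "engine"], ["inverted", "index"]]
print(tf_idf("search", docs[0], docs))  # common term, lower weight (~0.41)
print(tf_idf("index", docs[2], docs))   # rarer term, higher weight (~1.10)
```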
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in chapter 2 of Information Retrieval: Implementing Search Engines. Another idea worth investigating further is to look at how Solr implements this.
谷歌并不需要真正找到所有结果,只需要找到最上面的结果。
索引可以先按年级排序,然后再按 id 排序。由于相同的 ID 始终具有相同的等级,因此不会影响集合交叉时间。
因此,谷歌开始交集,直到找到 10 个结果,然后进行统计估计,告诉您它还发现了多少个结果。
最坏的情况几乎是不可能的。
如果所有单词都是“常见”,那么交集将非常快地给出前 10 个结果。如果存在罕见单词,则交集速度很快,因为复杂度为 O(N long M),其中 N 是最小组。
您需要记住,谷歌将其索引保存在内存中并使用并行计算。例如,您可以将问题分成两个搜索,每个搜索仅搜索一半的网络,然后对结果进行合并并取最好的。 Google 拥有数百万台计算机
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by ID. Since the same ID always has the same grade, this does not hurt set-intersection time.
So Google runs the intersection only until it finds 10 results, and then does a statistical estimation to tell you how many more results there are.
The worst case is almost impossible.
If all words are "common" then the intersection will give the first 10 results very fast. If there is a rare word, then the intersection is fast because the complexity is O(N log M), where N is the size of the smallest group.
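The O(N log M) figure presumably comes from probing each doc ID of the smaller sorted list against the larger one with binary search; a minimal sketch of that idea (real engines use skip pointers or galloping search, but the principle is the same):

```python
import bisect

def small_into_large_intersect(small_ids, large_ids):
    """Intersect two ascending lists of doc IDs in O(N log M) time."""
    out = []
    for doc_id in small_ids:                           # N iterations
        pos = bisect.bisect_left(large_ids, doc_id)    # O(log M) probe into the big list
        if pos < len(large_ids) and large_ids[pos] == doc_id:
            out.append(doc_id)
    return out
```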
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.