Using an index for multi-word queries in full-text search (e.g. web search)
I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
- set A to be the set of documents which have word 1 (by searching the index).
- set B to be the set of documents which have word 2 (ditto).
- compute the intersection of A and B.
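A minimal sketch of that three-step approach, assuming a toy inverted index stored as a Python dict from word to a list of (doc_id, rank) postings (all names here are illustrative, not from any real engine):

```python
# Toy inverted index: word -> list of (doc_id, rank) postings.
# This literally follows the three steps above: look up both posting
# lists, intersect them, then rank the survivors.

def naive_two_word_query(index, word1, word2, k=10):
    ranks1 = dict(index.get(word1, []))   # step 1: docs containing word 1
    ranks2 = dict(index.get(word2, []))   # step 2: docs containing word 2

    both = ranks1.keys() & ranks2.keys()  # step 3: full intersection
    # Rank the intersection (here simply by summed per-word ranks) and keep the top k.
    return sorted(both, key=lambda d: ranks1[d] + ranks2[d], reverse=True)[:k]

index = {
    "full":   [(385, 0.9), (211, 0.7)],
    "search": [(211, 0.8), (39977, 0.6)],
}
print(naive_two_word_query(index, "full", "search"))   # -> [211]
```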
Now, step three probably takes O(n log n) time to perform. For very large A and B that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
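Under the same toy-index assumptions as above, that optimization might look roughly like this: walk the smaller posting list (already sorted by rank, descending) and stop as soon as k documents are also found in the other word's list.

```python
def first_k_matches(index, word1, word2, k=10):
    # Iterate over the shorter posting list, probe the longer one.
    postings1 = index.get(word1, [])
    postings2 = index.get(word2, [])
    small, large = sorted((postings1, postings2), key=len)

    large_ids = {doc_id for doc_id, _ in large}   # O(1) membership tests
    results = []
    for doc_id, _ in small:                       # best-ranked docs first
        if doc_id in large_ids:
            results.append(doc_id)
            if len(results) == k:                 # stop as soon as we have enough
                break
    return results
```

In the worst case described next, though, the loop still scans the whole smaller list, which is exactly the problem the question raises.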
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to go linearly through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
4 Answers
Regarding some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending): I don't think the search engine does it this way. The doc list should be sorted by doc ID, with each doc carrying a rank for that word.
When a query comes in, it contains several keywords. For each word, you look up its doc list. Across all keywords, you perform a merge operation and compute the relevance of each doc to the query. Finally, you return the top-ranked docs to the user.
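A minimal sketch of what this answer describes, assuming posting lists sorted by doc ID with a per-word rank attached (illustrative names only):

```python
def merge_intersect(postings1, postings2):
    """Linear merge of two posting lists sorted by doc ID.

    Each posting is (doc_id, score); docs present in both lists get their scores summed.
    """
    i = j = 0
    out = []
    while i < len(postings1) and j < len(postings2):
        d1, s1 = postings1[i]
        d2, s2 = postings2[j]
        if d1 == d2:
            out.append((d1, s1 + s2))   # combine per-word relevance
            i += 1
            j += 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return out

def query(index, words, k=10):
    lists = [sorted(index[w]) for w in words if w in index]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = merge_intersect(result, other)   # output stays sorted by doc ID
    # Finally rank by the accumulated relevance and return the top k.
    return sorted(result, key=lambda p: p[1], reverse=True)[:k]
```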
And the query process can be distributed to gain better performance.
Even without ranking, I wonder how the intersection of two sets is computed so fast by Google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in Chinese and 極端な ("extreme") in Japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" About 7,590,000 results (0.25 seconds)
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding the Russian word "игра" (game):
Search for игра: About 212,000,000 results (0.37 seconds)
Search for all of them: "игра concrete 位置 極端な" returns "About 12,600 results (0.33 seconds)"
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Given also that the results are fuzzy, it seems evident that Google uses some optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
大多数系统都以某种方式实现 TF-IDF。 TF-IDF 是函数词频和逆文档频率的乘积。
IDF 函数将文档频率与集合中的文档总数相关联。对于此函数的普遍直觉是,它应该为出现在少数文档中的术语提供较高的值,并为出现在所有文档中的术语提供较低的值,从而使它们不相关。
您提到了 Google,但 Google 通过 PageRank(链接输入/输出)以及术语频率和邻近度来优化搜索。 Google 分发数据并使用 Map/Reduce 并行操作 - 计算 PageRank+TF-IDF。
信息检索:实现搜索引擎第 2 章对此背后的理论有很好的解释。进一步研究的想法也是看看 Solr 是如何实现这一点的。
Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.
The IDF function relates a term's document frequency to the total number of documents in the collection. The common intuition for this function is that it should give a higher value for terms that appear in few documents and a lower value for terms that appear in all documents, making the latter irrelevant.
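For concreteness, a toy version of the usual textbook weighting, tf-idf(t, d) = tf(t, d) * log(N / df(t)) (one of several common variants; the names below are illustrative):

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)              # how often the term occurs in this doc
    df = sum(term in d for d in all_docs)    # how many docs contain the term at all
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(all_docs) / df)       # rare terms -> large idf, ubiquitous terms -> near 0
    return tf * idf

docs = [["full", "text", "search"], ["search", "engine"], ["inverted", "index"]]
print(tf_idf("search", docs[0], docs))  # common term, lower weight (~0.41)
print(tf_idf("index", docs[2], docs))   # rarer term, higher weight (~1.10)
```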
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in chapter 2 of Information Retrieval: Implementing Search Engines. Another idea worth investigating further is to look at how Solr implements this.
谷歌并不需要真正找到所有结果,只需要找到最上面的结果。
索引可以先按年级排序,然后再按 id 排序。由于相同的 ID 始终具有相同的等级,因此不会影响集合交叉时间。
因此,谷歌开始交集,直到找到 10 个结果,然后进行统计估计,告诉您它还发现了多少个结果。
最坏的情况几乎是不可能的。
如果所有单词都是“常见”,那么交集将非常快地给出前 10 个结果。如果存在罕见单词,则交集速度很快,因为复杂度为 O(N long M),其中 N 是最小组。
您需要记住,谷歌将其索引保存在内存中并使用并行计算。例如,您可以将问题分成两个搜索,每个搜索仅搜索一半的网络,然后对结果进行合并并取最好的。 Google 拥有数百万台计算机
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by ID. Since the same ID always has the same grade, this does not hurt set-intersection time.
So Google runs the intersection only until it finds 10 results, and then does a statistical estimation to tell you how many more results there are.
The worst case is almost impossible.
If all words are "common" then the intersection will give the first 10 results very fast. If there is a rare word, then the intersection is fast because the complexity is O(N log M), where N is the size of the smallest group.
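The O(N log M) figure presumably comes from probing each doc ID of the smaller sorted list against the larger one with binary search; a minimal sketch of that idea (real engines use skip pointers or galloping search, but the principle is the same):

```python
import bisect

def small_into_large_intersect(small_ids, large_ids):
    """Intersect two ascending lists of doc IDs in O(N log M) time."""
    out = []
    for doc_id in small_ids:                           # N iterations
        pos = bisect.bisect_left(large_ids, doc_id)    # O(log M) probe into the big list
        if pos < len(large_ids) and large_ids[pos] == doc_id:
            out.append(doc_id)
    return out
```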
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.