How do Lucene/Solr achieve high performance in multi-field / faceted search?



Context

This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).

When this is implemented with a relational database, multi-column indices are of little use once there are many facets: facets can be combined in any order, so any particular composite index is unlikely to match the query, and creating an index for every possible ordering of the columns is infeasible.

Solr is advertised as handling faceted search well, which, if I understand correctly, must be connected to Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).

Question

Lucene's inverted index could just as well be stored in a relational database, and intersecting the sets of matching documents can likewise be achieved trivially in an RDBMS using single-column indices.

Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.

So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?

Note: My first guess would be that Lucene uses some space-partitioning method to partition a vector space built from the document fields as dimensions, but as far as I understand, Lucene is not purely vector-space based.


Comments (2)

沉溺在你眼里的海 2024-11-06 15:08:38


Faceting

There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these are faster than an RDBMS.

  1. Enum faceting. The results of a query are a bit vector where the ith bit is 1 if the ith document was a match. Each facet value is also a bit vector, so the intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMSs probably support it. (A minimal sketch of both approaches follows this list.)
  2. Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:

    select facet, count(*) from field_cache
    where docId in query_results
    group by facet

Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
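
To make the two approaches above concrete, here is a minimal, self-contained Java sketch. It is not Lucene's actual API; all class, method, and variable names are invented for illustration. It counts facet matches first by intersecting bit vectors (enum faceting) and then by grouping over a forward docId-to-facet array (field cache):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative sketch of the two faceting strategies above (hypothetical, not Lucene's API). */
    public class FacetingSketch {

        /** 1. Enum faceting: each facet value keeps a bit vector over doc ids. */
        static long enumFacetCount(BitSet queryResults, BitSet facetDocs) {
            BitSet intersection = (BitSet) queryResults.clone();
            intersection.and(facetDocs);          // bitwise AND = set intersection
            return intersection.cardinality();    // number of matching docs with this facet value
        }

        /** 2. Field cache: a forward array mapping docId -> facet value ordinal. */
        static Map<Integer, Integer> fieldCacheFacetCounts(BitSet queryResults, int[] facetOrdByDoc) {
            Map<Integer, Integer> counts = new HashMap<>();
            // Equivalent to: select facet, count(*) from field_cache where docId in query_results group by facet
            for (int doc = queryResults.nextSetBit(0); doc >= 0; doc = queryResults.nextSetBit(doc + 1)) {
                counts.merge(facetOrdByDoc[doc], 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            // Five documents; docs 0, 2, 3 match the query.
            BitSet queryResults = new BitSet();
            queryResults.set(0); queryResults.set(2); queryResults.set(3);

            // Facet value "red" occurs in docs 0 and 3.
            BitSet redDocs = new BitSet();
            redDocs.set(0); redDocs.set(3);
            System.out.println("red: " + enumFacetCount(queryResults, redDocs));    // 2

            // Forward array: facet ordinal per document (0 = red, 1 = blue).
            int[] facetOrdByDoc = {0, 1, 1, 0, 1};
            System.out.println(fieldCacheFacetCounts(queryResults, facetOrdByDoc)); // {0=2, 1=1}
        }
    }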

Multi-term search

This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
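
For a flavour of what that linked material covers, below is a hedged, hypothetical Java sketch (not Lucene's code) of intersecting two sorted posting lists with a skip-ahead/advance step, the kind of operation that the skip-list structure mentioned above makes cheap:

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch: skip-ahead intersection of two sorted posting lists. */
    public class PostingIntersection {

        /** Returns the first index i >= from such that postings[i] >= target (binary search). */
        static int advance(int[] postings, int from, int target) {
            int lo = from, hi = postings.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (postings[mid] < target) lo = mid + 1; else hi = mid;
            }
            return lo;
        }

        /** Intersect two sorted doc-id lists, skipping ahead instead of scanning linearly. */
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) {
                    result.add(a[i]);
                    i++; j++;
                } else if (a[i] < b[j]) {
                    i = advance(a, i, b[j]);   // skip forward in list a
                } else {
                    j = advance(b, j, a[i]);   // skip forward in list b
                }
            }
            return result;
        }

        public static void main(String[] args) {
            int[] term1 = {1, 4, 7, 9, 12, 40, 41};
            int[] term2 = {4, 9, 40, 100};
            System.out.println(intersect(term1, term2)); // [4, 9, 40]
        }
    }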

白首有我共你 2024-11-06 15:08:38

可以在以下位置找到解释性帖子: http:// yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/

新方法的工作原理是不反转要分面的索引字段,从而允许快速查找任何给定文档的字段中的术语。它实际上是一种混合方法 - 为了节省内存和提高速度,许多文档中出现的术语(超过 5%)并不是不倒置的,而是使用传统的集合交集逻辑来获取计数。

An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/

The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It is actually a hybrid approach: to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted; instead, the traditional set-intersection logic is used to get their counts.
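
As a rough illustration of what un-inverting means, here is a simplified, hypothetical Java sketch (not Solr's actual implementation): it walks an inverted index once to build a docId-to-term-ordinals forward mapping, and skips very frequent terms, which would instead be counted with the traditional set-intersection logic:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    /** Hypothetical sketch of "un-inverting" a faceted field (not Solr's actual code). */
    public class UnInvertSketch {

        /**
         * Input:  inverted index for one field, termOrdinal -> sorted doc ids.
         * Output: forward index, docId -> list of term ordinals occurring in that doc.
         * Terms whose document frequency exceeds maxDocFreq are skipped here; they
         * would be counted with the traditional set-intersection logic instead.
         */
        static List<List<Integer>> unInvert(Map<Integer, int[]> invertedIndex,
                                            int numDocs, int maxDocFreq) {
            List<List<Integer>> docToTerms = new ArrayList<>(numDocs);
            for (int d = 0; d < numDocs; d++) docToTerms.add(new ArrayList<>());

            for (Map.Entry<Integer, int[]> e : invertedIndex.entrySet()) {
                int termOrd = e.getKey();
                int[] postings = e.getValue();
                if (postings.length > maxDocFreq) continue;     // too frequent: handle separately
                for (int docId : postings) {
                    docToTerms.get(docId).add(termOrd);          // forward mapping
                }
            }
            return docToTerms;
        }

        public static void main(String[] args) {
            // Term 0 appears in docs {0,2}, term 1 in {1,2}, term 2 in every doc (too frequent).
            Map<Integer, int[]> inverted = new TreeMap<>();
            inverted.put(0, new int[]{0, 2});
            inverted.put(1, new int[]{1, 2});
            inverted.put(2, new int[]{0, 1, 2, 3});
            System.out.println(unInvert(inverted, 4, 3)); // [[0], [1], [0, 1], []]
        }
    }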
