How does Lucene/Solr achieve high performance in multi-field / faceted search?
Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When this is implemented with a relational database, multi-field indices are not useful for a large number of facets: facets can be searched in any order, so any one specific ordered multi-column index is unlikely to be usable, and creating an index for every possible ordering is impractical.
Solr is advertised as coping well with faceted search. If I understand correctly, this must be related to Lucene (supposedly) performing well on multi-field queries, where the fields of a document correspond to the facets of an object.
Question
Lucene's inverted index could be stored in a relational database, and taking the intersection of the matching documents can be trivially achieved with an RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is: what is this technique/trick? More broadly: why can Lucene/Solr, in theory, achieve better faceted search performance than an RDBMS could (if that is indeed the case)?
Note: My first guess would be that Lucene uses some space partitioning method to partition a vector space whose dimensions are the document fields, but as I understand it, Lucene is not purely vector-space based.
Comments (2)
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here looks like this:
select facet, count(*) from field_cache
where docId in (select docId from query_results)
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
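To make the field-cache idea concrete, here is a minimal sketch in Python. It assumes the cache can be modelled as a plain array indexed by docId (a simplification for illustration, not Lucene's actual data structure), and the names `field_cache`, `facet_counts` and the sample values are hypothetical:

```python
from collections import Counter

# Hypothetical field cache: position i holds the facet value of document i.
# This models a non-inverted (docId -> value) lookup, not Lucene's real layout.
field_cache = ["sedan", "suv", "sedan", "coupe", "suv", "sedan"]


def facet_counts(cache, query_results):
    """Count facet values over the docIds matched by a query.

    Equivalent in spirit to:
      select facet, count(*) ... where docId in query_results group by facet
    """
    return Counter(cache[doc_id] for doc_id in query_results)


# docIds 0, 2, 3 and 4 are assumed to be the hits of some earlier query.
counts = facet_counts(field_cache, [0, 2, 3, 4])
```

Each hit costs one lookup by docId, so the work grows with the number of matching documents rather than with the size of the whole index.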
Multi-term search
This is where Lucene shines. Explaining why Lucene's approach is so good would take too long here, but I can recommend this post on Lucene Performance, or the papers linked therein.
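As a rough illustration of what multi-term intersection over an inverted index can look like, the sketch below intersects sorted posting lists by "leapfrogging": each list is advanced directly to the current candidate docId instead of being scanned entry by entry. This is a simplified stand-in and not Lucene's code. Binary search (`bisect_left`) plays the role that skip pointers would play, and the function name `intersect` and the sample lists are assumptions made for the example:

```python
from bisect import bisect_left


def intersect(postings_lists):
    """Intersect sorted lists of integer docIds by leapfrogging."""
    if not postings_lists or any(not p for p in postings_lists):
        return []
    # Start from the shortest list: it bounds the size of the result.
    lists = sorted(postings_lists, key=len)
    pos = [0] * len(lists)
    result = []
    candidate = lists[0][0]
    while True:
        matched = True
        for i, plist in enumerate(lists):
            # Jump to the first entry >= candidate, never moving backwards.
            pos[i] = bisect_left(plist, candidate, pos[i])
            if pos[i] == len(plist):
                return result  # one list is exhausted: no more matches
            if plist[pos[i]] != candidate:
                # This list skipped past the candidate; adopt its docId
                # as the new candidate and restart the round.
                candidate = plist[pos[i]]
                matched = False
                break
        if matched:
            result.append(candidate)
            candidate += 1  # look for the next docId after this match


# Hypothetical posting lists for three field=value terms.
hits = intersect([[2, 5, 9], [1, 2, 3, 5, 8, 9, 10], [2, 4, 5, 9, 12]])
```

When one term is rare, most entries of the longer lists are never touched, which is the main point of skipping.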
An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/