Performance impact of returning large result sets from Lucene
Does anyone know the performance impact of letting Lucene (or Solr) return very long result sets instead of just the usual "top 10"?
We would like to return all results from a user search (which can be around 100,000 documents) and then post-process the returned document IDs before returning the actual result.
Our current index contains about 10-20 million documents.
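For context, the usual way to pull back every matching document ID in Lucene without paying for stored-field retrieval or a huge top-N priority queue is a custom Collector. Below is a minimal sketch, assuming a Lucene 8+ API (where ScoreMode replaced the older needsScores() hook); the class name is illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Collects the global docID of every matching document, skipping both
// scoring and the top-N priority queue that a normal TopDocs search uses.
public class AllDocIdsCollector extends SimpleCollector {
    private final List<Integer> docIds = new ArrayList<>();
    private int docBase;

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase; // offset of this segment in the composite reader
    }

    @Override
    public void collect(int doc) {
        docIds.add(docBase + doc); // segment-local id -> global id
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES; // scores are never read
    }

    public List<Integer> getDocIds() {
        return docIds;
    }

    // Usage: run the query, then post-process the ids before loading documents.
    public static List<Integer> collectAll(IndexSearcher searcher, Query query) throws IOException {
        AllDocIdsCollector collector = new AllDocIdsCollector();
        searcher.search(query, collector);
        return collector.getDocIds();
    }
}
```

The post-processing can then run over the collected IDs before any actual documents are fetched, which is usually the expensive part.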
2 Answers
As spraff said, the answer to any question of the form "will X be fast enough?" is: "it depends."
I would be concerned about:
I don't know exactly what you're doing, but it may be possible to accomplish it with a custom scoring algorithm instead of post-processing the full result set.
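For illustration, one way to fold a per-document criterion into scoring in Lucene 7+ is FunctionScoreQuery, so the ordinary top-10 search already reflects the criterion. A minimal sketch; the "popularity" field name is hypothetical:

```java
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

public final class CustomScoring {
    // Multiplies each hit's relevance score by a numeric doc-values field,
    // so the ranking criterion runs inside the search instead of over a
    // 100,000-id result set afterwards. "popularity" is a hypothetical field.
    public static Query boostedByPopularity(Query original) {
        return FunctionScoreQuery.boostByValue(
                original, DoubleValuesSource.fromLongField("popularity"));
    }
}
```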
Of course, just because searching all documents is slower doesn't mean it will be too slow to be useful. Some faceting implementations do essentially retrieve all matching documents, and they perform adequately for many people.
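Relatedly, if you do need every matching ID out of Solr, deep paging with cursorMark (available since Solr 4.7) is usually gentler on the server than a single enormous rows= request. A hedged SolrJ sketch, with the URL and collection name as placeholders:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);                      // page size, not the total
            q.setFields("id");                    // ids only, to keep the payload small
            q.setSort("id", SolrQuery.ORDER.asc); // cursorMark requires a sort on uniqueKey
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                rsp.getResults().forEach(doc -> System.out.println(doc.get("id")));
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break;                        // cursor stopped advancing: done
                }
                cursor = next;
            }
        }
    }
}
```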
I was able to get 100,000 rows back in 2.5 seconds with 27 million documents indexed (each document is about 1 KB, with roughly 600 bytes of text fields). The hardware was not ordinary: the machine had 128 GB of RAM, and Solr's memory usage was about 50 GB resident and 106 GB virtual.
I started seeing performance degradation after growing to 80 million documents, and I am currently investigating how to match the hardware to the problem. Hope that helps.
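For anyone wanting to reproduce that kind of measurement, the request is presumably just an ordinary query with a large rows value, with fl restricted to the id field to keep the response small. A sketch in SolrJ, with the URL and collection name as placeholders:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BigRowsBenchmark {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100_000); // ask for the whole result set in one response
            q.setFields("id");  // return only the id field to shrink the payload
            long t0 = System.nanoTime();
            QueryResponse rsp = solr.query(q);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(rsp.getResults().size() + " rows in " + ms
                    + " ms (QTime=" + rsp.getQTime() + " ms)");
        }
    }
}
```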