Performance impact of returning large result sets from Lucene

Does anyone know the performance impact of letting Lucene (or Solr) return very long result sets instead of just the usual "top 10"?
We would like to return all results from a user search (possibly around 100,000 documents) and then post-process the returned document IDs before returning the actual result.

Our current index contains about 10-20 million documents.
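
For illustration, a minimal SolrJ sketch of the kind of request we mean; the Solr URL, core name, and query string are placeholders, not our actual setup. Requesting only the id field keeps stored-field IO to a minimum:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class LargeResultSetQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and core name; adjust to your installation.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("*:*"); // placeholder user query
            query.setRows(100_000);  // ask for the whole result set at once
            query.setFields("id");   // fetch document IDs only, not stored bodies
            QueryResponse response = solr.query(query);
            System.out.println("matched: " + response.getResults().getNumFound());
            solr.close();
        }
    }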

Comments (2)

屋檐 2025-01-09 21:51:09

As spraff said, the answer to any question of the form "will X be fast enough?" is: "it depends."

I would be concerned about:

  1. You'll trash your caches if these documents are large, especially if you have stored fields that you're retrieving (a collector that gathers bare doc IDs, sketched after this list, avoids touching stored fields).
  2. Because of #1, you'll have tons of disk IO, which is very slow.
  3. The work Lucene does grows with the number of documents returned, so even ignoring practical considerations like "disk is slower than RAM", it will be slower.
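
A minimal sketch of such a collector, assuming a modern Lucene (8.x/9.x) API; the class name is mine. It records global doc IDs without scoring or loading stored fields, so you can post-process the IDs and materialize only the documents that survive:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;

    public class AllDocIdsCollector extends SimpleCollector {
        private final List<Integer> docIds = new ArrayList<>();
        private int docBase;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            docBase = context.docBase; // per-segment doc ID offset
        }

        @Override
        public void collect(int doc) {
            docIds.add(docBase + doc); // store the global doc ID
        }

        @Override
        public ScoreMode scoreMode() {
            return ScoreMode.COMPLETE_NO_SCORES; // skip scoring entirely
        }

        public static List<Integer> collectAll(IndexSearcher searcher, Query query)
                throws IOException {
            AllDocIdsCollector c = new AllDocIdsCollector();
            searcher.search(query, c);
            return c.docIds;
        }
    }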

I don't know what you're doing, but it may be possible to accomplish it with a custom scoring algorithm instead of post-processing.
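
For instance, a hedged sketch of one way to do that in recent Lucene versions: FunctionScoreQuery (from the lucene-queries module) can rank matches by a numeric docvalues field, so only the top N ever need to be materialized. The "popularity" field here is hypothetical:

    import org.apache.lucene.queries.function.FunctionScoreQuery;
    import org.apache.lucene.search.DoubleValuesSource;
    import org.apache.lucene.search.Query;

    public class PopularityScoring {
        // Rank matches by a hypothetical "popularity" docvalues field
        // instead of fetching everything and re-sorting in application code.
        public static Query boostByPopularity(Query base) {
            return new FunctionScoreQuery(
                    base, DoubleValuesSource.fromLongField("popularity"));
        }
    }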

Of course, just because searching all documents will be slower doesn't mean it will be too slow to be useful. Some faceting implementations essentially do retrieve all matching documents, and they perform adequately for many people.

浴红衣 2025-01-09 21:51:09

I was able to get 100,000 rows back in 2.5 seconds with 27 million documents indexed (each document is about 1 KB, with roughly 600 bytes of text fields). The hardware was not ordinary: it had 128 GB of RAM, and Solr's memory usage was about 50 GB resident and 106 GB virtual.

I started seeing performance degradation after going to 80 million documents, and I'm currently investigating how to match the hardware to the problem. Hope that helps.
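
If a single huge rows= request becomes a problem at that scale, Solr's cursorMark deep paging (available since Solr 4.7) can fetch the same result set in memory-friendly chunks; this is a different technique than the one benchmarked above. A SolrJ sketch, with the URL, core name, and page size as placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPagingExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(10_000);  // page size, tune to taste
            query.setFields("id");
            // The cursor requires a total order including the unique key.
            query.setSort("id", SolrQuery.ORDER.asc);
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(query);
                // ... post-process rsp.getResults() here ...
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break; // no more pages
                cursor = next;
            }
            solr.close();
        }
    }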
