Filtering by date range in Lucene

Posted on 2024-09-10 02:38:15

I know the title might suggest it is a duplicate but I haven't been able to find the answer to this specific issue:

I have to filter search results based on a date range. The date of each document is stored on it, but not indexed. When using a Filter, I noticed that the filter is called with all the documents in the index.

This means the filter will get slower as the index grows (currently only ~300,000 documents in it) as it has to iterate through every single document.

I can't use RangeQuery since the date is not indexed.

How can I apply the filter only to the documents that the query returns, to make it more efficient?

I would prefer to do this before I am handed the results, so as not to mess up the scores and collectors I have.

Comments (2)

彡翼 2024-09-17 02:38:15

Not quite sure if this will help, but I had a similar problem to yours and came up with the following (+ notes):

  1. I think you're really going to have to index the date field. Nothing else makes any sense in terms of querying/filtering etc.
  2. In Lucene.net v2.9, range querying where there are lots of terms seems to have become terribly slow compared to v2.4.
  3. I fixed my speed issues when using date fields by switching to using a numeric field and numeric field queries. This actually gave me quite a speed boost over my Lucene.net v2.4 baseline.
  4. Wrapping your query in a caching wrapper filter means you can hang onto the document bit set for the filter. This will also dramatically speed up subsequent queries using the same filter.
  5. A filter doesn't play a part in the scoring for a set of query results.
  6. Joining your cached filter to the rest of your query (where I guess you've got your custom scores and collectors) means it should meet the final part of your criteria.

So, to summarise: index your date fields as numeric fields; build your queries as numeric range queries; transform these into cached filter wrappers and hang onto them.
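
To make the first two steps of that summary a bit more concrete, here is a minimal sketch in plain Java. It is not the Lucene or Lucene.net API: `DateRangeIndex`, `encode`, `add` and `docsBetween` are hypothetical names, and a `TreeMap` merely stands in for an indexed numeric field. The point it tries to show is that once a date is encoded as a number whose order matches chronological order, a range lookup only visits the entries inside the range instead of every document.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for an indexed numeric field (not a Lucene class):
// maps an encoded date to the ids of the documents that carry it.
class DateRangeIndex {
    private final TreeMap<Long, List<Integer>> docsByDay = new TreeMap<>();

    // Encode a date as days since 1970-01-01, so numeric order equals chronological order.
    static long encode(LocalDate date) {
        return date.toEpochDay();
    }

    // "Index" one document under its encoded date.
    void add(int docId, LocalDate date) {
        docsByDay.computeIfAbsent(encode(date), k -> new ArrayList<>()).add(docId);
    }

    // Inclusive range lookup. Only entries inside [from, to] are visited,
    // so the cost does not grow with documents outside the range.
    List<Integer> docsBetween(LocalDate from, LocalDate to) {
        List<Integer> result = new ArrayList<>();
        if (from.isAfter(to)) {
            return result; // empty range
        }
        for (List<Integer> ids : docsByDay.subMap(encode(from), true, encode(to), true).values()) {
            result.addAll(ids);
        }
        return result;
    }
}
```

In real Lucene code the equivalent would be a numeric field plus a numeric range query or filter, as described above; the sketch only models why indexing the date changes the cost of the lookup.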

I think you'll see some spectacular speedups over your current index usage.

Good luck!

P.S. I would never try to guess in advance what will be fast or slow when using Lucene. I've always been surprised in both directions!

厌倦 2024-09-17 02:38:15

First, to filter on a field, it has to be indexed.

Second, using a Filter is considered to be the best way to restrict the set of documents to search on. One reason for this is that you can cache the filter results to be used for other queries. The filter data structure is also pretty efficient: it is a bit set of the documents matching the filter.
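
As a rough model of that idea (plain Java, not Lucene's own `Filter` classes), the sketch below caches one bit set per filter and intersects it with the bit set of query hits. `FilterCache`, `get`, `restrict` and the `builds` counter are hypothetical names introduced only for illustration; scoring is assumed to happen elsewhere and is not touched by the intersection.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical cache of filter results (not a Lucene class):
// one bit per document id, set when the document matches the filter.
class FilterCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    int builds = 0; // how many times a filter bit set was actually computed

    // Return the cached bit set for this filter key, building it only on first use.
    BitSet get(String key, Supplier<BitSet> builder) {
        BitSet cached = cache.get(key);
        if (cached == null) {
            builds++;
            cached = builder.get();
            cache.put(key, cached);
        }
        return cached;
    }

    // Keep only the query hits whose bit is also set in the filter.
    // The inputs are left unmodified, so the cached bit set can be reused.
    static BitSet restrict(BitSet queryHits, BitSet filterBits) {
        BitSet out = (BitSet) queryHits.clone();
        out.and(filterBits);
        return out;
    }
}
```

The expensive part (building the bit set) happens once per distinct filter; later queries that reuse the same date range only pay for the bitwise AND.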

But if you insist on not using filters, I think the only way is to use a boolean query to do the filtering.
