Lucene.Net memory consumption and slow search speed when using too many clauses

Published 2024-09-06 15:33:56


I have a DB containing text file attributes and text file primary-key IDs, and I have indexed around 1 million text files along with their IDs (the primary keys in the DB).

Now, I am searching at two levels.
First is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).

Then I build a Boolean query, for instance:

+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )

and search my index with it.

The problem is that such a query (with 2 million clauses) takes far too long to return results and consumes far too much memory.

Is there any way to optimize this?


Comments (2)

谁许谁一生繁华 2024-09-13 15:33:56


Assuming you can reuse the dbid part of your queries:

  1. Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
  2. Make both parts into queries
  3. Convert the pkid query to a filter (by using QueryWrapperFilter)
  4. Convert the filter into a cached filter (using CachingWrapperFilter)
  5. Hang onto the filter, perhaps via some kind of dictionary
  6. Next time you do a search, use the overload that allows you to use a query and filter

As long as the pkid search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even work through commit points (I understand the bit sets are calculated on a per-segment basis).
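For concreteness, here is a minimal sketch of those steps against the Lucene.Net 3.x API. The class, method, cache, and key names are illustrative (only "Text" and "pkID" come from your query), and note the comment about BooleanQuery.MaxClauseCount, which defaults to 1024:

    using System.Collections.Generic;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    static class CachedIdFilterSearch
    {
        // Step 5: hang onto built filters, keyed by whatever identifies the ID set.
        static readonly Dictionary<string, Filter> FilterCache =
            new Dictionary<string, Filter>();

        public static TopDocs Search(IndexSearcher searcher, string idSetKey,
                                     IEnumerable<int> pkIds)
        {
            // Steps 1-2: the text part stays an ordinary query...
            Query textQuery = new PrefixQuery(new Term("Text", "test"));

            Filter pkFilter;
            if (!FilterCache.TryGetValue(idSetKey, out pkFilter))
            {
                // ...and the pkID part becomes one big Boolean query.
                BooleanQuery.MaxClauseCount = int.MaxValue; // default of 1024 is far too low here
                var pkQuery = new BooleanQuery();
                foreach (int id in pkIds)
                    pkQuery.Add(new TermQuery(new Term("pkID", id.ToString())), Occur.SHOULD);

                // Steps 3-4: wrap it as a filter, then make that filter cacheable.
                pkFilter = new CachingWrapperFilter(new QueryWrapperFilter(pkQuery));
                FilterCache[idSetKey] = pkFilter;
            }

            // Step 6: the query + filter overload; repeat searches reuse the cached
            // per-segment bit sets instead of re-evaluating millions of clauses.
            return searcher.Search(textQuery, pkFilter, 100);
        }
    }

The first call still pays the full cost of building the filter; the win comes on every later search that reuses the same ID set.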

HTH


p.s.

I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!

眼泪淡了忧伤 2024-09-13 15:33:56


The best optimization is NOT to use a query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.

In your particular case, I think it would be much more practical to search your index first with the +Text:"test*" query and then limit the results by running a DB query against the Lucene hits.
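A rough sketch of that order of operations, again assuming the Lucene.Net 3.x API; searcher is a placeholder for your own IndexSearcher, "pkID" must be a stored field, and the SQL at the end is schematic:

    using System.Collections.Generic;
    using System.Linq;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Search the index first with just the text query...
    TopDocs hits = searcher.Search(
        new PrefixQuery(new Term("Text", "test")), 10000);

    // ...collect the pkIDs of the matching documents...
    List<string> matchedIds = hits.ScoreDocs
        .Select(sd => searcher.Doc(sd.Doc).Get("pkID"))
        .ToList();

    // ...then restrict on the DB side, in batches, e.g.:
    //   SELECT ... FROM files WHERE pkID IN (<next batch of matchedIds>)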
