Lucene.Net memory consumption and slow search when too many clauses are used
I have a database that stores text file attributes and text file primary key IDs, and I have indexed around 1 million text files along with their IDs (the primary keys in the DB).
Now I am searching at two levels.
First comes a straightforward DB search, which returns primary keys as its result (roughly 2 or 3 million IDs).
Then I build a Boolean query, for instance:
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and run it against my index.
The problem is that such a query (with 2 million clauses) takes far too long to return results and consumes far too much memory.
Is there any way to optimize this?
Comments (2)
Assuming you can reuse the dbid part of your queries:
As long as the pkid search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even carry over across commit points (as I understand it, the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
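To make the caching idea concrete, here is a minimal toy sketch in Python. It is not the Lucene.Net API: `Segment`, `CachedIdFilter`, and `search` are hypothetical names, and the "index" is just a list of in-memory segments. It only illustrates the point made above, namely that the bit set for a fixed set of IDs is computed once per segment and then reused by later text queries. In Lucene.Net itself this role is usually played by a cached filter; check which filter classes your version provides.

```python
# Toy model of a cached, per-segment ID filter. Not the Lucene.Net API.
from typing import Dict, List, Set, Tuple


class Segment:
    """Hypothetical index segment: document position -> (pkID, text)."""

    def __init__(self, name: str, docs: List[Tuple[int, str]]) -> None:
        self.name = name
        self.docs = docs


class CachedIdFilter:
    """Builds one bit set per segment for a fixed ID set and reuses it."""

    def __init__(self, allowed_ids: Set[int]) -> None:
        self.allowed_ids = frozenset(allowed_ids)
        self._bits: Dict[str, int] = {}  # segment name -> bit set stored as an int
        self.builds = 0                  # number of bit sets actually computed

    def bits_for(self, segment: Segment) -> int:
        cached = self._bits.get(segment.name)
        if cached is not None:
            return cached
        bits = 0
        for pos, (pk_id, _text) in enumerate(segment.docs):
            if pk_id in self.allowed_ids:
                bits |= 1 << pos
        self._bits[segment.name] = bits
        self.builds += 1
        return bits


def search(segments: List[Segment], prefix: str, id_filter: CachedIdFilter) -> List[int]:
    """Return pkIDs that pass the filter and contain a word starting with prefix."""
    hits = []
    for seg in segments:
        bits = id_filter.bits_for(seg)
        for pos, (pk_id, text) in enumerate(seg.docs):
            if (bits >> pos) & 1 and any(w.startswith(prefix) for w in text.split()):
                hits.append(pk_id)
    return hits


segments = [
    Segment("seg0", [(1, "test one"), (2, "testing two"), (4, "other")]),
    Segment("seg1", [(100, "a test"), (115, "nothing"), (1041, "tests")]),
]
id_filter = CachedIdFilter({1, 4, 100, 115, 1041})
first = search(segments, "test", id_filter)   # builds both bit sets
second = search(segments, "oth", id_filter)   # reuses them, no rebuild
```

In this model a newly added segment triggers one extra bit set build while the existing ones stay cached, which mirrors the per-segment behaviour described in the answer. Merging all segments into one would discard that benefit.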
The best optimization is NOT to use a query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with
+Text:"test*"
query and then limit the results by running a DB query on Lucene hits.
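As a rough sketch of that reversed order, the Python snippet below takes the IDs returned by the text search and asks the database which of them also satisfy the DB-side condition. Everything specific here is an assumption made for illustration: the `files` table, the `size_kb` column, the `restrict_hits` helper, and the hard-coded `lucene_hits` list that stands in for the real Lucene result. The IDs are sent in chunks so that no single SQL statement carries an oversized `IN` list.

```python
import sqlite3
from typing import Iterator, List


def chunked(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def restrict_hits(conn: sqlite3.Connection, hit_ids: List[int],
                  min_size: int, chunk_size: int = 500) -> List[int]:
    """Keep only the hit IDs that also satisfy the DB-side condition."""
    kept: List[int] = []
    for chunk in chunked(hit_ids, chunk_size):
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT pk_id FROM files "
            f"WHERE size_kb >= ? AND pk_id IN ({placeholders})",
            [min_size, *chunk],
        )
        kept.extend(row[0] for row in rows)
    return sorted(kept)


# Hypothetical schema and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (pk_id INTEGER PRIMARY KEY, size_kb INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(1, 10), (4, 250), (100, 300), (115, 5), (1041, 120)],
)

lucene_hits = [1, 100, 1041, 7]  # stand-in for the IDs the text query returned
result = restrict_hits(conn, lucene_hits, min_size=100)
```

Whether this is faster in practice depends on how selective the text query is: it pays off when the text search returns far fewer documents than the 2 to 3 million IDs the DB query produces.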