Lucene search takes too long
I'm using Lucene.net (2.9.2.2) on a (currently) 70 GB index. I can do a fairly complicated search and get all the document IDs back in 1~2 seconds, but actually loading all the hits (about 700 thousand in my test queries) takes 5+ minutes.
We aren't using Lucene for a UI; this is a datastore between processes where we have hundreds of millions of pre-cached data elements, and the part I am working on exports a few specific fields from each found document. (Ergo, pagination doesn't make sense, since this is an export between processes.)
My question is: what is the best way to get all of the documents in a search result? Currently I am using a custom collector that does a get on the document (with a MapFieldSelector) as it's collecting, roughly as sketched below. I've also tried iterating through the list after the collector has finished, but that was even worse.
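For reference, the collector in question is shaped roughly like this (a minimal sketch with placeholder field names, assuming the stock Lucene.Net 2.9 Collector API):

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Loads the selected stored fields for every hit while collecting --
// one stored-field read per hit, which is where the minutes go.
public class ExportCollector : Collector
{
    private readonly FieldSelector selector =
        new MapFieldSelector(new[] { "id", "payload" }); // placeholder fields
    private IndexReader reader;
    public readonly List<Document> Hits = new List<Document>();

    public override void SetScorer(Scorer scorer) { } // scores unused

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.reader = reader; // per-segment reader
    }

    public override void Collect(int doc)
    {
        // doc is a per-segment id, valid against the per-segment reader
        Hits.Add(reader.Document(doc, selector));
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}
```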
I'm open to ideas :-).
Thanks in advance.
2 Answers
What fields do you need to search? What fields do you need to store?
Lucene.net is probably not the most efficient way to store and retrieve the actual document texts.
Your scenario suggests not storing anything, indexing the needed fields and returning a list of document ids. The documents themselves can be stored in an auxiliary database.
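A minimal sketch of that approach, assuming the stock Lucene.Net 2.9 Collector API: collect nothing but absolute doc IDs (no stored-field I/O at all) and resolve the exported data from the auxiliary store afterwards.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Collects only absolute doc IDs; no Document() calls, so no
// stored-field I/O during the search itself.
public class DocIdCollector : Collector
{
    private int docBase;
    public readonly List<int> DocIds = new List<int>();

    public override void SetScorer(Scorer scorer) { } // scores unused

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.docBase = docBase; // offset of this segment's first doc
    }

    public override void Collect(int doc)
    {
        DocIds.Add(docBase + doc); // per-segment id -> index-wide id
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}
```

One caveat: raw Lucene doc IDs are not stable across merges, so anything persistent should key on a small stored (or cached) identifier field rather than on the doc ID itself.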
Hmmm, given that you've found problems when your "get" code was moved outside the collector, it sounds like your problem is I/O-related.
I'm almost dreading asking this given the size of your index, but have you tried:
If so, was there a noticeable effect on the rate at which documents are retrieved? BTW, if my shaky maths is correct, that works out to about 2,333 items/second retrieved (700,000 hits over 5 minutes)...
Also, for the subset of fields you're retrieving, are any of them amenable to compression? Or have you already experimented with compression?
As a related matter, what proportion of your index do those 700 thousand items represent? It'd be interesting to get a feel for the I/O throughput. You could probably work out the maximum theoretical data rate for your machine/hard-drive combination and see whether you're already close to the limit.
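On the compression question: Lucene 2.9 still supports compressed stored fields directly via Field.Store.COMPRESS (deprecated there and removed in later versions). A minimal indexing-time sketch, with placeholder field names:

```csharp
using Lucene.Net.Documents;

class CompressedFieldExample
{
    // "id" / "payload" are placeholder names. Field.Store.COMPRESS
    // compresses the stored bytes on disk, trading CPU for I/O --
    // only a win if retrieval really is disk-bound.
    static Document BuildDoc(string externalKey, string exportedText)
    {
        var doc = new Document();
        doc.Add(new Field("id", externalKey,
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("payload", exportedText,
                          Field.Store.COMPRESS, Field.Index.NO));
        return doc;
    }
}
```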