Lucene (.NET) document structure and performance advice
I am indexing about 100M documents, each consisting of a few string identifiers and a hundred or so numeric terms. I won't be doing range queries, so I haven't dug too deep into NumericField, but I don't think it's the right choice here.
My problem is that query performance degrades quickly when I start adding OR criteria to my query. All my queries are on specific numeric terms, so a document looks like StringField:[someString] plus N instances of DataField:[someNumber]. I then query it with something like DataField:((+1 +(2 3)) (+75 +(3 5 52)) (+99 +88 +(102 155 199))).
Currently these queries take about 7 to 16 seconds to run on my laptop. I would like to make sure that's really the best they can do. I am open to suggestions on field structure and query structure :-).
Thanks
Josh
PS: I have already read through all the other Lucene performance discussions on here, on the Lucene wiki, and at Lucid Imagination. I'm a bit further down the rabbit hole than that.
Comments (1)
Since you mentioned that you are doing exact-number queries and not range queries, I won't suggest looking at the really fast numeric range queries in Lucene 3.0.
Going by your description, I suspect scoring is causing the problem. With so many nested boolean queries, scoring keeps getting more complex, and since scores are floating-point numbers, the arithmetic is slower. If you don't care about scores, writing a custom Collector is a good idea. The javadoc I linked has an example you can use as a starting point for writing your own.
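To make the "matching without scoring" idea concrete, here is a rough sketch in plain JDK Java. It is a conceptual model only and does not use Lucene's API: the class NoScoreMatch, its in-memory postings map, and the sample documents are all hypothetical. It evaluates the query from the question as pure set operations on doc ids, which is roughly what you get when a collector only records each doc id passed to collect(int doc) and never asks the scorer for a score.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Conceptual model only: plain JDK code, NOT Lucene's API. All names are hypothetical. */
final class NoScoreMatch {
    // Toy "postings": term value -> set of doc ids that contain it.
    private final Map<Integer, BitSet> postings = new HashMap<>();

    /** Index one document as a bag of numeric terms. */
    public void add(int docId, int... terms) {
        for (int t : terms) {
            postings.computeIfAbsent(t, k -> new BitSet()).set(docId);
        }
    }

    /** Docs containing the term; returns a copy so callers can mutate it safely. */
    public BitSet term(int t) {
        BitSet b = postings.get(t);
        return b == null ? new BitSet() : (BitSet) b.clone();
    }

    /** OR of several terms, e.g. the optional group (2 3). */
    public BitSet anyOf(int... terms) {
        BitSet out = new BitSet();
        for (int t : terms) {
            out.or(term(t));
        }
        return out;
    }

    /** AND of several required clauses, e.g. +1 +(2 3). */
    public static BitSet allOf(BitSet first, BitSet... rest) {
        BitSet out = (BitSet) first.clone();
        for (BitSet b : rest) {
            out.and(b);
        }
        return out;
    }

    /** (+1 +(2 3)) (+75 +(3 5 52)) (+99 +88 +(102 155 199)), with no scores computed. */
    public BitSet exampleQuery() {
        BitSet out = allOf(term(1), anyOf(2, 3));
        out.or(allOf(term(75), anyOf(3, 5, 52)));
        out.or(allOf(term(99), term(88), anyOf(102, 155, 199)));
        return out;
    }

    public static void main(String[] args) {
        NoScoreMatch idx = new NoScoreMatch();
        idx.add(0, 1, 2);         // satisfies the first group
        idx.add(1, 1, 4);         // has 1 but neither 2 nor 3
        idx.add(2, 75, 52);       // satisfies the second group
        idx.add(3, 99, 88, 155);  // satisfies the third group
        idx.add(4, 99, 102);      // missing the required 88
        System.out.println(idx.exampleQuery()); // prints {0, 2, 3}
    }
}
```

In a real index the matching itself is still done by Lucene; the sketch only shows that the result you need is a set of doc ids, so nothing in it depends on floating-point scores.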