[Lucene] What is the overhead of an IndexReader/Searcher?
Most of the Lucene documentation advises keeping a single IndexReader instance and reusing it, because opening a new reader is expensive.
However, I find it hard to see what this overhead consists of and what influences it.
Related to this: how much overhead does keeping an IndexReader open actually cause?
The context for this question is:
We currently run a clustered Tomcat stack where we do full-text searches from the servlet container.
These searches run against a separate Lucene index for each client, because each client only searches its own data. Each index contains anywhere from a few thousand to (currently) about 100,000 documents.
Because the Tomcat nodes are clustered, any client can connect to any node.
Keeping the readers open would therefore mean keeping a few thousand IndexReaders open on each Tomcat node. This seems like a bad idea, but constantly reopening them doesn't seem like a very good idea either.
While it's possible for me to change the way we deploy Lucene somewhat, I'd rather not if it isn't needed.
Usually the field cache is the slowest piece of Lucene to warm up, although other things like filters and segment pointers contribute. The specific amount kept in cache will depend on your usage, especially on things like how much data is stored (as opposed to just indexed).
You can use whatever memory usage investigation tool is appropriate for your environment to see how much Lucene itself takes up for your application, but keep in mind that "warm up cost" also refers to the various caches that the OS and file system keep open, which will probably not appear in top or whatever you use.
You are right that having thousands of indexes is not a common practice. The standard advice is to have them share an index and use filters to ensure that the appropriate results are returned.
Since you are interested in performance, you should keep in mind that having thousands of indices on the server will result in thousands of files strewn all across the disk, which will lead to tons of seek time that wouldn't happen if you just had one big index. Depending on your requirements, this may or may not be an issue.
As a side note: it sounds like you may be using a networked file system, which is a big performance hit for Lucene.
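If moving to a single shared index is not an option, one possible middle ground between keeping thousands of readers open and reopening on every request is to keep only the most recently used readers open on each node. The following is a minimal sketch of that idea using only the Java standard library. ReaderCache, maxOpen, and the opener function are hypothetical names for illustration and are not part of Lucene; the cache works with any Closeable, so in practice the opener would be whatever code you already use to open a client's IndexReader.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Bounded LRU cache of open per-client readers.
 * Hypothetical helper for illustration; not a Lucene class.
 */
class ReaderCache<R extends Closeable> {
    private final int maxOpen;
    private final Function<String, R> opener;
    private final LinkedHashMap<String, R> open;

    ReaderCache(int maxOpen, Function<String, R> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder = true: iteration runs from least to most recently used.
        this.open = new LinkedHashMap<>(16, 0.75f, true);
    }

    /** Returns the cached reader for this client, opening it on a miss. */
    synchronized R get(String clientId) {
        R reader = open.get(clientId);
        if (reader == null) {
            reader = opener.apply(clientId);
            open.put(clientId, reader);
            evictIfNeeded();
        }
        return reader;
    }

    /** Closes least recently used readers until at most maxOpen remain. */
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, R>> it = open.entrySet().iterator();
        while (open.size() > maxOpen && it.hasNext()) {
            Map.Entry<String, R> eldest = it.next();
            it.remove();
            try {
                // Caveat: this sketch closes immediately. A real version would
                // need some form of reference counting so that a reader still
                // in use by a running search is not closed underneath it.
                eldest.getValue().close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    synchronized int size() {
        return open.size();
    }
}
```

With maxOpen sized to the number of clients that are typically active on a node, frequently searched indexes stay warm while rarely used ones pay the open cost only when they are actually queried. Whether that trade-off is acceptable depends on how expensive warm-up turns out to be for your indexes, which is worth measuring first.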