FieldCache with a frequently updated index
Hi
I have a Lucene index that is frequently updated with new records. There are 5,000,000 records in the index, and I'm caching one of my numeric fields using FieldCache. After updating the index, however, it takes time to reload the FieldCache (I reload the cache because the documentation says doc IDs are not reliable). How can I minimize this overhead by adding only the newly added doc IDs to the FieldCache? This reload has become a bottleneck in my application.
IndexReader reader = IndexReader.Open(diskDir);
int[] dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // this line takes ~4 seconds to load the array
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // this takes ~0 seconds, as expected (cached)
// HERE we add some documents to the index and need to reopen the reader to reflect the changes
reader = reader.Reopen();
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // this takes ~4 seconds again to load the array
I want a mechanism that minimizes this time by adding only the newly added documents to the array. There is a technique like this: http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html
It improves performance, but it still loads all the documents we already have, and I think there is no need to reload them all if we can find a way to add only the newly added documents to the array.
2 Answers
The FieldCache uses weak references to index readers as the keys for its cache (obtained by calling IndexReader.GetCacheKey, which has been un-obsoleted). A standard call to IndexReader.Open with an FSDirectory will use a pool of readers, one for every segment.

You should always pass the innermost reader to the FieldCache. Check out ReaderUtil for some helpers to retrieve the individual reader a document is contained within. Document IDs won't change within a segment; what the documentation means by describing them as unpredictable/volatile is that they can change between two index commits: deleted documents may have been pruned, segments may have been merged, and so on. A commit may remove a segment from disk (merged/optimized away), which means that new readers won't have that pooled segment reader, and garbage collection will remove it as soon as all older readers are closed.

Never, ever, call FieldCache.PurgeAllCaches(). It's meant for testing, not production use.

Added 2011-04-03: example code using subreaders.
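Along those lines, here is a minimal sketch of caching per segment rather than per top-level reader. It assumes the Lucene.NET 2.9-era API, where ReaderUtil.GatherSubReaders collects the per-segment readers (the exact collection type of that overload varies between versions). After Reopen(), unchanged segments keep their reader instances, so their FieldCache arrays are reused and only new segments pay the load cost:

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

class SegmentCacheExample
{
    static void LoadPerSegment(IndexReader topReader)
    {
        // Collect the innermost (per-segment) readers.
        var subReaders = new List<IndexReader>();
        ReaderUtil.GatherSubReaders(subReaders, topReader);

        int docBase = 0;
        foreach (IndexReader segment in subReaders)
        {
            // Cached per segment: only segments created since the last
            // Reopen() actually hit disk here.
            int[] dates = FieldCache_Fields.DEFAULT.GetInts(segment, "newsdate");

            // dates is indexed by the segment's own doc IDs; add docBase
            // when mapping back to top-level doc IDs.
            docBase += segment.MaxDoc();
        }
    }
}
```

The key point is that the cache key is the segment reader itself, so as long as a segment survives a reopen, its array never has to be rebuilt.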
Here's one way I've solved this problem. You'll need a background thread that constructs IndexSearcher instances, one at a time on some interval. Continue using your current IndexSearcher instance until the new one from the background thread is ready, then swap the new one in as your current one. Each instance acts as a snapshot of the index from the time it was first opened. Note that the memory overhead for the FieldCache doubles, because you need two instances in memory at once. You can safely write to the IndexWriter while this is happening.

If you need to, you can take this a step further and make index changes immediately available for search, although it can get tricky. You'll need to associate a RAMDirectory with each snapshot instance above to keep the changes in memory, and then create a second IndexWriter that points to that RAMDirectory. For each index write you'll write to both IndexWriter instances. For searches you'll use a MultiSearcher across the RAMDirectory and your normal index on disk. The RAMDirectory can be thrown away once the IndexSearcher it was coupled with is no longer used. I'm glossing over some details here, but that's the general idea. Hope this helps.