FieldCache with a frequently updated index



I have a Lucene index that is frequently updated with new records. There are 5,000,000 records in the index, and I cache one of my numeric fields using FieldCache. After updating the index, however, it takes time to reload the FieldCache (I reload the cache because the documentation says doc IDs are not reliable). How can I minimize this overhead by adding only the newly added doc IDs to the FieldCache? This reload has become the bottleneck of my application.


IndexReader reader = IndexReader.Open(diskDir);
int[] dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This line takes 4 seconds to load the array
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This line takes ~0 seconds, as expected
// HERE we add some documents to the index, then reopen the reader to reflect the changes

reader = reader.Reopen();
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This takes 4 seconds again to load the array

I want a mechanism that minimizes this time by adding only the newly added documents to the array. There is a technique like this one, http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html, that improves performance, but it still loads all the documents we already have. I think there is no need to reload them all if we can find a way to add only the newly added documents to the array.



Comments (2)

迷雾森÷林ヴ 2024-11-03 08:39:45

The FieldCache uses weak references to index readers as the keys for its cache (by calling IndexReader.GetCacheKey, which has been un-obsoleted). A standard call to IndexReader.Open with an FSDirectory will use a pool of readers, one for every segment.

You should always pass the innermost reader to the FieldCache. Check out ReaderUtil for some helpers that retrieve the individual reader a document is contained within. Document ids won't change within a segment; what the documentation means by calling them unpredictable/volatile is that they can change between two index commits, when deleted documents may have been pruned, segments merged, and so on.

A commit may need to remove a segment from disk (merged or optimized away), which means that new readers won't have that pooled segment reader, and garbage collection will remove it as soon as all older readers are closed.

Never, ever, call FieldCache.PurgeAllCaches(). It's meant for testing, not production use.

Added 2011-04-03: example code using subreaders.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;

// Grab all subreaders.
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);

// Walk the subreaders, rebasing documentId into the segment's local id
// space. While subReaderId is beyond the last document id in the
// subreader (MaxDoc() - 1), move on to the next one.
var subReaderId = documentId;
var subReader = subReaders.First(sub => {
    if (sub.MaxDoc() <= subReaderId) {
        subReaderId -= sub.MaxDoc();
        return false;
    }

    return true;
});

var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];
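
Per-segment caching is also what makes the reopen in the question cheap: after a Reopen, unchanged segments are served by the same pooled SegmentReader instances, so their FieldCache entries survive and only newly written segments must be loaded. Below is a minimal sketch of warming the cache this way, assuming the same Lucene.Net 2.9-era API as above; WarmFieldCache is a hypothetical helper name, not a library call.

// Hypothetical helper: warm the FieldCache segment by segment. After a
// Reopen, unchanged segments reuse their existing cache entries, so only
// new segments pay the load cost.
static void WarmFieldCache(IndexReader reader, string field)
{
    var subReaders = new List<IndexReader>();
    ReaderUtil.GatherSubReaders(subReaders, reader);

    foreach (var sub in subReaders)
    {
        // Cheap for already-cached segments; loads only the new ones.
        FieldCache_Fields.DEFAULT.GetInts(sub, field);
    }
}

// Usage: reopen and warm. Reopen returns the same instance if nothing
// changed, so only swap (and close the old reader) when it differs.
var newReader = reader.Reopen();
if (newReader != reader)
{
    WarmFieldCache(newReader, "newsdate");
    reader.Close();
    reader = newReader;
}

With 5,000,000 documents spread over a handful of segments, each refresh should then only load the small, newly flushed segments instead of the whole field.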

寄风 2024-11-03 08:39:45

Here's one way I've solved this problem. You'll need a background thread that constructs IndexSearcher instances, one at a time, on some interval. Keep using your current IndexSearcher instance until the new one from the background thread is ready, then swap it in as the current one. Each instance acts as a snapshot of the index at the time it was first opened. Note that the memory overhead of the FieldCache doubles, because you need two instances in memory at once. You can safely write to the IndexWriter while this is happening.
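
A minimal sketch of that snapshot-swap pattern, assuming a Lucene.Net 2.9-era API; SnapshotSearcher and its members are illustrative names, not a library class.

using System.Threading;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Illustrative wrapper: searches read Current, while a background thread
// periodically calls Refresh() to swap in a fresh snapshot.
class SnapshotSearcher
{
    private readonly Directory _directory;
    private IndexSearcher _current;

    public SnapshotSearcher(Directory directory)
    {
        _directory = directory;
        _current = new IndexSearcher(IndexReader.Open(_directory, true));
    }

    public IndexSearcher Current
    {
        get { return _current; }
    }

    // Building the new searcher (and its FieldCache) happens here,
    // off the search path.
    public void Refresh()
    {
        var fresh = new IndexSearcher(IndexReader.Open(_directory, true));
        var old = Interlocked.Exchange(ref _current, fresh);
        // In real code, close 'old' only after in-flight searches using
        // it have finished (e.g. via reference counting).
        old.Close();
    }
}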

If you need to, you can take this a step further and make index changes immediately available for search, although it can get tricky. You'll need to associate a RAMDirectory with each snapshot instance above to keep the changes in memory, and then create a second IndexWriter that points at that RAMDirectory. Every index write then goes to both IndexWriter instances. For searches, use a MultiSearcher across the RAMDirectory and your normal on-disk index. The RAMDirectory can be thrown away once the IndexSearcher it was coupled with is no longer in use. I'm glossing over some details here, but that's the general idea.
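
A rough sketch of the dual-write half of that idea, again assuming a Lucene.Net 2.9-era API; the analyzer, field name, and query below are placeholders.

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var ramDir = new RAMDirectory();
var diskDir = FSDirectory.Open(new System.IO.DirectoryInfo("index"));

var ramWriter = new IndexWriter(ramDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var diskWriter = new IndexWriter(diskDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);

// Every write goes to both indexes; the RAM copy is searchable right away.
var doc = new Document();
doc.Add(new Field("newsdate", "20110403", Field.Store.YES, Field.Index.NOT_ANALYZED));
ramWriter.AddDocument(doc);
diskWriter.AddDocument(doc);
ramWriter.Commit();

// Search spans the in-memory changes and the on-disk snapshot.
var ramSearcher = new IndexSearcher(IndexReader.Open(ramDir, true));
var diskSearcher = new IndexSearcher(IndexReader.Open(diskDir, true));
var multi = new MultiSearcher(new Searchable[] { ramSearcher, diskSearcher });
var hits = multi.Search(new TermQuery(new Term("newsdate", "20110403")), 10);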

Hope this helps.

