FieldCache with a frequently updated index



I have a Lucene index that is frequently updated with new records. There are 5,000,000 records in the index, and I cache one of my numeric fields using FieldCache. After updating the index, however, it takes time to reload the FieldCache (I reload the cache because the documentation says doc IDs are not reliable). How can I minimize this overhead by adding only the newly added doc IDs to the FieldCache? This reload has become the bottleneck of my application.


IndexReader reader = IndexReader.Open(diskDir);
int[] dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This line takes 4 seconds to load the array
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This line takes ~0 seconds, as expected
// HERE we add some documents to the index, then reopen the reader to reflect the changes

reader = reader.Reopen();
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // This takes 4 seconds again to load the array

I want a mechanism that minimizes this time by adding only the newly added documents to the array. There is a technique like this one, http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html, that improves performance, but it still loads all the documents we already have. I think there is no need to reload them all if we can find a way to add only the newly added documents to the array.



Comments (2)

迷雾森÷林ヴ 2024-11-03 08:39:45

The FieldCache uses weak references to index readers as the keys for its cache (by calling IndexReader.GetCacheKey, which has been un-obsoleted). A standard call to IndexReader.Open with an FSDirectory will use a pool of readers, one for every segment.

You should always pass the innermost reader to the FieldCache. Check out ReaderUtil for some helpers that retrieve the individual reader a document is contained within. Document ids won't change within a segment; what the documentation means by calling them unpredictable/volatile is that they can change between two index commits, when deleted documents may have been pruned, segments merged, and so on.

A commit may need to remove a segment from disk (merged or optimized away), which means that new readers won't have that pooled segment reader, and garbage collection will remove it as soon as all older readers are closed.

Never, ever, call FieldCache.PurgeAllCaches(). It's meant for testing, not production use.

Added 2011-04-03: example code using subreaders.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;

// Grab all subreaders.
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);

// Walk the subreaders, rebasing documentId into the segment's local id
// space. While subReaderId is beyond the last document id in the
// subreader (MaxDoc() - 1), move on to the next one.
var subReaderId = documentId;
var subReader = subReaders.First(sub => {
    if (sub.MaxDoc() <= subReaderId) {
        subReaderId -= sub.MaxDoc();
        return false;
    }

    return true;
});

var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];
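
Per-segment caching is also what makes the reopen in the question cheap: after a Reopen, unchanged segments are served by the same pooled SegmentReader instances, so their FieldCache entries survive and only newly written segments must be loaded. Below is a minimal sketch of warming the cache this way, assuming the same Lucene.Net 2.9-era API as above; WarmFieldCache is a hypothetical helper name, not a library call.

// Hypothetical helper: warm the FieldCache segment by segment. After a
// Reopen, unchanged segments reuse their existing cache entries, so only
// new segments pay the load cost.
static void WarmFieldCache(IndexReader reader, string field)
{
    var subReaders = new List<IndexReader>();
    ReaderUtil.GatherSubReaders(subReaders, reader);

    foreach (var sub in subReaders)
    {
        // Cheap for already-cached segments; loads only the new ones.
        FieldCache_Fields.DEFAULT.GetInts(sub, field);
    }
}

// Usage: reopen and warm. Reopen returns the same instance if nothing
// changed, so only swap (and close the old reader) when it differs.
var newReader = reader.Reopen();
if (newReader != reader)
{
    WarmFieldCache(newReader, "newsdate");
    reader.Close();
    reader = newReader;
}

With 5,000,000 documents spread over a handful of segments, each refresh should then only load the small, newly flushed segments instead of the whole field.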

寄风 2024-11-03 08:39:45

Here's one way I've solved this problem. You'll need a background thread that constructs IndexSearcher instances, one at a time, on some interval. Keep using your current IndexSearcher instance until the new one from the background thread is ready, then swap it in as the current one. Each instance acts as a snapshot of the index at the time it was first opened. Note that the memory overhead of the FieldCache doubles, because you need two instances in memory at once. You can safely write to the IndexWriter while this is happening.
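
A minimal sketch of that snapshot-swap pattern, assuming a Lucene.Net 2.9-era API; SnapshotSearcher and its members are illustrative names, not a library class.

using System.Threading;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Illustrative wrapper: searches read Current, while a background thread
// periodically calls Refresh() to swap in a fresh snapshot.
class SnapshotSearcher
{
    private readonly Directory _directory;
    private IndexSearcher _current;

    public SnapshotSearcher(Directory directory)
    {
        _directory = directory;
        _current = new IndexSearcher(IndexReader.Open(_directory, true));
    }

    public IndexSearcher Current
    {
        get { return _current; }
    }

    // Building the new searcher (and its FieldCache) happens here,
    // off the search path.
    public void Refresh()
    {
        var fresh = new IndexSearcher(IndexReader.Open(_directory, true));
        var old = Interlocked.Exchange(ref _current, fresh);
        // In real code, close 'old' only after in-flight searches using
        // it have finished (e.g. via reference counting).
        old.Close();
    }
}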

If you need to, you can take this a step further and make index changes immediately available for search, although it can get tricky. You'll need to associate a RAMDirectory with each snapshot instance above to keep the changes in memory, and then create a second IndexWriter that points at that RAMDirectory. Every index write then goes to both IndexWriter instances. For searches, use a MultiSearcher across the RAMDirectory and your normal on-disk index. The RAMDirectory can be thrown away once the IndexSearcher it was coupled with is no longer in use. I'm glossing over some details here, but that's the general idea.
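
A rough sketch of the dual-write half of that idea, again assuming a Lucene.Net 2.9-era API; the analyzer, field name, and query below are placeholders.

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var ramDir = new RAMDirectory();
var diskDir = FSDirectory.Open(new System.IO.DirectoryInfo("index"));

var ramWriter = new IndexWriter(ramDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var diskWriter = new IndexWriter(diskDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);

// Every write goes to both indexes; the RAM copy is searchable right away.
var doc = new Document();
doc.Add(new Field("newsdate", "20110403", Field.Store.YES, Field.Index.NOT_ANALYZED));
ramWriter.AddDocument(doc);
diskWriter.AddDocument(doc);
ramWriter.Commit();

// Search spans the in-memory changes and the on-disk snapshot.
var ramSearcher = new IndexSearcher(IndexReader.Open(ramDir, true));
var diskSearcher = new IndexSearcher(IndexReader.Open(diskDir, true));
var multi = new MultiSearcher(new Searchable[] { ramSearcher, diskSearcher });
var hits = multi.Search(new TermQuery(new Term("newsdate", "20110403")), 10);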

Hope this helps.

