Lucene.Net 2.9.2: OOM exception when adding a large number of documents



I am trying to index about 10,000,000 documents with Lucene.NET 2.9.2. These documents (forum posts of varying length) are fetched in batches of 10,000 from an MSSQL database and then passed to my Lucene.NET wrapper class called LuceneCorpus:

public static void IndexPosts(LuceneCorpus luceneCorpus, IPostsRepository postsRepository, int chunkSize)
{
    // omitted: this whole method is executed in a background worker to enable GUI feedback
    // chunkSize is 10.000
    int count = 0;
    // totalSteps is ~10.000.000
    int totalSteps = postsRepository.All.Count();
    while (true)
    {
        var posts = postsRepository.All.Skip(count).Take(chunkSize).ToList();
        if (posts.Count == 0)
            break;
        luceneCorpus.AddPosts(posts);
        count += posts.Count;                   
    }
    luceneCorpus.OptimizeIndex();
}

I read that it is recommended to use a single IndexWriter instead of opening and closing a new one for each batch of documents. Therefore, my LuceneCorpus class looks like this:

using System.Collections.Generic;               // IEnumerable<Post>
using System.IO;                                // DirectoryInfo
using Lucene.Net.Analysis;                      // Analyzer
using Lucene.Net.Analysis.Standard;             // StandardAnalyzer
using Lucene.Net.Documents;                     // Document, Field
using Lucene.Net.Index;                         // IndexWriter
using Lucene.Net.Store;                         // FSDirectory
using Directory = Lucene.Net.Store.Directory;   // avoid ambiguity with System.IO.Directory
using Version = Lucene.Net.Util.Version;        // Version.LUCENE_29

public class LuceneCorpus
{
    private Analyzer _analyzer;
    private Directory _indexDir;
    private IndexWriter _writer;

    public LuceneCorpus(DirectoryInfo indexDirectory)
    {
        _indexDir = FSDirectory.Open(indexDirectory);
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);
        _writer = new IndexWriter(_indexDir, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetRAMBufferSizeMB(128);
    }

    public void AddPosts(IEnumerable<Post> posts)
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            _writer.AddDocument(doc);
        }
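        // Commit() flushes the buffered documents to the index directory; the same writer stays open across batches.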
        _writer.Commit();
    }

    public void OptimizeIndex()
    {
        _writer.Optimize();
    }
}

Now, my problem is that memory consumption keeps climbing until I finally hit an out-of-memory exception, somewhere in the IndexPosts method, after indexing about 700,000 documents.

As far as I know, the index writer should flush when it either reaches the RAM buffer size (128 MB) or when Commit() is called. And indeed, the writer definitely DOES flush, and it even keeps track of the flushes, but memory keeps filling up nevertheless. Is the writer somehow holding on to references to the documents so that they can't be garbage collected, or what am I missing here?

Thanks in advance!

Edit: I also tried initializing the writer, analyzer, and indexDir in the scope of the AddPosts method instead of class-wide, but that doesn't prevent the OOM exception either.


Comments (3)

墨落成白 2024-12-09 07:22:04


I read that it is recommended to use a single IndexWriter instead of
opening and closing a new one for each batch of documents.

That may be true in general, but your particular case seems to call for a different approach. You should try one writer per batch. Your large memory requirement is forcing you towards a less efficient solution: trading memory for speed, and vice versa, is a common trade-off.
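
Roughly, a per-batch AddPosts could look like the sketch below. This is only a minimal illustration that reuses the question's _indexDir and _analyzer fields; it assumes IndexReader.IndexExists is used to decide whether the first batch creates the index or a later batch appends to it:

public void AddPosts(IEnumerable<Post> posts)
{
    // Sketch: one short-lived writer per batch instead of a single long-lived writer.
    // The very first batch creates the index; every later batch appends to it.
    bool createIndex = !IndexReader.IndexExists(_indexDir);
    var writer = new IndexWriter(_indexDir, _analyzer, createIndex, IndexWriter.MaxFieldLength.UNLIMITED);
    try
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
        writer.Commit();
    }
    finally
    {
        // Closing the writer after every batch releases its internal buffers.
        writer.Close();
    }
}

OptimizeIndex would then also have to open its own short-lived writer (with create set to false) before calling Optimize().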

花开半夏魅人心 2024-12-09 07:22:04


Apparently Lucene wasn't causing the memory leak; the DataContext of my PostsRepository was. I solved it by using a temporary, non-tracking DataContext for each "Take" iteration.

Sorry, and thanks anyway!
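
For reference, a hypothetical sketch of that fix, assuming LINQ to SQL, a made-up ForumDataContext with a Posts table, and an Id column to page by. Each chunk is read through a short-lived DataContext with object tracking disabled, so fetched entities are not cached for the whole indexing run:

int count = 0;
while (true)
{
    List<Post> posts;
    using (var db = new ForumDataContext())    // hypothetical LINQ to SQL context
    {
        // Read-only access: with tracking disabled the DataContext does not
        // keep a reference to every Post it has ever materialized.
        db.ObjectTrackingEnabled = false;
        posts = db.Posts
                  .OrderBy(p => p.Id)          // stable ordering so Skip/Take pages are deterministic
                  .Skip(count)
                  .Take(chunkSize)
                  .ToList();
    } // the DataContext and its caches become collectable here
    if (posts.Count == 0)
        break;
    luceneCorpus.AddPosts(posts);
    count += posts.Count;
}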
