Lucene.Net 2.9.2: OOM exception when adding a large number of documents



I am trying to index about 10,000,000 documents with Lucene.NET 2.9.2. These documents (forum posts of varying length) are fetched in batches of 10,000 from an MSSQL database and then passed to my Lucene.NET wrapper class called LuceneCorpus:

public static void IndexPosts(LuceneCorpus luceneCorpus, IPostsRepository postsRepository, int chunkSize)
{
    // omitted: this whole method is executed in a background worker to enable GUI feedback
    // chunkSize is 10.000
    int count = 0;
    // totalSteps is ~10.000.000
    int totalSteps = postsRepository.All.Count();
    while (true)
    {
        var posts = postsRepository.All.Skip(count).Take(chunkSize).ToList();
        if (posts.Count == 0)
            break;
        luceneCorpus.AddPosts(posts);
        count += posts.Count;                   
    }
    luceneCorpus.OptimizeIndex();
}

I read that it is recommended to use a single IndexWriter instead of opening and closing a new one for each batch of documents. Therefore, my LuceneCorpus class looks like this:

using System.Collections.Generic;               // IEnumerable<Post>
using System.IO;                                // DirectoryInfo
using Lucene.Net.Analysis;                      // Analyzer
using Lucene.Net.Analysis.Standard;             // StandardAnalyzer
using Lucene.Net.Documents;                     // Document, Field
using Lucene.Net.Index;                         // IndexWriter
using Lucene.Net.Store;                         // FSDirectory
using Directory = Lucene.Net.Store.Directory;   // avoid ambiguity with System.IO.Directory
using Version = Lucene.Net.Util.Version;        // Version.LUCENE_29

public class LuceneCorpus
{
    private Analyzer _analyzer;
    private Directory _indexDir;
    private IndexWriter _writer;

    public LuceneCorpus(DirectoryInfo indexDirectory)
    {
        _indexDir = FSDirectory.Open(indexDirectory);
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);
        _writer = new IndexWriter(_indexDir, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetRAMBufferSizeMB(128);
    }

    public void AddPosts(IEnumerable<Post> posts)
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            _writer.AddDocument(doc);
        }
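        // Commit() flushes the buffered documents to the index directory; the same writer stays open across batches.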
        _writer.Commit();
    }

    public void OptimizeIndex()
    {
        _writer.Optimize();
    }
}

Now, my problem is that memory consumption keeps climbing until I finally hit an out-of-memory exception, somewhere in the IndexPosts method, after indexing about 700,000 documents.

As far as I know, the index writer should flush when it either reaches the RAM buffer size (128 MB) or when Commit() is called. And indeed, the writer definitely DOES flush, and it even keeps track of the flushes, but memory keeps filling up nevertheless. Is the writer somehow holding on to references to the documents so that they can't be garbage collected, or what am I missing here?

Thanks in advance!

Edit: I also tried initializing the writer, analyzer, and indexDir in the scope of the AddPosts method instead of class-wide, but that doesn't prevent the OOM exception either.


Comments (3)

墨落成白 2024-12-09 07:22:04


I read that it is recommended to use a single IndexWriter instead of
opening and closing a new one for each batch of documents.

That may be true in general, but your particular case seems to call for a different approach. You should try one writer per batch. Your large memory requirement is forcing you towards a less efficient solution: trading memory for speed, and vice versa, is a common trade-off.
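
Roughly, a per-batch AddPosts could look like the sketch below. This is only a minimal illustration that reuses the question's _indexDir and _analyzer fields; it assumes IndexReader.IndexExists is used to decide whether the first batch creates the index or a later batch appends to it:

public void AddPosts(IEnumerable<Post> posts)
{
    // Sketch: one short-lived writer per batch instead of a single long-lived writer.
    // The very first batch creates the index; every later batch appends to it.
    bool createIndex = !IndexReader.IndexExists(_indexDir);
    var writer = new IndexWriter(_indexDir, _analyzer, createIndex, IndexWriter.MaxFieldLength.UNLIMITED);
    try
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
        writer.Commit();
    }
    finally
    {
        // Closing the writer after every batch releases its internal buffers.
        writer.Close();
    }
}

OptimizeIndex would then also have to open its own short-lived writer (with create set to false) before calling Optimize().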

花开半夏魅人心 2024-12-09 07:22:04


Apparently Lucene wasn't causing the memory leak; the DataContext of my PostsRepository was. I solved it by using a temporary, non-tracking DataContext for each "Take" iteration.

Sorry, and thanks anyway!
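
For reference, a hypothetical sketch of that fix, assuming LINQ to SQL, a made-up ForumDataContext with a Posts table, and an Id column to page by. Each chunk is read through a short-lived DataContext with object tracking disabled, so fetched entities are not cached for the whole indexing run:

int count = 0;
while (true)
{
    List<Post> posts;
    using (var db = new ForumDataContext())    // hypothetical LINQ to SQL context
    {
        // Read-only access: with tracking disabled the DataContext does not
        // keep a reference to every Post it has ever materialized.
        db.ObjectTrackingEnabled = false;
        posts = db.Posts
                  .OrderBy(p => p.Id)          // stable ordering so Skip/Take pages are deterministic
                  .Skip(count)
                  .Take(chunkSize)
                  .ToList();
    } // the DataContext and its caches become collectable here
    if (posts.Count == 0)
        break;
    luceneCorpus.AddPosts(posts);
    count += posts.Count;
}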
