Lucene.Net 2.9.2: OOM exception when adding a large number of documents
I am trying to index about 10,000,000 documents with Lucene.NET 2.9.2. These documents (forum posts of varying length) are fetched in batches of 10,000 from an MSSQL database and then passed to my Lucene.NET wrapper class called LuceneCorpus:
public static void IndexPosts(LuceneCorpus luceneCorpus, IPostsRepository postsRepository, int chunkSize)
{
    // omitted: this whole method is executed in a background worker to enable GUI feedback
    // chunkSize is 10,000
    int count = 0;
    // totalSteps is ~10,000,000
    int totalSteps = postsRepository.All.Count();
    while (true)
    {
        var posts = postsRepository.All.Skip(count).Take(chunkSize).ToList();
        if (posts.Count == 0)
            break;
        luceneCorpus.AddPosts(posts);
        count += posts.Count;
    }
    luceneCorpus.OptimizeIndex();
}
I read that it is recommended to use a single IndexWriter instead of opening and closing a new one for each batch of documents. Therefore, my LuceneCorpus class looks like this:
public class LuceneCorpus
{
    private Analyzer _analyzer;
    private Directory _indexDir;
    private IndexWriter _writer;

    public LuceneCorpus(DirectoryInfo indexDirectory)
    {
        _indexDir = FSDirectory.Open(indexDirectory);
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);
        _writer = new IndexWriter(_indexDir, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetRAMBufferSizeMB(128);
    }

    public void AddPosts(IEnumerable<Post> posts)
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            _writer.AddDocument(doc);
        }
        _writer.Commit();
    }

    public void OptimizeIndex()
    {
        _writer.Optimize();
    }
}
Now, my problem is that memory consumption grows steadily until I finally hit an out-of-memory exception after indexing about 700,000 documents, somewhere inside the IndexPosts method.
As far as I know, the index writer should flush when it either reaches the RAM buffer size (128 MB) or Commit() is called. In fact, the writer definitely DOES flush, and it even keeps track of the flushes, but memory keeps filling up nevertheless. Is the writer somehow keeping references to the documents so that they aren't garbage collected, or what am I missing here?
Thanks in advance!
Edit: I also tried initializing the writer, analyzer and indexDir in the scope of the AddPosts method instead of class-wide but that doesn't prevent the OOM exception either.
Comments (3)
Try the latest and greatest. It has some memory-leak fixes:
https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g/src/
That may be true in general, but your special case seems to demand another approach. You should try a writer per batch. Your large memory requirement is forcing you toward a less efficient solution. Trading memory for speed, and vice versa, is common.
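As a rough illustration of that suggestion, here is a minimal sketch of a writer-per-batch variant of AddPosts, reusing the field names from the question. The method name AddPostsWithFreshWriter is made up for this example, and it assumes the index has already been created once so each new writer can open it in append mode (create: false):

// Hypothetical writer-per-batch variant of AddPosts (a sketch, not the asker's code).
// Each chunk gets its own short-lived IndexWriter that is closed afterwards,
// so whatever the writer buffers internally is released after every batch.
public void AddPostsWithFreshWriter(IEnumerable<Post> posts)
{
    var writer = new IndexWriter(_indexDir, _analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
    try
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }
    finally
    {
        writer.Close(); // commits pending changes and frees the writer's resources
    }
}

Opening and closing a writer for every 10,000-document chunk costs some indexing throughput, but it bounds how long any writer-side state can accumulate.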
Apparently Lucene wasn't causing the memory leak; the DataContext of my PostsRepository was. I solved it by using a temporary, non-tracking DataContext for each "Take" iteration.
Sorry, and thanks anyway!
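For completeness, a minimal sketch of what such a per-chunk, non-tracking fetch can look like with LINQ to SQL; PostsDataContext, its Posts table, and the Id ordering are assumptions standing in for whatever the repository actually uses:

// Hypothetical helper: a fresh DataContext per chunk with object tracking
// disabled, so the materialized Post rows are not pinned by the context's
// identity map for the whole indexing run.
private static List<Post> FetchChunk(string connectionString, int skip, int take)
{
    using (var db = new PostsDataContext(connectionString)) // assumed LINQ to SQL context
    {
        db.ObjectTrackingEnabled = false; // read-only access, no change tracking
        return db.Posts
                 .OrderBy(p => p.Id)      // Skip/Take paging needs a stable order
                 .Skip(skip)
                 .Take(take)
                 .ToList();
    }
}

With a single long-lived, tracking DataContext, every row materialized by Skip/Take stays referenced by the context, which matches the steadily growing memory described in the question.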