Lucene IndexWriter slow to add documents

Posted on 2024-09-10 21:13:47

I wrote a small loop that adds 10,000 documents to the IndexWriter, and it took forever to finish.

Is there another way to index large volumes of documents?

I ask because when this goes live it has to load in 15,000 records.

My other question: how do I avoid having to load all the records again when the web application is restarted?

Edit

Here is the code I used:

for (int t = 0; t < 10000; t++)
{
    Document doc = new Document();
    // String concatenation converts the int; C# has no toString() on int (it is ToString()).
    string text = "Value" + t;
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    iwriter.AddDocument(doc);
}

Edit 2

        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();

        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);

        iwriter.SetMaxFieldLength(25000);

Then comes the code to add the documents, and finally:

        iwriter.Close();

Comments (2)

孤单情人 2024-09-17 21:13:47

For the best performance, do it this way. On my machine I index 1,000 documents in 1 second.

1) Reuse the Document and Field instances instead of creating new ones every time you add a document, like this:

private static void IndexingThread(object contextObj)
{
     Range<int> range = (Range<int>)contextObj;
     Document newDoc = new Document();
     newDoc.Add(new Field("title", "", Field.Store.NO, Field.Index.ANALYZED));
     newDoc.Add(new Field("body", "", Field.Store.NO, Field.Index.ANALYZED));
     newDoc.Add(new Field("newsdate", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
     newDoc.Add(new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

     for (int counter = range.Start; counter <= range.End; counter++)
     {
         newDoc.GetField("title").SetValue(Entities[counter].Title);
         newDoc.GetField("body").SetValue(Entities[counter].Body);
         newDoc.GetField("newsdate").SetValue(Entities[counter].NewsDate);
         newDoc.GetField("id").SetValue(Entities[counter].ID.ToString());

         writer.AddDocument(newDoc);
     }
}

After that, you can use threading: break your large collection into smaller sections and run the code above on each one. For example, if you have 10,000 documents, you can use the ThreadPool to run 10 threads and feed one section to each thread for indexing.

That should give you the best performance.
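
The splitting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `IndexRange`, `RangePartitioner`, `Split` and `RunOnThreadPool` are hypothetical names that do not come from the answer or from Lucene, and the `work` delegate stands in for a method like `IndexingThread` above. Each worker should build its own reusable Document, since the loop above mutates the fields in place.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the Range<int> type used in the answer above.
public struct IndexRange
{
    public readonly int Start; // inclusive
    public readonly int End;   // inclusive

    public IndexRange(int start, int end)
    {
        Start = start;
        End = end;
    }
}

public static class RangePartitioner
{
    // Splits the indexes 0..total-1 into at most `parts` contiguous ranges.
    public static List<IndexRange> Split(int total, int parts)
    {
        List<IndexRange> ranges = new List<IndexRange>();
        if (total <= 0 || parts <= 0)
        {
            return ranges;
        }

        int size = (total + parts - 1) / parts; // ceiling division
        for (int start = 0; start < total; start += size)
        {
            int end = Math.Min(start + size, total) - 1;
            ranges.Add(new IndexRange(start, end));
        }
        return ranges;
    }

    // Queues one ThreadPool work item per range and blocks until all have finished.
    public static void RunOnThreadPool(int total, int parts, Action<IndexRange> work)
    {
        List<IndexRange> ranges = Split(total, parts);
        using (CountdownEvent done = new CountdownEvent(ranges.Count))
        {
            foreach (IndexRange range in ranges)
            {
                IndexRange current = range; // private copy for the closure
                ThreadPool.QueueUserWorkItem(delegate
                {
                    try { work(current); }
                    finally { done.Signal(); }
                });
            }
            done.Wait();
        }
    }
}
```

With 10,000 documents and 10 parts, `Split` yields the ranges 0-999, 1000-1999, and so on up to 9000-9999, so a call such as `RangePartitioner.RunOnThreadPool(10000, 10, r => IndexingThread(r))` would hand one section to each work item, provided `IndexingThread` is adapted to accept the hypothetical `IndexRange`.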

野生奥特曼 2024-09-17 21:13:47

Just checking, but you haven't got the debugger attached when you're running it, have you?

An attached debugger severely degrades performance when adding documents.

On my machine (Lucene 2.0.0.4):

Built with platform target x86:

  • No debugger - 5.2 seconds

  • Debugger attached - 113.8 seconds

Built with platform target x64:

  • No debugger - 6.0 seconds

  • Debugger attached - 171.4 seconds
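
To reproduce this kind of comparison, a minimal timing helper might look like the sketch below. It is not the code that produced the numbers above: `IndexingTimer`, `Measure` and `Describe` are hypothetical names, and the `work` delegate is assumed to wrap your own indexing loop. Only `Stopwatch` and `Debugger.IsAttached` are standard .NET APIs.

```csharp
using System;
using System.Diagnostics;

public static class IndexingTimer
{
    // Runs the given work once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action work)
    {
        if (work == null)
        {
            throw new ArgumentNullException("work");
        }

        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }

    // Formats the elapsed time and notes whether a debugger is attached to this process.
    public static string Describe(TimeSpan elapsed)
    {
        string mode = Debugger.IsAttached ? "debugger attached" : "no debugger";
        return string.Format("{0:F1} seconds ({1})", elapsed.TotalSeconds, mode);
    }
}
```

Wrapping the indexing loop in `IndexingTimer.Measure(...)` and running the same build once with Ctrl+F5 and once with F5 in Visual Studio should show whether the debugger accounts for the slowdown.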

Rough example of saving and loading an index to and from a RAMDirectory:

const int DocumentCount = 10 * 1000;
const string IndexFilePath = @"X:\Temp\tmp.idx";

Analyzer analyzer = new StandardAnalyzer();
Directory ramDirectory = new RAMDirectory();

IndexWriter indexWriter = new IndexWriter(ramDirectory, analyzer, true);

for (int i = 0; i < DocumentCount; i++)
{
    Document doc = new Document();
    string text = "Value" + i;
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    indexWriter.AddDocument(doc);
}

indexWriter.Close();

//Save index
FSDirectory fileDirectory = FSDirectory.GetDirectory(IndexFilePath, true);
IndexWriter fileIndexWriter = new IndexWriter(fileDirectory, analyzer, true);
fileIndexWriter.AddIndexes(new[] { ramDirectory });
fileIndexWriter.Close();

//Load index
FSDirectory newFileDirectory = FSDirectory.GetDirectory(IndexFilePath, false);
Directory newRamDirectory = new RAMDirectory();
IndexWriter newIndexWriter = new IndexWriter(newRamDirectory, analyzer, true);
newIndexWriter.AddIndexes(new[] { newFileDirectory });

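Console.WriteLine("New index writer document count:{0}.", newIndexWriter.DocCount());

// Close the writer once you are done with it so the index is flushed and released.
newIndexWriter.Close();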