Lucene IndexWriter slow to add documents
I wrote a small loop that adds 10,000 documents to an IndexWriter, and it took forever to complete.
Is there another way to index large volumes of documents?
I ask because when this goes live it has to load 15,000 records.
My other question: how do I avoid having to load all the records again when the web application is restarted?
Edit
Here is the code I used:
    // Build one single-field document per iteration and add it to the index.
    for (int t = 0; t < 10000; t++)
    {
        Document doc = new Document();
        string text = "Value" + t.ToString();
        doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
        iwriter.AddDocument(doc);
    }
Edit 2
Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
Then comes the code that adds the documents, and finally:
iwriter.Close();
Answers (2)
You should do it this way to get the best performance. On my machine I index 1,000 documents in 1 second.
1) Reuse the Document and Field instances instead of creating new ones every time you add a document.
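A minimal sketch of that reuse pattern, assuming Lucene.Net 2.3 or later, where `Field.SetValue` is available (the 2.0 line mentioned elsewhere in this thread does not have it). The class and method names `ReuseExample` and `IndexValues` are hypothetical; the field name and flags are taken from the question.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class ReuseExample
{
    // Adds `count` documents while reusing a single Document and Field.
    public static void IndexValues(IndexWriter writer, int count)
    {
        // Create the field and the document once, outside the loop.
        Field valueField = new Field("Value", "", Field.Store.YES, Field.Index.TOKENIZED);
        Document doc = new Document();
        doc.Add(valueField);

        for (int t = 0; t < count; t++)
        {
            // Assumption: Field.SetValue exists (Lucene.Net 2.3+).
            // Only the field's value changes between iterations.
            valueField.SetValue("Value" + t);
            writer.AddDocument(doc);
        }
    }
}
```

The reused instances must not be shared between threads; if you index from several threads, give each thread its own Document and Field.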
After that, you could use threading: break your large collection into smaller sections and run the same indexing code on each section.
For example, if you have 10,000 documents, you can create 10 threads using ThreadPool and feed each section to
one thread for indexing.
Then you will get the best performance.
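The threading idea above could look roughly like the sketch below. It assumes a single IndexWriter shared by all worker threads (IndexWriter is designed to be used from multiple threads) and the same one-field documents as in the question. `ParallelIndexExample` and `IndexInParallel` are hypothetical names, and `parts` must be at least 1. How much this helps depends on the Lucene.Net version and the hardware, so measure before relying on it.

```csharp
using System;
using System.Threading;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class ParallelIndexExample
{
    // Splits [0, total) into `parts` contiguous ranges and indexes each
    // range on a ThreadPool thread, all writing to the same IndexWriter.
    public static void IndexInParallel(IndexWriter writer, int total, int parts)
    {
        int pending = parts;                      // work items still running
        int chunk = (total + parts - 1) / parts;  // ceiling division

        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            for (int p = 0; p < parts; p++)
            {
                // Per-iteration copies so each work item sees its own range.
                int start = p * chunk;
                int end = Math.Min(start + chunk, total);

                ThreadPool.QueueUserWorkItem(delegate
                {
                    try
                    {
                        for (int t = start; t < end; t++)
                        {
                            // Documents are created per thread, never shared.
                            Document doc = new Document();
                            doc.Add(new Field("Value", "Value" + t,
                                Field.Store.YES, Field.Index.TOKENIZED));
                            writer.AddDocument(doc);
                        }
                    }
                    finally
                    {
                        // The last work item to finish releases the caller.
                        if (Interlocked.Decrement(ref pending) == 0)
                        {
                            done.Set();
                        }
                    }
                });
            }

            // Block until every range has been indexed.
            done.WaitOne();
        }
    }
}
```

Calling `ParallelIndexExample.IndexInParallel(iwriter, 10000, 10)` matches the 10,000 documents and 10 threads from the example; close the writer only after the call returns.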
Just checking, but you haven't got the debugger attached when you're running it, have you?
This severely affects performance when adding documents.
On my machine (Lucene 2.0.0.4):
Built with platform target x86:
No debugger - 5.2 seconds
Debugger attached - 113.8 seconds
Built with platform target x64:
No debugger - 6.0 seconds
Debugger attached - 171.4 seconds
Rough example of saving and loading an index to and from a RAMDirectory:
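A minimal sketch of such a round trip, assuming the Lucene.Net 2.x API used in the question (`FSDirectory.GetDirectory`, `IndexWriter.AddIndexes`, and the `RAMDirectory(Directory)` copy constructor). The class name `RamIndexPersistence`, its method names, and the `indexPath` parameter are hypothetical. This also addresses the restart question: on startup, load the saved index instead of re-adding every record.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class RamIndexPersistence
{
    // Copies an in-memory index to disk. Close the IndexWriter that built
    // ramDirectory before calling this, so all documents are flushed.
    public static void Save(Directory ramDirectory, Analyzer analyzer, string indexPath)
    {
        // `true` creates (or overwrites) the on-disk index at indexPath.
        FSDirectory fileDirectory = FSDirectory.GetDirectory(indexPath, true);
        IndexWriter fileWriter = new IndexWriter(fileDirectory, analyzer, true);

        // Merge the RAM index into the file-based index.
        fileWriter.AddIndexes(new Directory[] { ramDirectory });
        fileWriter.Close();
    }

    // Loads a previously saved index from disk into a new RAMDirectory.
    public static Directory Load(string indexPath)
    {
        // `false` opens the existing index without recreating it.
        FSDirectory fileDirectory = FSDirectory.GetDirectory(indexPath, false);
        return new RAMDirectory(fileDirectory);
    }
}
```

If the index does not need to live in memory, opening the FSDirectory directly for searching avoids the copy on every restart.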