Lucene: Iterating all entries
I have a Lucene index that I would like to iterate over (for a one-time evaluation at the current stage of development).
I have 4 documents, each with a few hundred thousand up to a million entries, and I want to iterate over them to count the number of words in each entry (~2-10) and calculate the frequency distribution.
What I am doing at the moment is this:
for (int i = 0; i < reader.maxDoc(); i++) {
    // Skip document ids that are marked as deleted
    if (reader.isDeleted(i))
        continue;
    // Load the stored fields of document i
    Document doc = reader.document(i);
    Field text = doc.getField("myDocName#1");
    String content = text.stringValue();
    int wordLen = countNumberOfWords(content);
    // store
}
So far, it is iterating over something. Debugging confirms that it is at least operating on the terms stored in the document, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.
Comments (1)
First, you need to make sure you index with TermVectors enabled.
Then you can use
IndexReader.getTermFreqVector
to count the terms.
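Independently of how the text is read from the index, the counting step described in the question (words per entry, then a frequency distribution) could be sketched roughly as below. This is only a minimal sketch under assumptions: the class name WordCountDistribution and the method distribution are hypothetical, countNumberOfWords is a stand-in for the question's helper of the same name (its real implementation is not shown), and words are assumed to be separated by whitespace. It uses only the Java standard library and does not touch Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountDistribution {

    // Hypothetical stand-in for the question's countNumberOfWords:
    // counts whitespace-separated tokens (an assumption about what a "word" is).
    static int countNumberOfWords(String content) {
        if (content == null) {
            return 0;
        }
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Maps each word count to the number of entries that have that count.
    // A TreeMap keeps the distribution sorted by word count.
    static Map<Integer, Integer> distribution(Iterable<String> entries) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (String entry : entries) {
            histogram.merge(countNumberOfWords(entry), 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Made-up sample entries, for illustration only
        List<String> entries = Arrays.asList(
                "apache lucene", "full text search", "inverted index");
        System.out.println(distribution(entries)); // prints {2=2, 3=1}
    }
}
```

In the loop from the question, the "store" step would then amount to incrementing the histogram entry for wordLen.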