Lucene: iterating over all entries

Published on 2024-12-06 17:24:39

I have a Lucene index which I would like to iterate over (for a one-time evaluation at the current stage of development). I have 4 documents, each with a few hundred thousand up to a few million entries, which I want to iterate over to count the number of words in each entry (~2-10) and calculate the frequency distribution.

What I am doing at the moment is this:

for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    Field text = doc.getField("myDocName#1");

    String content = text.stringValue();

    int wordLen = countNumberOfWords(content);
    // store
}

So far, it is iterating over something. Debugging confirms that it is at least operating on the terms stored in the documents, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.
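For reference, countNumberOfWords and the "store" step are not shown above; a minimal, hypothetical sketch of what they might look like, assuming simple whitespace tokenization and a word-count histogram, is:

import java.util.Map;

class WordCountHelpers {
    // Hypothetical stand-in for the asker's countNumberOfWords:
    // counts whitespace-separated tokens in the stored field value.
    static int countNumberOfWords(String content) {
        if (content == null || content.trim().isEmpty()) {
            return 0;
        }
        return content.trim().split("\\s+").length;
    }

    // Hypothetical "store" step: histogram mapping word count -> number of entries.
    static void addToDistribution(Map<Integer, Integer> distribution, int wordLen) {
        Integer current = distribution.get(wordLen);
        distribution.put(wordLen, current == null ? 1 : current + 1);
    }
}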

Comments (1)

-小熊_ 2024-12-13 17:24:39

Firstly, you need to ensure you index with TermVectors enabled:

doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
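For context, a fuller indexing sketch (assuming Lucene 3.x with a StandardAnalyzer, and using the asker's "myDocName#1" field; the path and content parameters are placeholders) could look like this:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class TermVectorIndexer {
    // Adds one entry to the index with term vectors stored for "myDocName#1".
    static void indexEntry(String indexPath, String content) throws IOException {
        Directory dir = FSDirectory.open(new File(indexPath));
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        try {
            Document doc = new Document();
            doc.add(new Field("myDocName#1", content,
                    Field.Store.YES, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }
}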

Then you can use IndexReader.getTermFreqVector to count the terms:

TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);

// iterate over the documents in res, omitted for brevity

reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
    public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
        // increment the frequency count of termval by freq
        freqs.increment(termval, freq);
    }

    public void setExpectations(String arg0, int arg1, boolean arg2, boolean arg3) {}
});
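Putting it together for the original goal (iterating every document rather than only search hits): a minimal sketch, assuming Lucene 3.x and a plain HashMap in place of the undefined freqs object above, might be:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.store.FSDirectory;

class TermFrequencyCounter {
    // Sums term frequencies over all non-deleted documents for one field.
    // The field must have been indexed with term vectors enabled.
    static Map<String, Integer> countTerms(String indexPath, String field) throws IOException {
        final Map<String, Integer> freqs = new HashMap<String, Integer>();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                reader.getTermFreqVector(i, field, new TermVectorMapper() {
                    @Override
                    public void map(String term, int frequency,
                                    TermVectorOffsetInfo[] offsets, int[] positions) {
                        Integer current = freqs.get(term);
                        freqs.put(term, current == null ? frequency : current + frequency);
                    }

                    @Override
                    public void setExpectations(String field, int numTerms,
                                                boolean storeOffsets, boolean storePositions) {
                        // nothing to prepare for this simple count
                    }
                });
            }
        } finally {
            reader.close();
        }
        return freqs;
    }
}

Walking maxDoc() with an isDeleted check (as in the question) avoids the 1000-hit cap of the search-based variant, since a one-time evaluation wants every stored entry, not just the top results of a query.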