Lucene (Java) 中的过滤术语计数

发布于 2025-01-07 15:20:55 字数 1858 浏览 1 评论 0 原文

我目前正在尝试使用 Lucene 获取描述字段中每个单词的出现次数。 Fe

描述：BOX OF APPLES
描述：BOX OF BANANAS

输出：

BOX 2
OF 2
APPLES 1
BANANAS 1

我正在寻找单词和频率。

问题是我想将这些结果过滤到给定文档，我的意思是只计算给定文档的描述字段中的单词。

感谢您提供的任何帮助。

//回答评论：我有这样的东西：

public ArrayList<ObjectA> GetIndexTerms(String code) {
        try {

            ArrayList<Object> termlist = new ArrayList<ObjectA>();
            indexR = IndexReader.open(path); 
            TermEnum terms = indexR.terms();           

            while (terms.next()) {
                Term term = terms.term();
                String termText = term.text();                    
                int frequency = indexR.docFreq(term); 
                ObjectA newObj = new ObjectA(termText, frequency);
                termlist.add(newObj);                      
                }                   
            }               
            return termlist;
        } catch (Exception ex) {               
            ex.printStackTrace();
            return null;
        }
}

但我不知道如何按文档过滤它...

//今天！

使用 termfreqvec 我可以让它工作，但它需要 de doc id，我无法正确使用它。因为我使用了一个从 0 开始的查询 de“i”值，所以这不是正确的文档 ID。有什么想法可以让它正常工作吗？谢谢！

    TopDocs tp = indexS.search(query, Integer.MAX_VALUE);
        for (int i = 0; i < tp.scoreDocs.length; i++){  
            ScoreDoc sds = tp.scoreDocs[i];
            Document doc = indexS.doc(sds.doc);
            TermFreqVector tfv = indexR.getTermFreqVector(i,"description");

            for (int j = 0; j < tfv.getTerms().length; j++) {
                String item = tfv.getTerms()[j];
                termlist.add(new TerminoDescripcion(item.toUpperCase(), tfv.getTermFrequencies()[j]));
            }
        }

原文

I'm currently trying to get the amount of appearences of each word in a description field using Lucene.
F.e.

description: BOX OF APPLES
description: BOX OF BANANAS

output:

BOX 2
OF 2
APPLES 1
BANANAS 1

I am looking to get the word and the frequency.

The thing is I would like to filter those results to a given document, I mean only count the words in the description field of a given document.

Thanks for any assistance given.

//in answer to comment:
I have something like this:

public ArrayList<ObjectA> GetIndexTerms(String code) {
        try {

            ArrayList<Object> termlist = new ArrayList<ObjectA>();
            indexR = IndexReader.open(path); 
            TermEnum terms = indexR.terms();           

            while (terms.next()) {
                Term term = terms.term();
                String termText = term.text();                    
                int frequency = indexR.docFreq(term); 
                ObjectA newObj = new ObjectA(termText, frequency);
                termlist.add(newObj);                      
                }                   
            }               
            return termlist;
        } catch (Exception ex) {               
            ex.printStackTrace();
            return null;
        }
}

But i don't see how to filter it by document...

//TODAY!

Using the termfreqvec I can get it to work but it takes de doc id and I can't use it right. Since I used a query de "i" value starts in 0 and that's not the proper doc id. Any ideas to get this working properly?
Thanks!

    TopDocs tp = indexS.search(query, Integer.MAX_VALUE);
        for (int i = 0; i < tp.scoreDocs.length; i++){  
            ScoreDoc sds = tp.scoreDocs[i];
            Document doc = indexS.doc(sds.doc);
            TermFreqVector tfv = indexR.getTermFreqVector(i,"description");

            for (int j = 0; j < tfv.getTerms().length; j++) {
                String item = tfv.getTerms()[j];
                termlist.add(new TerminoDescripcion(item.toUpperCase(), tfv.getTermFrequencies()[j]));
            }
        }

分享到QQ

分享到微博