Filtered term counts in Lucene (Java)

Posted on 2025-01-07 15:20:55

I'm currently trying to get the number of occurrences of each word in a description field using Lucene.
For example:

  • description: BOX OF APPLES
  • description: BOX OF BANANAS

Output:

  • BOX 2
  • OF 2
  • APPLES 1
  • BANANAS 1

I want to get each word together with its frequency.

The thing is, I would like to restrict those results to a given document; that is, count only the words in the description field of that document.

Thanks for any assistance given.

Edit, in answer to a comment: I have something like this:

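    import java.util.ArrayList;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public ArrayList<ObjectA> GetIndexTerms(String code) {
        try {
            ArrayList<ObjectA> termlist = new ArrayList<ObjectA>();
            indexR = IndexReader.open(path);
            TermEnum terms = indexR.terms();

            // Walks every term in the whole index, across all documents.
            while (terms.next()) {
                Term term = terms.term();
                String termText = term.text();
                // docFreq is the number of documents that contain the term.
                int frequency = indexR.docFreq(term);
                ObjectA newObj = new ObjectA(termText, frequency);
                termlist.add(newObj);
            }
            terms.close();
            return termlist;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }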

But I don't see how to filter it by document.


Update (today):

Using the TermFreqVector I can get it to work, but it takes the doc ID and I can't get that right. Since I loop over the results of a query, the "i" value starts at 0, and that is not the proper doc ID. Any ideas to get this working properly?
Thanks!

    TopDocs tp = indexS.search(query, Integer.MAX_VALUE);
    for (int i = 0; i < tp.scoreDocs.length; i++) {
        ScoreDoc sds = tp.scoreDocs[i];
        Document doc = indexS.doc(sds.doc);
        TermFreqVector tfv = indexR.getTermFreqVector(i, "description");

        for (int j = 0; j < tfv.getTerms().length; j++) {
            String item = tfv.getTerms()[j];
            termlist.add(new TerminoDescripcion(item.toUpperCase(), tfv.getTermFrequencies()[j]));
        }
    }


Comments (1)

月隐月明月朦胧 2025-01-14 15:20:55

The problem is that Lucene is an inverted index, meaning that it makes it easy to retrieve documents based on terms, whereas you are looking for the opposite, i.e. retrieving terms based on documents.
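
To make the asymmetry concrete, here is a minimal library-free sketch of an inverted index. It is a toy model only: the class `ToyInvertedIndex` and its methods are hypothetical names, tokenization is a plain whitespace split, and Lucene's real data structures are far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of an inverted index (hypothetical, not Lucene's implementation).
class ToyInvertedIndex {
    // term -> IDs of the documents that contain it (the "postings")
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, String text) {
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Cheap direction: one lookup by term.
    int docFreq(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Expensive direction: every posting list is scanned for one document.
    List<String> termsOf(int docId) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) terms.add(e.getKey());
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "BOX OF APPLES");
        index.add(1, "BOX OF BANANAS");
        System.out.println(index.docFreq("BOX")); // 2
        System.out.println(index.termsOf(1));     // [BANANAS, BOX, OF]
    }
}
```

Note that the toy index also loses the per-document counts: it only records that a term occurs in a document, not how often.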

Fortunately, this is a recurring problem, and Lucene gives you the ability to retrieve the terms of a document (term vectors), provided that you enabled this feature at indexing time.

See TermVector.YES and the Field constructor for how to enable them at indexing time, and IndexReader for how to retrieve term vectors at search time.
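
As a rough illustration of what term vectors give you, the sketch below models them without Lucene as a map from doc ID to per-document term frequencies, then sums the counts over the documents matched by a query. The class `ToyTermVectors` and its methods are hypothetical. The one point that carries over to the question's code is that the lookup is keyed by each hit's own doc ID, not by the loop index; with the Lucene 3.x API used in the question, that would mean passing `sds.doc` rather than `i` to `getTermFreqVector`.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for per-document term vectors (hypothetical, not Lucene's API).
class ToyTermVectors {
    // docId -> (term -> frequency within that document)
    private final Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();

    void store(int docId, String text) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) vector.merge(token, 1, Integer::sum);
        }
        vectors.put(docId, vector);
    }

    // Sums term frequencies over the matching documents only.
    Map<String, Integer> countTerms(int[] hitDocIds) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int i = 0; i < hitDocIds.length; i++) {
            int docId = hitDocIds[i]; // the hit's doc ID, not the loop index i
            Map<String, Integer> vector = vectors.get(docId);
            if (vector == null) continue; // no vector stored for this document
            for (Map.Entry<String, Integer> e : vector.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        ToyTermVectors tv = new ToyTermVectors();
        tv.store(0, "BOX OF APPLES");
        tv.store(1, "BOX OF BANANAS");
        // Only document 1 matches; using i (= 0) would count document 0 instead.
        System.out.println(tv.countTerms(new int[] {1})); // {BANANAS=1, BOX=1, OF=1}
    }
}
```

The null check mirrors a case worth guarding against in real code as well: a document for which no term vector was stored for the requested field.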

Alternatively, you could re-analyze a stored field on the fly, but this may be slower, especially on large fields.
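
A minimal sketch of that fallback, assuming the description text has already been read back from the stored field: it re-tokenizes the text and counts terms in memory. The class `ReanalyzeSketch` is a hypothetical name, and the split-and-uppercase step is only a crude stand-in for the Analyzer used at indexing time, so counts can differ from what the index holds if that analyzer stems terms or removes stop words.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts terms in a stored field value without term vectors.
class ReanalyzeSketch {
    static Map<String, Integer> termFrequencies(String storedText) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : storedText.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // a leading delimiter yields an empty token
            counts.merge(token.toUpperCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("BOX OF APPLES")); // {APPLES=1, BOX=1, OF=1}
    }
}
```

The cost grows with the length of the field because the whole text is scanned on every call, which is why stored term vectors tend to be preferable for large fields.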
