Importance of words in a Lucene index

Published 2024-09-11 09:37:24


Hmm, I need to work out how important a word is across the entire document collection that is indexed in the Lucene index. I need to extract some "representative words", let's say concepts that are common and can represent the whole collection, or collection "keywords". I did the full-text indexing, and the only field I am using is the text contents, because the titles of the documents are mostly not representative (numbers, codes, etc.).

EDIT:
I am reading the index which contains maybe 60 documents....

    int numDocs = fReader.numDocs();          // total number of documents in the index

    TermEnum termEnum = fReader.terms();      // enumerate every term in the index
    while (termEnum.next()) {
        Term term = termEnum.term();
        double df = fReader.docFreq(term);    // document frequency: how many docs contain the term

        TermDocs termDocs = fReader.termDocs(term);

        // HERE is what I mean when I say tf-idf is per document:
        while (termDocs.next()) {
            double tf = termDocs.freq();      // term frequency inside the current document
            // Calculate tfidf.......
        }
        termDocs.close();
    }
    termEnum.close();

So I will get the tf-idf of this term, but separately for every document that we loop through. And I do not need these results:

tfidf(term1, doc1);

tfidf(term1, doc2);

tfidf(term1, doc3);
... and so on.
I need some measure of the importance of this term in the whole collection. Intuitively, it would be something like: if the term "term1" has a good tf-idf in 5 documents, then it is important.

But of course, something smarter :)
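
To make the intuition a bit more concrete, this is roughly the kind of aggregation I have in mind, built on the same 3.x API as above (just a sketch; the summed tf*idf score is my own naive guess, not something I have verified):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // Naive sketch: one collection-level score per term, obtained by summing
    // tf * idf over every document that contains the term (Lucene 3.x API).
    public static Map<String, Double> collectionTermScores(IndexReader reader) throws IOException {
        Map<String, Double> scores = new HashMap<String, Double>();
        int numDocs = reader.numDocs();

        TermEnum termEnum = reader.terms();
        while (termEnum.next()) {
            Term term = termEnum.term();
            int df = reader.docFreq(term);
            // One common idf variant; the +1 avoids division by zero.
            double idf = Math.log((double) numDocs / (1 + df));

            double score = 0.0;
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                score += termDocs.freq() * idf;   // accumulate tf * idf across documents
            }
            termDocs.close();

            scores.put(term.text(), score);
        }
        termEnum.close();
        return scores;
    }

Sorting that map by value (highest first) would then give me the collection "keywords", but I am not sure summing tf*idf like this is the right aggregation, which is why I am asking.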

Thank you!!!

Comments (4)

无风消散 2024-09-18 09:37:24


"So, if I calculate tf-idf, it gives me the importance of a single term with respect to a single document."

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
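
For example, with the same 3.x API you are already using, one simple reading of that is to rank every term by its document frequency (just a sketch; it enumerates the terms of every field, so you would want to restrict it to your contents field and strip stop words first):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    // Sketch: terms of the index ordered by document frequency, highest first (Lucene 3.x API).
    public static List<Map.Entry<String, Integer>> termsByDocFreq(IndexReader reader) throws IOException {
        Map<String, Integer> df = new HashMap<String, Integer>();
        TermEnum termEnum = reader.terms();
        while (termEnum.next()) {
            df.put(termEnum.term().text(), termEnum.docFreq());   // docFreq of the current term
        }
        termEnum.close();

        List<Map.Entry<String, Integer>> ranked = new ArrayList<Map.Entry<String, Integer>>(df.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();   // descending document frequency
            }
        });
        return ranked;
    }

Note that without stop-word removal the very top of that list will mostly be stop words, so combining it with idf (or filtering a stop list) makes the result more meaningful.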

眼藏柔 2024-09-18 09:37:24


You can try opening the index with Luke; it shows you the top-ranked terms.

断念 2024-09-18 09:37:24


EDIT: I still do not get what you are trying to achieve.
A high TF/IDF value means that this term is useful for differentiating this document from the rest of the collection, that is: this term is relatively more frequent in the specific document than in the collection in general. Therefore it "represents" the document against the collection background. Is this what you want?

One possible way to rephrase your question is that you want to compress the collection using a few high-frequency terms. This means words that appear a lot in the collection, which you can find by taking words with a low idf.

Another alternative is that you want some concise way to represent the collection against a more general background, say a larger collection or the whole WWW. In that case, you want to compare word frequencies between the collections, consider the mutual information between the word type and the collection, or use other feature selection methods.
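
For example, one simple score of that kind (my own suggestion, assuming you can get word frequencies from some background corpus) is the pointwise mutual information between a word w and your collection C:

    PMI(w, C) = log( P(w | C) / P(w) )

where P(w | C) is the relative frequency of w inside your collection and P(w) is its relative frequency in the background corpus; words with a high PMI are over-represented in your collection and are good candidates for collection keywords.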

If I still miss your point, please say so.

朕就是辣么酷 2024-09-18 09:37:24


The contrib/ folder has a class to generate a list of the most frequent terms: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java

If you're instead looking for semantic feature extraction, you can check out http://project.carrot2.org/
