Importance of words in a Lucene index

Published 2024-09-11 09:37:24


Hmm, I need to work out how important a word is across the entire document collection that is indexed in the Lucene index. I need to extract some "representative words", let's say concepts that are common and can represent the whole collection, or collection "keywords". I did the full-text indexing, and the only field I am using is the text contents, because the titles of the documents are mostly not representative (numbers, codes, etc.).

EDIT:
I am reading the index which contains maybe 60 documents....

    int numDocs = fReader.numDocs();          // total number of documents in the index

    TermEnum termEnum = fReader.terms();      // enumerate every term in the index
    while (termEnum.next()) {
        Term term = termEnum.term();
        double df = fReader.docFreq(term);    // document frequency: how many docs contain the term

        TermDocs termDocs = fReader.termDocs(term);

        // HERE is what I mean when I say tf-idf is per document:
        while (termDocs.next()) {
            double tf = termDocs.freq();      // term frequency inside the current document
            // Calculate tfidf.......
        }
        termDocs.close();
    }
    termEnum.close();

So I will get the tf-idf of this term, but separately for every document that we loop through. And I do not need these results:

tfidf(term1, doc1);

tfidf(term1, doc2);

tfidf(term1, doc3);
... and so on.
I need some measure of the importance of this term in the whole collection. Intuitively, it would be something like: if the term "term1" has a good tf-idf in 5 documents, then it is important.

But of course, something smarter :)
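
To make the intuition a bit more concrete, this is roughly the kind of aggregation I have in mind, built on the same 3.x API as above (just a sketch; the summed tf*idf score is my own naive guess, not something I have verified):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // Naive sketch: one collection-level score per term, obtained by summing
    // tf * idf over every document that contains the term (Lucene 3.x API).
    public static Map<String, Double> collectionTermScores(IndexReader reader) throws IOException {
        Map<String, Double> scores = new HashMap<String, Double>();
        int numDocs = reader.numDocs();

        TermEnum termEnum = reader.terms();
        while (termEnum.next()) {
            Term term = termEnum.term();
            int df = reader.docFreq(term);
            // One common idf variant; the +1 avoids division by zero.
            double idf = Math.log((double) numDocs / (1 + df));

            double score = 0.0;
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                score += termDocs.freq() * idf;   // accumulate tf * idf across documents
            }
            termDocs.close();

            scores.put(term.text(), score);
        }
        termEnum.close();
        return scores;
    }

Sorting that map by value (highest first) would then give me the collection "keywords", but I am not sure summing tf*idf like this is the right aggregation, which is why I am asking.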

Thank you!!!

Comments (4)

无风消散 2024-09-18 09:37:24


"So, if I calculate tf-idf, it gives me the importance of a single term with respect to a single document."

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
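
For example, with the same 3.x API you are already using, one simple reading of that is to rank every term by its document frequency (just a sketch; it enumerates the terms of every field, so you would want to restrict it to your contents field and strip stop words first):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    // Sketch: terms of the index ordered by document frequency, highest first (Lucene 3.x API).
    public static List<Map.Entry<String, Integer>> termsByDocFreq(IndexReader reader) throws IOException {
        Map<String, Integer> df = new HashMap<String, Integer>();
        TermEnum termEnum = reader.terms();
        while (termEnum.next()) {
            df.put(termEnum.term().text(), termEnum.docFreq());   // docFreq of the current term
        }
        termEnum.close();

        List<Map.Entry<String, Integer>> ranked = new ArrayList<Map.Entry<String, Integer>>(df.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();   // descending document frequency
            }
        });
        return ranked;
    }

Note that without stop-word removal the very top of that list will mostly be stop words, so combining it with idf (or filtering a stop list) makes the result more meaningful.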

眼藏柔 2024-09-18 09:37:24


You can try opening the index with Luke; it shows you the top-ranked terms.

断念 2024-09-18 09:37:24


EDIT: I still do not get what you are trying to achieve.
A high TF/IDF value means that this term is useful for differentiating this document from the rest of the collection, that is: this term is relatively more frequent in the specific document than in the collection in general. Therefore it "represents" the document against the collection background. Is this what you want?

One possible way to rephrase your question is that you want to compress the collection using a few high-frequency terms. This means words that appear a lot in the collection, which you can find by taking words with a low idf.

Another alternative is that you want some concise way to represent the collection against a more general background, say a larger collection or the whole WWW. In that case, you want to compare word frequencies between the collections, consider the mutual information between the word type and the collection, or use other feature selection methods.
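
For example, one simple score of that kind (my own suggestion, assuming you can get word frequencies from some background corpus) is the pointwise mutual information between a word w and your collection C:

    PMI(w, C) = log( P(w | C) / P(w) )

where P(w | C) is the relative frequency of w inside your collection and P(w) is its relative frequency in the background corpus; words with a high PMI are over-represented in your collection and are good candidates for collection keywords.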

If I still miss your point, please say so.

朕就是辣么酷 2024-09-18 09:37:24


The contrib/ folder has a class to generate a list of the most frequent terms: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java

If you're instead looking for semantic feature extraction, you can check out http://project.carrot2.org/
