Lucene scoring problem

Published 2024-08-10 13:56:16


I have a problem with Lucene's scoring function that I can't figure out. So far, I've been able to write this code to reproduce it.

package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        // create=true: overwrite any existing index at this path (Lucene 2.x API)
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays
                .asList(new String[] { "the rolling stones",
                        "rolling stones (karaoke)",
                        "the rolling stones tribute",
                        "rolling stones tribute band",
                        "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                // TOKENIZED: the value is analyzed, so length norms apply
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());

        } finally {
            w.close();
        }

        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);

        // print score and stored name for the top 10 hits
        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}

The output I get from running it is:

finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186    the rolling stones
0.578186    rolling stones (karaoke)
0.578186    the rolling stones tribute
0.578186    rolling stones tribute band
0.578186    karaoke - the rolling stones

I just can't understand why "the rolling stones" gets the same relevance as "the rolling stones tribute". According to Lucene's documentation, the more tokens a field has, the smaller its normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".
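
For reference, one way to see exactly how Lucene arrives at these numbers is the explain() API; a minimal sketch, reusing the s, q and topdocs variables from the program above:

        for (ScoreDoc sd : topdocs.scoreDocs) {
            // Print the full score breakdown (tf, idf, fieldNorm, ...)
            // that Lucene computed for this hit.
            System.out.println(s.explain(q, sd.doc).toString());
        }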

Any ideas?

Comments (2)

把昨日还给我 2024-08-17 13:56:16


The length normalization factor is calculated as 1 / sqrt(numTerms) (you can see this in DefaultSimilarity).
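
Concretely, for the five names above (SimpleAnalyzer strips the punctuation), the raw factors would be 1 / sqrt(3) ≈ 0.577 for the three-token names and 1 / sqrt(4) = 0.5 for the four-token ones, so before encoding the norms do differ.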

This value is not stored in the index directly: it is first multiplied by the boost value for the field, and the final result is then encoded in 8 bits, as explained in Similarity.encodeNorm(). This is a lossy encoding, which means fine details get lost.
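
A quick way to see how much precision survives is to round-trip a few values through that encoding; a minimal sketch, assuming the static encodeNorm()/decodeNorm() methods of the Lucene 2.x Similarity class:

import org.apache.lucene.search.Similarity;

public class NormLossDemo {
    public static void main(String[] args) {
        // Raw length norms for a 3-token and a 4-token field.
        float norm3 = (float) (1.0 / Math.sqrt(3)); // ~0.577
        float norm4 = (float) (1.0 / Math.sqrt(4)); // 0.5
        // Encode to a single byte and decode again; printing both
        // shows what the lossy encoding actually keeps.
        System.out.println(Similarity.decodeNorm(Similarity.encodeNorm(norm3)));
        System.out.println(Similarity.decodeNorm(Similarity.encodeNorm(norm4)));
    }
}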

If you want to see length normalization in action, try creating a document with the following sentence:

the rolling stones tribute a b c d e f g h i j k 

This will create a large enough difference in the length normalization values for it to show up in the scores.

Now, if your fields have very few tokens, as in your examples, you could set boost values for the documents/fields based on your own formula (essentially a higher boost for shorter fields). Alternatively, you could create a custom Similarity and override the lengthNorm() method.
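
A minimal sketch of that second option, assuming the Lucene 2.x API (the class name and the steeper 1 / numTokens curve are illustrative choices, not the only ones):

import org.apache.lucene.search.DefaultSimilarity;

public class ShortFieldSimilarity extends DefaultSimilarity {
    // Replace the default 1/sqrt(numTokens) with a steeper curve so
    // that small differences in field length map to norm values far
    // enough apart to survive the one-byte encoding.
    public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f / numTokens;
    }
}

You would then register it on both sides, e.g. w.setSimilarity(new ShortFieldSimilarity()) before indexing and s.setSimilarity(new ShortFieldSimilarity()) before searching, since the norm is baked into the index at write time.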

§对你不离不弃 2024-08-17 13:56:16


I can reproduce it on Lucene 2.3.1 but do not know why this happens.
