Lucene: score calculation with PrefixQuery

Posted on 2024-09-06 15:50:17

I have a problem with score calculation for a PrefixQuery. To change each document's score, I call setBoost on the document when adding it to the index. I then search with a PrefixQuery, but the results do not change according to the boost. It seems setBoost has no effect at all for a PrefixQuery. Please check my code below:

 @Test
 public void testNormsDocBoost() throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common1", Field.Store.YES, Field.Index.ANALYZED);
    doc1.add(f1);
    doc1.setBoost(100);
    writer.addDocument(doc1);
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common2", Field.Store.YES, Field.Index.ANALYZED);
    doc2.add(f2);
    doc2.setBoost(200);
    writer.addDocument(doc2);
    Document doc3 = new Document();
    Field f3 = new Field("contents", "common3", Field.Store.YES, Field.Index.ANALYZED);
    doc3.add(f3);
    doc3.setBoost(300);
    writer.addDocument(doc3);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

    TopDocs docs = searcher.search(new PrefixQuery(new Term("contents", "common")), 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score + " "
                + searcher.doc(doc.doc).get("contents"));
    }
} 

The output is:

 docid : 0 score : 1.0 common1
 docid : 1 score : 1.0 common2
 docid : 2 score : 1.0 common3


Comments (3)

孤千羽 2024-09-13 15:50:17

By default, PrefixQuery rewrites itself into a ConstantScoreQuery, which gives every matching document a score of 1.0. I think this is done to make PrefixQuery faster. As a result, your document boosts are ignored.

If you want the boosts to take effect in your PrefixQuery, call setRewriteMethod() on your prefix query instance and pass the SCORING_BOOLEAN_QUERY_REWRITE constant. See http://lucene.apache.org/java/2_9_1/api/all/index.html.
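To make the difference between the two rewrite behaviours concrete, here is a toy model in plain Java. It is only a sketch and does not use Lucene: the class PrefixRewriteSketch, its methods, and the map standing in for the index are all made up for illustration, and the "scoring" variant simply multiplies by the stored boost, whereas real Lucene scores also involve tf, idf, query normalization, and a lossy norm encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: none of these types or methods are Lucene classes. */
final class PrefixRewriteSketch {

    /** Expands a prefix into the indexed terms that start with it, as a multi-term rewrite would. */
    static List<String> expand(TreeMap<String, Float> termToDocBoost, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailMap jumps to the first term >= prefix; terms sharing the prefix are contiguous from there.
        for (String term : termToDocBoost.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    /** Constant-score behaviour: every match gets the query boost; index-time boosts are ignored. */
    static Map<String, Float> constantScore(TreeMap<String, Float> index, String prefix, float queryBoost) {
        Map<String, Float> scores = new LinkedHashMap<>();
        for (String term : expand(index, prefix)) {
            scores.put(term, queryBoost);
        }
        return scores;
    }

    /** Scoring behaviour (simplified assumption): each match is scored per term, so its stored boost shows through. */
    static Map<String, Float> scoring(TreeMap<String, Float> index, String prefix, float queryBoost) {
        Map<String, Float> scores = new LinkedHashMap<>();
        for (String term : expand(index, prefix)) {
            scores.put(term, queryBoost * index.get(term));
        }
        return scores;
    }

    public static void main(String[] args) {
        // One term per document, mapped to that document's index-time boost (hypothetical data).
        TreeMap<String, Float> index = new TreeMap<>();
        index.put("common1", 100f);
        index.put("common2", 200f);
        index.put("common3", 300f);
        index.put("other", 1f);

        System.out.println("constant: " + constantScore(index, "common", 1.0f));
        System.out.println("scoring:  " + scoring(index, "common", 1.0f));
    }
}
```

In this model the constant-score variant gives all three "common" documents the same score, which mirrors the output in the question, while the scoring variant lets the per-document boost change the ranking.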

For debugging, you can use searcher.explain().

百合的盛世恋 2024-09-13 15:50:17

This is the expected behavior. Here is the explanation from Lucene's creator, Doug Cutting:

A PrefixQuery is equivalent to a query containing all the terms matching the
prefix, and is hence usually contains a lot of terms. With such a big
query, matching documents are likely to contain fewer of the query terms and
the match is thus weaker.

Read the original post that the quote is taken from.

With Lucene, it is generally better to use the score only as a relative measure of relevancy in a set of documents. The absolute value of the score will change depending on so many factors that it should not be used as is.

UPDATE
The explanation from Cutting refers to an older version of Lucene. Thus the answer from bajafresh4life is the correct one.

勿挽旧人 2024-09-13 15:50:17

Changing the Rewrite Method

Bajafresh4life suggested calling setRewriteMethod. However, that's not how you change this in Lucene.Net. Here's how to do it in C#:

By default, each PrefixQuery is returned by the NewPrefixQuery method of QueryParser like so:

protected internal virtual Query NewPrefixQuery(Term prefix)
{
    return new PrefixQuery(prefix) { RewriteMethod = multiTermRewriteMethod };
}

You can change this after instantiating your parser by setting the QueryParser.MultiTermRewriteMethod property, like so:

var parser = new QueryParser( Version.LUCENE_30, field, analyzer );
parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;

Note that this will change the behavior for other queries as well, not just the prefix query. To affect just the prefix query, you can subclass QueryParser and override NewPrefixQuery so that the constructor for the returned PrefixQuery uses the rewrite method of your choice.

Which Rewrite Method to Use

That doesn't seem to have fixed it for me, though. I actually had better luck using MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. The description for this rewrite method says:

Like SCORING_BOOLEAN_QUERY_REWRITE except scores are not computed. Instead, each matching document receives a constant score equal to the query's boost.

But that could be because I also subclassed PrefixQuery and overrode Rewrite to assign the scores I want as boosts.

After a fair amount of debugging, I eventually figured out why SCORING_BOOLEAN_QUERY_REWRITE did not work for me: the value returned by DefaultSimilarity.QueryNorm is passed to Weight.Normalize, which is called from Query.Weight, and that normalization was interfering with my scores.
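The normalization effect described above can be reproduced with plain arithmetic. The sketch below assumes that DefaultSimilarity's query norm is 1 / sqrt(sumOfSquaredWeights) and that each clause's raw weight is idf × boost, which is how I understand the classic scoring code. The class QueryNormDemo and the helper normalize are made up for illustration and are not part of Lucene or Lucene.Net.

```java
/** Plain-Java illustration of query normalization; not Lucene code. */
final class QueryNormDemo {

    /** Assumed formula of DefaultSimilarity.queryNorm: 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Scales each clause's raw weight (assumed to be idf * boost) by the shared query norm. */
    static float[] normalize(float[] rawWeights) {
        float sumOfSquares = 0f;
        for (float w : rawWeights) {
            sumOfSquares += w * w;
        }
        float norm = queryNorm(sumOfSquares);
        float[] normalized = new float[rawWeights.length];
        for (int i = 0; i < rawWeights.length; i++) {
            normalized[i] = rawWeights[i] * norm;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // A single clause: the boost is divided out again by the norm.
        System.out.println("single clause, weight 1:  " + normalize(new float[] {1f})[0]);
        System.out.println("single clause, weight 50: " + normalize(new float[] {50f})[0]);

        // Two clauses: only the ratio between the weights survives.
        float[] two = normalize(new float[] {3f, 4f});
        System.out.println("two clauses: " + two[0] + ", " + two[1]);
    }
}
```

Under these assumptions a lone clause always normalizes to roughly 1.0 whatever its boost, and with several clauses only the relative weights are preserved. That would be consistent with boosts appearing to vanish once the query norm is applied.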
