Lucene: Score calculation with a PrefixQuery
I have a problem with score calculation for a PrefixQuery. To change the score of each document, I call setBoost on the document when adding it to the index. I then search with a PrefixQuery, but the results do not change according to the boost. It seems setBoost has no effect at all on a PrefixQuery. Please check my code below:
@Test
public void testNormsDocBoost() throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);

    Document doc1 = new Document();
    Field f1 = new Field("contents", "common1", Field.Store.YES, Field.Index.ANALYZED);
    doc1.add(f1);
    doc1.setBoost(100);
    writer.addDocument(doc1);

    Document doc2 = new Document();
    Field f2 = new Field("contents", "common2", Field.Store.YES, Field.Index.ANALYZED);
    doc2.add(f2);
    doc2.setBoost(200);
    writer.addDocument(doc2);

    Document doc3 = new Document();
    Field f3 = new Field("contents", "common3", Field.Store.YES, Field.Index.ANALYZED);
    doc3.add(f3);
    doc3.setBoost(300);
    writer.addDocument(doc3);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new PrefixQuery(new Term("contents", "common")), 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score + " "
                + searcher.doc(doc.doc).get("contents"));
    }
}
The output is:
docid : 0 score : 1.0 common1
docid : 1 score : 1.0 common2
docid : 2 score : 1.0 common3
Answers (3)
By default, PrefixQuery rewrites itself into a constant-score query (ConstantScoreQuery), which gives every matching document the same score of 1.0. I think this is done to make PrefixQuery faster. As a result, your boosts are ignored.
If you want the boosts to take effect in your PrefixQuery, call setRewriteMethod() on your prefix query instance, passing the SCORING_BOOLEAN_QUERY_REWRITE constant. See http://lucene.apache.org/java/2_9_1/api/all/index.html.
For debugging, you can use searcher.explain().
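To make the difference between the two rewrite modes concrete, here is a toy model in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it assumes each document contains exactly one matching term with the same term weight. Real Lucene norms also fold in field length and are stored with reduced precision, so actual scores will not equal these numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: none of these names are Lucene classes. */
final class RewriteScoringSketch {

    /**
     * Constant-score rewrite: every match receives the query boost.
     * The index-time document boost is deliberately ignored here.
     */
    static float constantScore(float queryBoost, float docBoost) {
        return queryBoost;
    }

    /**
     * Scoring rewrite (simplified): the per-term weight is multiplied by
     * the norm, which is where the index-time document boost ends up.
     */
    static float scoringRewrite(float termWeight, float docBoost) {
        return termWeight * docBoost;
    }

    public static void main(String[] args) {
        // Document boosts taken from the question's test case.
        Map<String, Float> docBoosts = new LinkedHashMap<>();
        docBoosts.put("common1", 100f);
        docBoosts.put("common2", 200f);
        docBoosts.put("common3", 300f);

        for (Map.Entry<String, Float> e : docBoosts.entrySet()) {
            System.out.println(e.getKey()
                    + " constant=" + constantScore(1.0f, e.getValue())
                    + " scoring=" + scoringRewrite(1.0f, e.getValue()));
        }
    }
}
```

In Lucene itself, the switch between the two modes is the setRewriteMethod() call described above, and searcher.explain() shows which factors actually went into a given score.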
This is the expected behavior. Lucene's creator, Doug Cutting, has explained it; read the original post the quote is taken from.
With Lucene, it is generally better to use the score only as a relative measure of relevancy within a set of documents. The absolute value of the score depends on so many factors that it should not be used as is.
UPDATE
The explanation from Cutting refers to an older version of Lucene, so the answer from bajafresh4life is the correct one.
Changing the Rewrite Method
Bajafresh4life suggested calling setRewriteMethod. However, that's not how you change this in Lucene.Net. Here's how it works in C#.
By default, each PrefixQuery is returned by the NewPrefixQuery method of QueryParser. You can change the rewrite method after instantiating your parser by using the set accessor of the QueryParser.MultiTermRewriteMethod property.
Note that this changes the behavior of other queries as well, not just the prefix query. To affect only the prefix query, you can subclass QueryParser and override NewPrefixQuery so that the returned PrefixQuery is constructed with the rewrite method of your choice.
Which Rewrite Method to Use
That doesn't seem to have fixed it for me, though. I actually had better luck using MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. In the description for this method, it says [...]. But that could be because I also subclassed PrefixQuery and overrode Rewrite to assign the scores I want as boosts.
After a fair amount of debugging, I eventually figured out that, while I was trying to use SCORING_BOOLEAN_QUERY_REWRITE, DefaultSimilarity.QueryNorm was interfering with my scores when the value it returns is used in Weight.Normalize, which is called in Query.Weight.
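One plausible reading of that interference can be sketched in plain Java. In the Java version of Lucene, DefaultSimilarity.queryNorm returns 1 / sqrt(sumOfSquaredWeights), and each clause weight is multiplied by that value during normalization. The class below is a hypothetical stand-in, not Lucene code, and it assumes every clause's raw weight is just its boost (idf taken as 1).

```java
/** Hypothetical sketch of query normalization; not a Lucene class. */
final class QueryNormSketch {

    /** Same formula as DefaultSimilarity.queryNorm: 1 / sqrt(sumOfSquaredWeights). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /**
     * Returns each clause's weight after normalization.
     * Assumption: the raw weight of a clause equals its boost (idf = 1).
     */
    static double[] normalizedWeights(double[] boosts) {
        double sumOfSquares = 0.0;
        for (double boost : boosts) {
            sumOfSquares += boost * boost;
        }
        double norm = queryNorm(sumOfSquares);

        double[] normalized = new double[boosts.length];
        for (int i = 0; i < boosts.length; i++) {
            normalized[i] = boosts[i] * norm;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Boosts a subclass might assign to the rewritten term clauses.
        double[] boosts = {100.0, 200.0, 300.0};
        double[] normalized = normalizedWeights(boosts);
        for (int i = 0; i < boosts.length; i++) {
            System.out.println("boost=" + boosts[i] + " normalized=" + normalized[i]);
        }
    }
}
```

Under these assumptions the ratios between clauses survive, but the absolute values are rescaled, so a boost assigned as the intended score does not come back out as that score.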