Adjusting Lucene search result scores by weighting one specific field among fields with the same name
I'm currently using Lucene as our full-text search engine, but we need to order the search results according to a particular field.
For example, suppose we have the following three documents in our index, with exactly the same contents except for the id field.
val document01 = new Document()
val field0100 = new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED)
val field0101 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0102 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document01.add(field0100)
document01.add(field0101)
document01.add(field0102)
val document02 = new Document()
val field0200 = new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED)
val field0201 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0202 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document02.add(field0200)
document02.add(field0201)
document02.add(field0202)
val document03 = new Document()
val field0300 = new Field("id", "3", Field.Store.YES, Field.Index.ANALYZED)
val field0301 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0302 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document03.add(field0300)
document03.add(field0301)
document03.add(field0302)
Now, when I search for Linux using IndexSearcher, I get the following result:
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
When I search for Windows, I get the same result in the same order:
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
The question is: is it possible to weight a particular field when building the index? For example, I would like field0201 to get a higher score if it is matched during a search.
In other words, when I search for Linux, I would like to get the results in the following order:
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
And when I search for Windows, the results should keep the original order, like this:
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>>
I tried using field0201.setBoost(), but it changes the order of the search results both when I search for Linux and when I search for Windows.
Comments (1)
I think it should be possible if you put the data from different sources in fields with different names. You can set a boost at index time, but if you use the same name, I think the boost would apply to all fields with that name, based on the setBoost javadoc. So if you instead index the values under two differently named fields, content-high (carrying the boost) and content-low, and then query with
content-high:Linux content-low:Linux
(using a boolean query with two should clauses, both set to the term Linux), then the boost for content-high should increase the document score if the match is in that field. Use explain to see whether that works.
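The suggestion above could be sketched roughly as follows. This is only an illustration, assuming the Lucene 3.x API that the question's code appears to use (Field.Index.ANALYZED, Field.setBoost). The object name BoostedFieldSketch, the helpers buildDocument and buildQuery, and the boost value 2.0f are hypothetical and chosen for the example. The field names content-high and content-low come from the query in the answer.

```scala
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, TermQuery}

// Hypothetical sketch: names and the boost value are assumptions, not from the original post.
object BoostedFieldSketch {

  // Builds one document. When boostLinux is true, the Linux text is stored in
  // "content-high" with an index-time boost; otherwise it goes into "content-low".
  def buildDocument(id: String, boostLinux: Boolean): Document = {
    val doc = new Document()
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.ANALYZED))

    val linuxFieldName = if (boostLinux) "content-high" else "content-low"
    val linux = new Field(linuxFieldName, "This is a test: Linux",
      Field.Store.YES, Field.Index.ANALYZED)
    if (boostLinux) linux.setBoost(2.0f) // arbitrary example value
    doc.add(linux)

    // The Windows text always goes into the unboosted field.
    doc.add(new Field("content-low", "This is a test: Windows",
      Field.Store.YES, Field.Index.ANALYZED))
    doc
  }

  // Boolean query with two SHOULD clauses, one per field, for the same term.
  // The term is lowercased here on the assumption that the index-time
  // analyzer lowercases tokens (as StandardAnalyzer does).
  def buildQuery(word: String): BooleanQuery = {
    val term = word.toLowerCase
    val query = new BooleanQuery()
    query.add(new TermQuery(new Term("content-high", term)), BooleanClause.Occur.SHOULD)
    query.add(new TermQuery(new Term("content-low", term)), BooleanClause.Occur.SHOULD)
    query
  }

  def main(args: Array[String]): Unit = {
    // Only document 2 carries the boosted field, mirroring field0201 in the question.
    val docs = Seq(
      buildDocument("1", boostLinux = false),
      buildDocument("2", boostLinux = true),
      buildDocument("3", boostLinux = false))
    println("Built " + docs.size + " documents; query: " + buildQuery("Linux"))
  }
}
```

The documents would still need to be added through an IndexWriter and searched with an IndexSearcher as in the question. Other scoring factors, such as field length normalization and term frequency across documents, also influence the final order, so it is worth checking the output of IndexSearcher.explain(query, docId) for each hit before relying on the result.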