Lucene 2.9.2: How to show results in random order?
By default, Lucene returns the query results in the order of relevance (score).
You can pass a sort field (or multiple), then the results get sorted by that field.
I am looking now for a nice solution to get the search results in random order.
The bad approach:
Of course I could take ALL results and then shuffle the collection, but with 5 million search results that does not perform well.
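For concreteness, a minimal sketch of that naive approach in plain Java. It assumes the matching document IDs have already been collected into a list; the class and method names (`NaiveShuffle`, `page`) are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only, not Lucene API.
class NaiveShuffle {

    // Shuffles a copy of ALL hits with a seeded Random, then cuts out one page.
    // The same seed and the same input order always give the same permutation.
    static List<Integer> page(List<Integer> allDocIds, long seed, int page, int pageSize) {
        List<Integer> shuffled = new ArrayList<>(allDocIds);
        Collections.shuffle(shuffled, new Random(seed));
        int from = Math.min(page * pageSize, shuffled.size());
        int to = Math.min(from + pageSize, shuffled.size());
        return new ArrayList<>(shuffled.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            hits.add(i);
        }
        System.out.println(page(hits, 42L, 0, 10)); // first page
        System.out.println(page(hits, 42L, 1, 10)); // second page, same permutation
    }
}
```

Every page request copies and shuffles the complete hit list, so the cost grows with the total number of hits rather than with the page size.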
The elegant paged approach:
With this approach you would be able to tell Lucene the following:
a) Give me results 1 to 10 out of 5 million results in random order.
b) Then give me 11 to 20 (based on the same random sequence used in a).
c) Just to clarify: If you call a) twice you get the same random elements.
How can you implement this approach?
Update, Jul 27 2012: Be aware that the solution described here for Lucene 2.9.x does not work properly. Using the RandomOrderScoreDocComparator causes certain results to appear twice in the result list.
Comments (3)
You could write a custom FieldComparator:
This doesn't consume any I/O when shuffling the results. Here is my sample program that demonstrates how you use this:
It yields this output:
Bonus! For those of you trapped in Lucene 2, all you have to change is the Sort object:
Here's my solution, which so far has proven to avoid duplicate results. It's in C# (for Lucene.Net), but I started from the Java sample, so it should be easy to convert back.
The trick I used was a unique ID per search that stays unchanged while the user clicks through the pagination. I already had this unique ID for reporting purposes. I initialize it when the user clicks the search button and then pass it, encrypted, in the query string.
It's finally passed as the seed parameter to RandomOrderFieldComparatorSource (actually it's ID.GetHashCode()).
What this means is that I use the same series of random numbers for the same search, so each docId gets the same sorting index even when users navigate to other result pages.
One more note: slots could probably be a vector whose length equals the page size.
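The idea of one seed per search can be sketched in plain Java without Lucene. This is not the comparator from the answer: instead of drawing from a seeded random series, the sketch swaps in a hash of (seed, docId) using a SplitMix64-style bit mixer, which is my own assumption. `SeededRandomPaging`, `sortKey` and `page` are hypothetical names. The point it illustrates is that a document's sort key depends only on the search ID and the doc ID, so pages cannot overlap and repeated requests give the same order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only, not Lucene or Lucene.Net API.
class SeededRandomPaging {

    // Mixes the per-search seed with the doc ID (SplitMix64-style finalizer).
    // The key depends only on (seed, docId), never on visiting order.
    static long sortKey(long seed, int docId) {
        long z = seed + 0x9E3779B97F4A7C15L * (docId + 1L);
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Returns one page of doc IDs in seeded order. Only the best
    // (page + 1) * pageSize candidates are retained while scanning the hits.
    static List<Integer> page(int[] matchingDocIds, String searchId, int page, int pageSize) {
        long seed = searchId.hashCode();          // mirrors the ID.GetHashCode() idea
        int needed = (page + 1) * pageSize;
        Comparator<Integer> byKey = Comparator
                .comparingLong((Integer d) -> sortKey(seed, d))
                .thenComparingInt(d -> d);        // tie-break keeps the order total
        // Reversed comparator: the heap head is the worst candidate kept so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byKey.reversed());
        for (int docId : matchingDocIds) {
            heap.add(docId);
            if (heap.size() > needed) {
                heap.poll();                      // evict the candidate with the largest key
            }
        }
        List<Integer> best = new ArrayList<>(heap);
        best.sort(byKey);
        int from = Math.min(page * pageSize, best.size());
        return new ArrayList<>(best.subList(from, best.size()));
    }

    public static void main(String[] args) {
        int[] hits = new int[1000];
        for (int i = 0; i < hits.length; i++) {
            hits[i] = i;
        }
        System.out.println(page(hits, "search-42", 0, 10)); // results 1 to 10
        System.out.println(page(hits, "search-42", 1, 10)); // results 11 to 20
    }
}
```

The bounded heap here plays a role loosely comparable to the slots mentioned above, although it grows with the page number rather than staying at one page size.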
Here is an updated version for Lucene.Net 4.8.