Lucene 2.9.2:如何以随机顺序显示结果?

发布于 2024-12-01 10:17:20 字数 595 浏览 0 评论 0原文

默认情况下,Lucene按照相关性(分数)的顺序返回查询结果。 您可以传递一个(或多个)排序字段,然后结果将按该字段排序。

我现在正在寻找一个很好的解决方案来以随机顺序获取搜索结果。

糟糕的方法:
当然,我可以获取所有结果,然后对集合进行洗牌,但如果有 5 个 Mio 搜索结果,则效果不佳。

优雅的分页方法:
通过这种方法,您将能够告诉 Lucene 以下内容:
a) 按随机顺序从 5Mio 结果中给我结果 1 到 10
b) 然后给我 11 到 20(基于 a 中使用的相同随机序列)。
c) 只是为了澄清:如果你调用 a) 两次,你会得到相同的随机元素。

你如何实施这种方法?


2012 年 7 月 27 日更新:请注意,此处描述的针对 Lucene 2.9.x 的解决方案无法正常工作。使用RandomOrderScoreDocComparator将导致某些结果在结果列表中出现两次

By default, Lucene returns the query results in the order of relevance (score).
You can pass a sort field (or multiple), then the results get sorted by that field.

I am looking now for a nice solution to get the search results in random order.

The bad approach:
Of course I could take ALL results and then shuffle the collection, but in case of 5 Mio search results, that's not performing well.

The elegant paged approach:
With this approach you would be able to tell Lucene the following:
a) Give me results 1 to 10 out of 5Mio results in random order
b) Then give me 11 to 20 (based on the same random sequence used in a).
c) Just to clarify: If you call a) twice you get the same random elements.

How can you implement this approach??


Update Jul27 2012: Be aware that the solution described here for Lucene 2.9.x is not working properly. Using the RandomOrderScoreDocComparator will result in having certain results twice in the resulting list.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

岁月静好 2024-12-08 10:17:20

您可以编写一个自定义 FieldComparator

public class RandomOrderFieldComparator extends FieldComparator<Integer> {

    private final Random random = new Random();

    @Override
    public int compare(int slot1, int slot2) {
        return random.nextInt();
    }

    @Override
    public int compareBottom(int doc) throws IOException {
        return random.nextInt();
    }

    @Override
    public void copy(int slot, int doc) throws IOException {
    }

    @Override
    public void setBottom(int bottom) {
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
    }

    @Override
    public Integer value(int slot) {
        return random.nextInt();
    }

}

这在打乱结果时不会消耗任何 I/O。这是我的示例程序,演示了如何使用它:

public static void main(String... args) throws Exception {
    RAMDirectory directory = new RAMDirectory();

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

    IndexWriter writer = new IndexWriter(
            directory,
            new IndexWriterConfig(Version.LUCENE_33, analyzer).setOpenMode(OpenMode.CREATE_OR_APPEND)
        );

    Document alice = new Document();
    alice.add( new Field("name", "Alice", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( alice );

    Document bob = new Document();
    bob.add( new Field("name", "Bob", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( bob );

    Document chris = new Document();
    chris.add( new Field("name", "Chris", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( chris );

    writer.close();


    IndexSearcher searcher = new IndexSearcher( directory );



    for (int pass = 1; pass <= 10; pass++) {
        Query query = new MatchAllDocsQuery();

        Sort sort = new Sort(
                new SortField(
                        "",
                        new FieldComparatorSource() {

                            @Override
                            public FieldComparator<Integer> newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
                                return new RandomOrderFieldComparator();
                            }

                        }
                    )
            );

        TopFieldDocs topFieldDocs = searcher.search( query, 10, sort );

        System.out.print("Pass #" + pass + ":");
        for (int i = 0; i < topFieldDocs.totalHits; i++) {
            System.out.print( " " + topFieldDocs.scoreDocs[i].doc );
        }
        System.out.println();
    }
}

它产生以下输出:

Pass #1: 1 0 2
Pass #2: 1 0 2
Pass #3: 0 1 2
Pass #4: 0 1 2
Pass #5: 0 1 2
Pass #6: 1 0 2
Pass #7: 0 2 1
Pass #8: 1 2 0
Pass #9: 2 0 1
Pass #10: 0 2 1

奖金!对于那些陷入 Lucene 2 困境的人来说,

public class RandomOrderScoreDocComparator implements ScoreDocComparator {

    private final Random random = new Random();

    public int compare(ScoreDoc i, ScoreDoc j) {
        return random.nextInt();
    }

    public Comparable<?> sortValue(ScoreDoc i) {
        return Integer.valueOf( random.nextInt() );
    }

    public int sortType() {
        return SortField.CUSTOM;
    }

}

您需要更改的是 Sort 对象:

Sort sort = new Sort(
    new SortField(
        "",
        new SortComparatorSource() {
            public ScoreDocComparator newComparator(IndexReader reader, String fieldName) throws IOException {
                return new RandomOrderScoreDocComparator();
            }
        }
    )
);

You could write a custom FieldComparator:

public class RandomOrderFieldComparator extends FieldComparator<Integer> {

    private final Random random = new Random();

    @Override
    public int compare(int slot1, int slot2) {
        return random.nextInt();
    }

    @Override
    public int compareBottom(int doc) throws IOException {
        return random.nextInt();
    }

    @Override
    public void copy(int slot, int doc) throws IOException {
    }

    @Override
    public void setBottom(int bottom) {
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
    }

    @Override
    public Integer value(int slot) {
        return random.nextInt();
    }

}

This doesn't consume any I/O when shuffling the results. Here is my sample program that demonstrates how you use this:

public static void main(String... args) throws Exception {
    RAMDirectory directory = new RAMDirectory();

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

    IndexWriter writer = new IndexWriter(
            directory,
            new IndexWriterConfig(Version.LUCENE_33, analyzer).setOpenMode(OpenMode.CREATE_OR_APPEND)
        );

    Document alice = new Document();
    alice.add( new Field("name", "Alice", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( alice );

    Document bob = new Document();
    bob.add( new Field("name", "Bob", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( bob );

    Document chris = new Document();
    chris.add( new Field("name", "Chris", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( chris );

    writer.close();


    IndexSearcher searcher = new IndexSearcher( directory );



    for (int pass = 1; pass <= 10; pass++) {
        Query query = new MatchAllDocsQuery();

        Sort sort = new Sort(
                new SortField(
                        "",
                        new FieldComparatorSource() {

                            @Override
                            public FieldComparator<Integer> newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
                                return new RandomOrderFieldComparator();
                            }

                        }
                    )
            );

        TopFieldDocs topFieldDocs = searcher.search( query, 10, sort );

        System.out.print("Pass #" + pass + ":");
        for (int i = 0; i < topFieldDocs.totalHits; i++) {
            System.out.print( " " + topFieldDocs.scoreDocs[i].doc );
        }
        System.out.println();
    }
}

It yields up this output:

Pass #1: 1 0 2
Pass #2: 1 0 2
Pass #3: 0 1 2
Pass #4: 0 1 2
Pass #5: 0 1 2
Pass #6: 1 0 2
Pass #7: 0 2 1
Pass #8: 1 2 0
Pass #9: 2 0 1
Pass #10: 0 2 1

Bonus! For those of you trapped in Lucene 2

public class RandomOrderScoreDocComparator implements ScoreDocComparator {

    private final Random random = new Random();

    public int compare(ScoreDoc i, ScoreDoc j) {
        return random.nextInt();
    }

    public Comparable<?> sortValue(ScoreDoc i) {
        return Integer.valueOf( random.nextInt() );
    }

    public int sortType() {
        return SortField.CUSTOM;
    }

}

All you have to change is the Sort object:

Sort sort = new Sort(
    new SortField(
        "",
        new SortComparatorSource() {
            public ScoreDocComparator newComparator(IndexReader reader, String fieldName) throws IOException {
                return new RandomOrderScoreDocComparator();
            }
        }
    )
);
终陌 2024-12-08 10:17:20

这是我的解决方案,到目前为止已被证明可以避免重复的结果。它是用 C#(用于 Lucene.Net)编写的,但我是从 Java 示例开始的,所以它应该很容易转换回来。

我使用的技巧是每次搜索都有一个唯一的 ID,当用户单击分页时该 ID 保持不变。我已经有了这个用于报告目的的唯一 ID。当用户单击搜索按钮时,我会初始化它,然后在查询字符串中加密传递它。
它最终作为种子参数传递到 RandomOrderFieldComparatorSource 中(实际上是 ID.GetHashCode())。

这意味着我使用相同系列的随机数进行相同的搜索,因此即使用户导航到其他结果页面,每个 docId 也会获得相同的排序索引。

还要注意的是,slots 可能是一个等于页面大小的向量。

public class RandomOrderFieldComparator : FieldComparator
{
    Random random;
    List<int> slots = new List<int>();
    Dictionary<int, int> docSortIndeces = new Dictionary<int, int>();

    public RandomOrderFieldComparator(int? seed)
    {
        random = seed == null ? new Random() : new Random(seed.Value);
    }

    private int bottom; // Value of bottom of queue

    public override int Compare(int slot1, int slot2)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return values[slot1] - values[slot2] because that
        // may overflow
        int v1 = slots[slot1];
        int v2 = slots[slot2];
        if (v1 > v2) {
            return 1;
        } else if (v1 < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override int CompareBottom(int doc)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return bottom - values[slot2] because that
        // may overflow
        int v2 = GetDocSortingIndex(doc);
        if (bottom > v2) {
            return 1;
        } else if (bottom < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override void Copy(int slot, int doc)
    {
        if (slots.Count > slot) {
            slots[slot] = GetDocSortingIndex(doc);
        } else {
            slots.Add(GetDocSortingIndex(doc));
        }
        //slots[slot] = GetDocSortingIndex(doc);
    }

    public override void SetBottom(int bottom)
    {
        this.bottom = slots[bottom];
    }

    public override System.IComparable Value(int slot)
    {
        return (System.Int32)slots[slot];
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
    }

    int GetDocSortingIndex(int docId)
    {
        if (!docSortIndeces.ContainsKey(docId))
            docSortIndeces[docId] = random.Next(int.MinValue, int.MaxValue);

        return docSortIndeces[docId];
    }
}

Here's my solution that so far has proven to avoid duplicate results. It's in C# (for Lucene.Net) but I've started it from the Java sample so it should be easily converted back.

The trick I used was a unique ID per search that stays unchanged while the user clicks on pagination. I already had this unique ID that I used for reporting purposes. I initialize it when the user clicks the search button and then I pass it encrypted in the query string.
It's finally passed as the seed parameter in RandomOrderFieldComparatorSource (actually it's ID.GetHashCode()).

What this means is I use the same series of random numbers for same search, so each docId gets the same sorting index even when users navigate to other result pages.

One more note, slots could probably be a vector equal to pagesize.

public class RandomOrderFieldComparator : FieldComparator
{
    Random random;
    List<int> slots = new List<int>();
    Dictionary<int, int> docSortIndeces = new Dictionary<int, int>();

    public RandomOrderFieldComparator(int? seed)
    {
        random = seed == null ? new Random() : new Random(seed.Value);
    }

    private int bottom; // Value of bottom of queue

    public override int Compare(int slot1, int slot2)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return values[slot1] - values[slot2] because that
        // may overflow
        int v1 = slots[slot1];
        int v2 = slots[slot2];
        if (v1 > v2) {
            return 1;
        } else if (v1 < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override int CompareBottom(int doc)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return bottom - values[slot2] because that
        // may overflow
        int v2 = GetDocSortingIndex(doc);
        if (bottom > v2) {
            return 1;
        } else if (bottom < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override void Copy(int slot, int doc)
    {
        if (slots.Count > slot) {
            slots[slot] = GetDocSortingIndex(doc);
        } else {
            slots.Add(GetDocSortingIndex(doc));
        }
        //slots[slot] = GetDocSortingIndex(doc);
    }

    public override void SetBottom(int bottom)
    {
        this.bottom = slots[bottom];
    }

    public override System.IComparable Value(int slot)
    {
        return (System.Int32)slots[slot];
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
    }

    int GetDocSortingIndex(int docId)
    {
        if (!docSortIndeces.ContainsKey(docId))
            docSortIndeces[docId] = random.Next(int.MinValue, int.MaxValue);

        return docSortIndeces[docId];
    }
}
可遇━不可求 2024-12-08 10:17:20

这是 Lucene.Net 4.8 的更新版本。

class RandomFieldComparatorSource : FieldComparerSource
    {
        public override FieldComparer NewComparer(string fieldname, int numHits, int sortPos, bool reversed)
        {
            return new RandomOrderFieldComparator(fieldname, null);
        }
    }

    class RandomOrderFieldComparator : NumericComparer<int>
    {
        static Random random = new Random();

        public RandomOrderFieldComparator(string field, int? missingValue) : base(field, missingValue)
        {
        }

        public override int Compare(int slot1, int slot2)
        {
            return random.Next(-1, 2);
        }
        public override int CompareBottom(int doc)
        {
            return random.Next(-1, 2);
        }
        public override void Copy(int slot, int doc)
        {
        }

        public override void SetBottom(int slot)
        {

        }

        public override void SetTopValue(object value)
        {

        }

        public override int CompareTop(int doc)
        {
            return random.Next(-1,2);
        }

        public override IComparable this[int slot] => random.Next();
    }

Here is an updated version for Lucene.Net 4.8.

class RandomFieldComparatorSource : FieldComparerSource
    {
        public override FieldComparer NewComparer(string fieldname, int numHits, int sortPos, bool reversed)
        {
            return new RandomOrderFieldComparator(fieldname, null);
        }
    }

    class RandomOrderFieldComparator : NumericComparer<int>
    {
        static Random random = new Random();

        public RandomOrderFieldComparator(string field, int? missingValue) : base(field, missingValue)
        {
        }

        public override int Compare(int slot1, int slot2)
        {
            return random.Next(-1, 2);
        }
        public override int CompareBottom(int doc)
        {
            return random.Next(-1, 2);
        }
        public override void Copy(int slot, int doc)
        {
        }

        public override void SetBottom(int slot)
        {

        }

        public override void SetTopValue(object value)
        {

        }

        public override int CompareTop(int doc)
        {
            return random.Next(-1,2);
        }

        public override IComparable this[int slot] => random.Next();
    }
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文