Counting results per category with Lucene

Posted 2024-07-06 13:32:50

I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is a field called 'Category'. A product can fall in multiple categories. Currently, I use FilteredQuery to search for the same search term with every category to get the number of results per category.

This results in 20-30 internal search calls per query to display the results. This is slowing down the search considerably. Is there a faster way of achieving the same result using Lucene?
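
For reference, the per-category counting described here looks roughly like this (a minimal sketch; 'Category' is the field from the question, while countPerCategory and the array of category names are hypothetical):

public int[] countPerCategory(IndexSearcher searcher, Query userQuery,
                              String[] categories) throws IOException {
    int[] counts = new int[categories.length];
    for (int i = 0; i < categories.length; i++) {
        // One full search per category: these are the 20-30 calls per query.
        Filter filter = new QueryFilter(
                new TermQuery(new Term("Category", categories[i])));
        Hits hits = searcher.search(new FilteredQuery(userQuery, filter));
        counts[i] = hits.length();
    }
    return counts;
}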

5 Answers

写下不归期 2024-07-13 13:32:50

Here's what I did, though it's a bit heavy on memory:

What you need is to create, in advance, a bunch of BitSets, one for each category, each containing the doc ids of all the documents in that category. Then, at search time, you use a HitCollector and check the doc ids against the BitSets.

Here's the code to create the bit sets:

public BitSet[] getBitSets(IndexSearcher indexSearcher,
                           Category[] categories) throws IOException {
    BitSet[] bitSets = new BitSet[categories.length];
    for (int i = 0; i < categories.length; i++) {
        Query query = categories[i].getQuery();
        final BitSet bitSet = new BitSet();
        // Run the category query once and record every matching doc id.
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}

This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.
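
For instance, if each category is a single term in the 'Category' field, a sketch of the TermDocs variant might look like this (getBitSetsViaTermDocs and the category-name array are hypothetical; it fills the same BitSets straight from the postings, without scoring):

public BitSet[] getBitSetsViaTermDocs(IndexReader reader, String[] categoryNames)
        throws IOException {
    BitSet[] bitSets = new BitSet[categoryNames.length];
    TermDocs td = reader.termDocs();
    try {
        for (int i = 0; i < categoryNames.length; i++) {
            BitSet bitSet = new BitSet(reader.maxDoc());
            // Walk the postings list of this category term directly.
            td.seek(new Term("Category", categoryNames[i]));
            while (td.next()) {
                bitSet.set(td.doc());
            }
            bitSets[i] = bitSet;
        }
    } finally {
        td.close();
    }
    return bitSets;
}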

Now, when it's time to count categories of search results you do this:

public int[] getCategoryCount(IndexSearcher indexSearcher,
                              Query query,
                              final BitSet[] bitSets) throws IOException {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            // For each hit, bump the counter of every category it belongs to.
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}

What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.
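
A minimal sketch of that combination (searchWithCounts and its parameters are hypothetical; TopDocCollector is the stock Lucene 2.3 collector):

public TopDocs searchWithCounts(IndexSearcher searcher, Query query,
                                final BitSet[] bitSets, final int[] counts,
                                int numHits) throws IOException {
    final TopDocCollector top = new TopDocCollector(numHits);
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            top.collect(doc, score);          // collect the top hits...
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) {    // ...and tally the categories
                    counts[i]++;
                }
            }
        }
    });
    return top.topDocs();
}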

萌面超妹 2024-07-13 13:32:50

I don't have enough reputation to comment (!) but in Matt Quail's answer I'm pretty sure you could replace this:

int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}

with this:

int numDocs = terms.docFreq();

and then get rid of the td variable altogether. This should make it even faster.
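
The loop body then collapses to a dictionary lookup, as in this sketch (one caveat: docFreq() is read from the term dictionary, so it still counts documents that are marked deleted but not yet merged away, whereas the TermDocs loop skips them):

do {
    Term currentTerm = terms.term();
    if (currentTerm == null || !currentTerm.field().equals("Category")) {
        break;
    }
    // docFreq() comes straight from the term dictionary; no postings walk.
    System.out.println(currentTerm.text() + " --> " + terms.docFreq());
} while (terms.next());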

像极了他 2024-07-13 13:32:50

You may want to consider looking through all the documents that match categories using a TermDocs iterator.

This example code goes through each "Category" term, and then counts the number of documents that match that term.

public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;

    try {
        // Position the enum at the first term in the "Category" field.
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
            Term currentTerm = terms.term();

            // Stop when the terms run out or we leave the "Category" field.
            if (currentTerm == null || !currentTerm.field().equals("Category")) {
                break;
            }

            // Count the documents in this term's postings list.
            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }

            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}

This code should run reasonably fast even for large indexes.

Here is some code that tests that method:

public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();

    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();

    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));

    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    w.addDocument(d);
}

你是暖光i 2024-07-13 13:32:50

Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try SOLR, which has faceting as a major and convenient feature.
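
For illustration, a faceted request against a Solr index of this catalog would look roughly like this (hypothetical host, core, and query term; facet=true and facet.field are Solr's standard faceting parameters):

http://localhost:8983/solr/select?q=ipod&facet=true&facet.field=Category

The response then includes a facet_counts section with a per-category count alongside the normal result list, which is exactly the 20-30 numbers the question asks for, computed in one call.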

歌入人心 2024-07-13 13:32:50

So let me see if I understand the question correctly: Given a query from the user, you want to show how many matches there are for the query in each category. Correct?

Think of it like this: your query is effectively originalQuery AND (category1 OR category2 OR ...), except that, in addition to the overall score, you want a count for each of the categories. Unfortunately the interface for collecting hits in Lucene is very narrow, only giving you an overall score for a query. But you could implement a custom Scorer/Collector.

Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.
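
Short of a full custom Scorer, a minimal sketch of that Map bookkeeping could reuse per-category BitSets (as in the first answer) inside a single HitCollector pass; countByCategory and categoryBits are hypothetical names:

public Map<String, Long> countByCategory(IndexSearcher searcher, Query query,
                                         final Map<String, BitSet> categoryBits)
        throws IOException {
    final Map<String, Long> counts = new HashMap<String, Long>();
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            // Check each hit against every category's cached doc-id set.
            for (Map.Entry<String, BitSet> e : categoryBits.entrySet()) {
                if (e.getValue().get(doc)) {
                    Long c = counts.get(e.getKey());
                    counts.put(e.getKey(), c == null ? 1L : c + 1L);
                }
            }
        }
    });
    return counts;
}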
