Lucene SpanNearQuery partial match

Posted on 2024-08-17 06:27:55

Given a document {'foo', 'bar', 'baz'}, I want to match it using a SpanNearQuery with the tokens {'baz', 'extra'}.

But this fails.

How do I get around this?

Sample test (using Lucene 2.9.1) with the following results:

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

给我一枪 2024-08-24 06:27:55

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5
positions of doug, with doug following
lucene (order matters) – you could use
the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

(source: lucidimagination.com)

In this sample text, Lucene is within 3 of Doug.
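To make the "within N positions, order matters" idea concrete, here is a small plain-Java model of an ordered two-term proximity check. It is only a sketch of the semantics and does not use Lucene: the class name `OrderedNearSketch`, the method `orderedNear`, and the token list are all invented for illustration, and slop is modelled simply as the number of tokens allowed between the two terms.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of an ordered two-term "near" check. Not Lucene code.
final class OrderedNearSketch {

    // True if `first` occurs before `second` with at most `slop`
    // other tokens between them.
    static boolean orderedNear(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical token stream, invented for this sketch.
        List<String> tokens =
                Arrays.asList("lucene", "was", "created", "by", "doug", "cutting");

        System.out.println(orderedNear(tokens, "lucene", "doug", 5)); // true: 3 tokens in between
        System.out.println(orderedNear(tokens, "doug", "lucene", 5)); // false: wrong order
        System.out.println(orderedNear(tokens, "lucene", "doug", 2)); // false: 3 tokens > slop 2
    }
}
```

With the last constructor argument set to false, the real query would also accept the reversed order; this sketch only covers the ordered case shown in the quoted example.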

But for your example, the only match I can see is that both your query and the target document have "cd" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.

Edit 3 - in response to latest comment, the answer is that you cannot use SpanNearQuery to do anything other than that which it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post it), but in the last case if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.
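The difference between "all terms near each other" and "one or more terms present" can be illustrated with a plain-Java model. This is a sketch under stated assumptions, not Lucene's implementation: `UnorderedNearSketch`, `unorderedNear` and `anyTerm` are hypothetical names, the query terms are assumed to be distinct, and slop is modelled as the number of extra positions in the smallest window that covers every term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of unordered "near" matching versus "any term" matching.
// Not Lucene code.
final class UnorderedNearSketch {

    // True only if EVERY query term occurs in the document and some window
    // containing all of them has at most `slop` extra positions.
    static boolean unorderedNear(List<String> doc, List<String> terms, int slop) {
        Set<String> wanted = new HashSet<String>(terms);
        for (int start = 0; start < doc.size(); start++) {
            Set<String> seen = new HashSet<String>();
            for (int end = start; end < doc.size(); end++) {
                if (wanted.contains(doc.get(end))) {
                    seen.add(doc.get(end));
                }
                if (seen.size() == wanted.size()) {
                    int extra = (end - start + 1) - wanted.size();
                    if (extra <= slop) {
                        return true;
                    }
                    break; // wider windows from this start only get worse
                }
            }
        }
        return false;
    }

    // True if at least one query term occurs in the document
    // (roughly what a BooleanQuery made of SHOULD clauses asks for).
    static boolean anyTerm(List<String> doc, List<String> terms) {
        for (String term : terms) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("foo", "bar", "baz");
        int slop = Integer.MAX_VALUE;

        System.out.println(unorderedNear(doc, Arrays.asList("foo"), slop));               // true
        System.out.println(unorderedNear(doc, Arrays.asList("foo", "bar", "baz"), slop)); // true
        System.out.println(unorderedNear(doc, Arrays.asList("baz", "extra"), slop));      // false
        System.out.println(anyTerm(doc, Arrays.asList("baz", "extra")));                  // true
    }
}
```

Even with an unlimited slop, the near check fails as soon as one term is absent from the document, which mirrors the failing test in the question.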

Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want as well as the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

(as an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the distance and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's ok, because in the next section I show you how to build this using the API).

Programmatically, you would construct this as follows:

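// Imports this snippet relies on (Lucene 2.9.x package names)
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declared as BooleanQuery rather than Query, because add(...) is defined on BooleanQuery
BooleanQuery top = new BooleanQuery();

// Construct the terms since they will be used more than once.
// Note: TermQuery and SpanTermQuery are not analyzed, so the term text must be in its
// indexed form (StandardAnalyzer lowercases tokens, e.g. "baz" rather than "BAZ").
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm),
                                                new SpanTermQuery(extraTerm) },
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);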
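The effect of combining optional term clauses with a boosted proximity clause can be sketched with a toy scoring model in plain Java. Everything here is an assumption made for illustration: `PartialMatchBoostSketch` and its methods are hypothetical, each matching term contributes exactly 1 point, and the proximity clause adds its boost as a flat bonus. Lucene's real scoring formula is different, so only the relative ordering of the scores is meant to carry over.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of "term SHOULD clauses plus a boosted ordered proximity clause".
// Not Lucene code and not Lucene's scoring formula.
final class PartialMatchBoostSketch {

    // True if `first` occurs before `second` with at most `slop` tokens between them.
    static boolean orderedNear(List<String> doc, String first, String second, int slop) {
        for (int i = 0; i < doc.size(); i++) {
            if (!doc.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < doc.size(); j++) {
                if (doc.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // One point per optional term that is present, plus `boost` when the
    // proximity clause also matches. A score of 0 means "no match".
    static double score(List<String> doc, String a, String b, int slop, double boost) {
        double score = 0.0;
        if (doc.contains(a)) {
            score += 1.0;
        }
        if (doc.contains(b)) {
            score += 1.0;
        }
        if (orderedNear(doc, a, b, slop)) {
            score += boost;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score(Arrays.asList("foo", "bar", "baz"), "baz", "extra", 100, 5.0)); // 1.0
        System.out.println(score(Arrays.asList("baz", "extra"), "baz", "extra", 100, 5.0));      // 7.0
        System.out.println(score(Arrays.asList("extra", "baz"), "baz", "extra", 100, 5.0));      // 2.0
        System.out.println(score(Arrays.asList("foo"), "baz", "extra", 100, 5.0));               // 0.0
    }
}
```

The document from the question, {'foo', 'bar', 'baz'}, now gets a non-zero score from the single matching term, while a document that has both terms in order and close together ranks well above it.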

Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.
