Lucene SpanNearQuery 部分匹配
给定文档 {'foo', 'bar', 'baz'},我想使用 SpanNearQuery 与标记 {'baz', 'extra'} 进行匹配,
但这失败了。
我该如何解决这个问题?
示例测试(使用 lucene 2.9.1)具有以下结果:
- 给定单匹配 - 通过
- 给定两个匹配 -
- 通过给定三匹配 - 通过
- 给定单匹配_和额外术语 - 失败
...
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
public class SpanNearQueryTest {
private RAMDirectory directory = null;
private static final String BAZ = "baz";
private static final String BAR = "bar";
private static final String FOO = "foo";
private static final String TERM_FIELD = "text";
@Before
public void given() throws IOException {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(
directory,
new StandardAnalyzer(Version.LUCENE_29),
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.commit();
writer.optimize();
writer.close();
}
@After
public void cleanup() {
directory.close();
}
@Test
public void givenSingleMatch() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenTwoMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenThreeMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR)),
new SpanTermQuery(new Term(TERM_FIELD, BAZ))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenSingleMatch_andExtraTerm() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
},
Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
}
Given a document {'foo', 'bar', 'baz'}, I want to match using SpanNearQuery with the tokens {'baz', 'extra'}
But this fails.
How do I go around this?
Sample test (using lucene 2.9.1) with the following results:
- givenSingleMatch - PASS
- givenTwoMatches - PASS
- givenThreeMatches - PASS
- givenSingleMatch_andExtraTerm - FAIL
...
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
public class SpanNearQueryTest {
private RAMDirectory directory = null;
private static final String BAZ = "baz";
private static final String BAR = "bar";
private static final String FOO = "foo";
private static final String TERM_FIELD = "text";
@Before
public void given() throws IOException {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(
directory,
new StandardAnalyzer(Version.LUCENE_29),
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.commit();
writer.optimize();
writer.close();
}
@After
public void cleanup() {
directory.close();
}
@Test
public void givenSingleMatch() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenTwoMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenThreeMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR)),
new SpanTermQuery(new Term(TERM_FIELD, BAZ))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenSingleMatch_andExtraTerm() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
},
Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
SpanNearQuery 可让您查找彼此相距一定距离内的术语。
示例(来自 http://www.lucidimagination.com/blog/ 2009/07/18/the-spanquery/):
(来源:lucidimagination.com< /a>)
但对于你的例子,我能看到的唯一匹配是你的查询和目标文档都有“cd”(我假设所有这些术语都在一个字段中)。在这种情况下,您不需要使用任何特殊的查询类型。使用标准机制,您将获得一些非零权重,因为它们都在同一字段中包含相同的术语。
编辑3 - 作为对最新评论的回应,答案是您不能使用
SpanNearQuery
做除其预期用途之外的任何事情,即找出是否有多个文档中的术语彼此出现在一定数量的位置内。我无法告诉您的具体用例/预期结果是什么(请随意发布),但在最后一种情况下,如果您只想找出(“BAZ”,“EXTRA”)中的一个或多个是否在在文档中,BooleanQuery
就可以正常工作。编辑 4 - 现在您已经发布了您的用例,我明白您想要做什么。操作方法如下:使用上面提到的
BooleanQuery
来组合所需的各个术语以及SpanNearQuery
,并在上设置提升SpanNearQuery
。因此,文本形式的查询将如下所示:(
作为示例 - 这将匹配包含“BAZ”或“EXTRA”的所有文档,但为术语“BAZ”和“EXTRA”出现在 100 以内的文档分配更高的分数根据需要调整位置和提升,因此它可能无法在 Lucene 中解析,或者可能会产生不良结果,因为在下一节中我将向您展示如何构建它。通过
编程方式,您可以按如下方式构建:
希望这会有所帮助!将来,请尝试首先发布您所期望的结果 - 即使对您来说很明显,但对读者来说可能并不明显。 ,并且明确可以避免多次来回。
SpanNearQuery lets you find terms that are within a certain distance of each other.
Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):
(source: lucidimagination.com)
But for your example, the only match I can see is that both your query and the target document have "cd" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.
Edit 3 - in response to latest comment, the answer is that you cannot use
SpanNearQuery
to do anything other than that which it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post it), but in the last case if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, aBooleanQuery
will work just fine.Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a
BooleanQuery
as mentioned above to combine the individual terms you want as well as theSpanNearQuery
, and set a boost on theSpanNearQuery
.So, the query in text form would look like:
(as an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA occur within 100 places of each other; adjust the position and boost as you like. This example is from the Solr cookbook so it may not parse in Lucene, or may give undesirable results. That's ok, because in the next section I show you how to build this using the API).
Programmatically, you would construct this as follows:
Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.