Position offsets for phrase queries in Lucene
I am customizing the Highlighter plugin (using FVH, the FastVectorHighlighter) to output the position offsets of the query terms for a given search. So far I have been able to extract the offset information for normal queries using the code below. For phrase queries, however, the code returns the positions of every occurrence of every query term (i.e. the termSet), even occurrences that are not part of a phrase match. Is there a way in Lucene to get the offset information of only the matched phrase for phrase queries using FVH?
// In DefaultSolrHighlighter.java::doHighlightingByFastVectorHighlighter()
SolrIndexSearcher searcher = req.getSearcher();
TermFreqVector[] tvector = searcher.getReader().getTermFreqVectors(docId);
// Note: this takes the first field that stores term vectors.
TermPositionVector tvposition = (TermPositionVector) tvector[0];
Set<String> termSet = highlighter.getHitTermSet(fieldQuery, fieldName);
List<String> hitOffsetPositions = new ArrayList<String>();
for (String term : termSet)
{
    int index = tvposition.indexOf(term);
    if (index < 0)
        continue; // query term does not occur in this document
    int[] positions = tvposition.getTermPositions(index);
    // Join this term's positions into a comma-separated string.
    StringBuilder sb = new StringBuilder();
    for (int pos : positions)
    {
        if (sb.length() > 0)
            sb.append(',');
        sb.append(pos);
    }
    hitOffsetPositions.add(sb.toString());
}
if (snippets != null && snippets.length > 0)
{
    docSummaries.add(fieldName, snippets);
    docSummaries.add("hitOffsetPositions", hitOffsetPositions);
}

// In FastVectorHighlighter.java
// Wrapper function to get the query terms
public Set<String> getHitTermSet(FieldQuery fieldQuery, String fieldName)
{
    return fieldQuery.getTermSet(fieldName);
}
Current Output:
<lst name="6H500F0">
  <arr name="name">
    <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
  </arr>
  <arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
    <str>10</str>
  </arr>
Expected Output:
<lst name="6H500F0">
  <arr name="name">
    <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
  </arr>
  <arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
  </arr>
The field that I am trying to highlight has termVectors="true", termPositions="true" and termOffsets="true", and I am using Lucene 3.1.0.
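The filtering asked for here can be sketched independently of FVH. The following is only a minimal illustration, assuming an exact phrase (slop 0) whose terms must occur at consecutive positions; PhrasePositionFilter and matchedPositions are hypothetical names, not part of Lucene or Solr, and the position map stands in for what TermPositionVector.getTermPositions() returns per term.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper, not part of Lucene or Solr. */
final class PhrasePositionFilter {

    /**
     * Returns the positions of all term occurrences that take part in an
     * exact match of the phrase (assumption: slop 0, consecutive positions).
     *
     * @param phraseTerms   the phrase terms in query order
     * @param termPositions term -> positions of that term in the field
     */
    static SortedSet<Integer> matchedPositions(List<String> phraseTerms,
                                               Map<String, int[]> termPositions) {
        SortedSet<Integer> matched = new TreeSet<Integer>();
        if (phraseTerms.isEmpty()) {
            return matched;
        }
        int[] firstPositions = termPositions.get(phraseTerms.get(0));
        if (firstPositions == null) {
            return matched; // first phrase term never occurs
        }
        // Try every occurrence of the first term as a phrase start.
        for (int start : firstPositions) {
            boolean allPresent = true;
            for (int i = 1; i < phraseTerms.size() && allPresent; i++) {
                // The i-th phrase term must sit exactly i positions after the start.
                allPresent = contains(termPositions.get(phraseTerms.get(i)), start + i);
            }
            if (allPresent) {
                for (int i = 0; i < phraseTerms.size(); i++) {
                    matched.add(start + i);
                }
            }
        }
        return matched;
    }

    private static boolean contains(int[] positions, int wanted) {
        if (positions == null) {
            return false;
        }
        for (int p : positions) {
            if (p == wanted) {
                return true;
            }
        }
        return false;
    }
}
```

With the positions from the output above ("hard" at 2, "drive" at 3 and 10) and the phrase "hard drive", this keeps positions 2 and 3 and drops 10, because no "hard" occurs at position 9.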
Comments (1)
I wasn't able to get the FVH to handle phrase queries correctly, and wound up having to develop my own summarizer. The gist of my approach is discussed here; what I wound up doing is creating an array of objects, one for each term that I pulled from the queries. Each object contains a word index and its position, and whether it was already used in some match. These are the TermAtPosition instances in the sample below. Then, given a position span and an array of word identities (indexes) corresponding to a phrase query, I iterated through the array, looking to match all term indexes within the given span. If I found a match, I marked each matching term as consumed and added the matching span to a list of matches. I could then use these matches to score sentences. Here is the matching code:

This approach seems to work, but it is greedy. Given a sequence "a a b c", it will match the first a (leaving the second a alone), and then match b and c. I think a bit of recursion or integer programming could be applied to make it less greedy, but I couldn't be bothered, and wanted a faster rather than a more accurate algorithm anyway.
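A minimal sketch of the greedy matching described above, written under assumptions rather than taken from the original code: the occurrences are assumed to be sorted by position, the phrase terms are assumed to match in query order, and "span" is read as the largest allowed distance between the first and last matched position. Apart from TermAtPosition, every name here (GreedyPhraseMatcher, findMatches, maxSpan) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; names and details are assumptions. */
final class GreedyPhraseMatcher {

    /** One occurrence of a query term in the document. */
    static final class TermAtPosition {
        final int termIndex; // identity of the term (index into the query's term list)
        final int position;  // token position in the field
        boolean used;        // true once consumed by a match

        TermAtPosition(int termIndex, int position) {
            this.termIndex = termIndex;
            this.position = position;
        }
    }

    /**
     * @param terms   term occurrences sorted by ascending position
     * @param phrase  term indexes of the phrase, in query order
     * @param maxSpan largest allowed distance between first and last matched position
     * @return one {start, end} position pair per match
     */
    static List<int[]> findMatches(TermAtPosition[] terms, int[] phrase, int maxSpan) {
        List<int[]> matches = new ArrayList<int[]>();
        if (phrase.length == 0) {
            return matches;
        }
        int[] picked = new int[phrase.length];
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].used || terms[i].termIndex != phrase[0]) {
                continue;
            }
            picked[0] = i;
            int next = 1;
            // Greedily take the first unused occurrence of each following phrase term.
            for (int j = i + 1; j < terms.length && next < phrase.length; j++) {
                if (terms[j].position - terms[i].position > maxSpan) {
                    break; // outside the allowed span
                }
                if (!terms[j].used && terms[j].termIndex == phrase[next]) {
                    picked[next++] = j;
                }
            }
            if (next == phrase.length) {
                for (int k : picked) {
                    terms[k].used = true; // consume the matched occurrences
                }
                matches.add(new int[] { terms[i].position, terms[picked[next - 1]].position });
            }
        }
        return matches;
    }
}
```

For the sequence "a a b c" (term indexes 0, 0, 1, 2 at positions 0 to 3) and the phrase "a b c", a maxSpan of 3 yields the single match {0, 3}: the first a is consumed and the second is left alone, which is the greedy behaviour noted above. With a maxSpan of 2 the first a cannot reach c, so the match becomes {1, 3}.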