Solr (Lucene) indexes only the first document after adding a custom TokenFilter
I created a custom token filter that concatenates all the tokens in the stream. This is my incrementToken() function:
public boolean incrementToken() throws IOException {
    if (finished) {
        logger.debug("Finished");
        return false;
    }
    logger.debug("Starting");
    StringBuilder buffer = new StringBuilder();
    int length = 0;
    while (input.incrementToken()) {
        if (0 == length) {
            buffer.append(termAtt);
            length += termAtt.length();
        } else {
            buffer.append(" ").append(termAtt);
            length += termAtt.length() + 1;
        }
    }
    termAtt.setEmpty().append(buffer);
    //offsetAtt.setOffset(0, length);
    finished = true;
    return true;
}
I added the new filter to the end of the index and query analysis chains for a field. Testing it from http://localhost:8983/solr/admin/analysis.jsp suggests that it works: the filter concatenates the tokens in the stream. But when I re-index the documents, only my first document gets indexed.
This is what my filter chain looks like:
<analyzer type="index">
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.StopWordFilterFactory" ignoreCase="true" words="words.txt" />
  <filter class="org.custom.solr.analysis.ConcatFilterFactory" />
</analyzer>
<analyzer type="query">
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.StopWordFilterFactory" ignoreCase="true" words="words.txt" />
  <filter class="org.custom.solr.analysis.ConcatFilterFactory" />
</analyzer>
Without the ConcatFilterFactory, all words are indexed properly, but with the ConcatFilterFactory only the first document is indexed. What am I doing wrong? Please help me understand the problem.
UPDATE:
Finally figured out the issue. The finished flag has to be cleared before returning false:
if (finished) {
    logger.debug("Finished");
    finished = false;
    return false;
}
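Looks like the same filter instance is being reused for every document, so a flag left at true after the first document makes every later stream look exhausted. Makes sense.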
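To make the reuse problem concrete, here is a minimal plain-Java sketch. It is a simplified model, not the Lucene API: ConcatModel, setInput, term and ConcatReuseDemo are hypothetical names made up for illustration. In Lucene itself, consumers call TokenStream.reset() before each new stream, so an overridden reset() is typically where per-document flags like finished get cleared. The model mirrors that by clearing the flag in its own reset() method rather than inside incrementToken().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a concatenating filter; this is NOT the Lucene TokenFilter API.
class ConcatModel {
    private Iterator<String> input = Collections.emptyIterator();
    private String term = "";
    private boolean finished = false;

    // Point the same instance at the next document's tokens (models instance reuse).
    void setInput(List<String> tokens) {
        input = tokens.iterator();
    }

    // Models a reset hook: clear per-document state before each new stream.
    void reset() {
        finished = false;
        term = "";
    }

    // Emits one concatenated token per document, then reports exhaustion.
    boolean incrementToken() {
        if (finished) {
            return false;
        }
        StringBuilder buffer = new StringBuilder();
        while (input.hasNext()) {
            if (buffer.length() > 0) {
                buffer.append(' ');
            }
            buffer.append(input.next());
        }
        term = buffer.toString();
        finished = true;
        return true;
    }

    String term() {
        return term;
    }
}

class ConcatReuseDemo {
    // Run one document through a (possibly reused) instance and collect its output.
    static List<String> analyze(ConcatModel filter, List<String> tokens, boolean callReset) {
        filter.setInput(tokens);
        if (callReset) {
            filter.reset();
        }
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(filter.term());
        }
        return out;
    }

    public static void main(String[] args) {
        // Without reset, the stale flag makes the second document produce nothing.
        ConcatModel stale = new ConcatModel();
        System.out.println(analyze(stale, Arrays.asList("quick", "fox"), false)); // [quick fox]
        System.out.println(analyze(stale, Arrays.asList("lazy", "dog"), false));  // []

        // With reset before each document, the reused instance works every time.
        ConcatModel fresh = new ConcatModel();
        System.out.println(analyze(fresh, Arrays.asList("quick", "fox"), true));  // [quick fox]
        System.out.println(analyze(fresh, Arrays.asList("lazy", "dog"), true));   // [lazy dog]
    }
}
```

Clearing the flag inside incrementToken(), as in the update above, also makes the instance reusable. Keeping the cleanup in a dedicated reset step has the advantage that it still runs when a consumer stops reading a stream before it is exhausted.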
Comments (1)
You should write a unit test for your filter. It should fail even if your Analysis works. Apparently you forgot to add this line before returning false: