Solr (Lucene) indexes only the first document after adding a custom TokenFilter

Posted on 2024-12-07 09:44:35

I created a custom token filter that concatenates all the tokens in the stream. This is my incrementToken() function:

public boolean incrementToken() throws IOException {                        
    if (finished) {                                                         
        logger.debug("Finished");                                           
        return false;                                                       
    }                                                                       
    logger.debug("Starting");                                               
    StringBuilder buffer = new StringBuilder();                             
    int length = 0;                                                         
    while (input.incrementToken()) {                                        
        if (0 == length) {                                                  
            buffer.append(termAtt);                                         
            length += termAtt.length();                                     
        } else {                                                            
            buffer.append(" ").append(termAtt);                             
            length += termAtt.length() + 1;                                 
        }                                                                   
    }                                                                       
    termAtt.setEmpty().append(buffer);                                      
    //offsetAtt.setOffset(0, length);                                       
    finished = true;                                                        
    return true;                                                            
}

I added the new filter to the end of the index and query analysis chains for a field, and testing it at http://localhost:8983/solr/admin/analysis.jsp suggests it works: the filter concatenates the tokens in the stream. But when I re-index the documents, only the first document gets indexed.

This is what my filter chain looks like:

        <analyzer type="index">                                             
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
            <tokenizer class="solr.WhitespaceTokenizerFactory" />           
            <filter class="solr.LowerCaseFilterFactory" />                  
            <filter class="solr.StopWordFilterFactory" ignoreCase="true" words="words.txt" />
            <filter class="org.custom.solr.analysis.ConcatFilterFactory" />
        </analyzer>                                                         
        <analyzer type="query">                                             
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
            <tokenizer class="solr.WhitespaceTokenizerFactory" />           
            <filter class="solr.LowerCaseFilterFactory" />                  
            <filter class="solr.StopWordFilterFactory" ignoreCase="true" words="words.txt" />
            <filter class="org.custom.solr.analysis.ConcatFilterFactory" />
        </analyzer>

Without ConcatFilterFactory, all words are indexed properly, but with it only the first document is indexed. What am I doing wrong? Please help me understand the problem.

UPDATE:

Finally figured out the issue. Resetting the flag before returning false fixes it:

if (finished) {                                                         
    logger.debug("Finished"); 
    finished = false;                                  
    return false;                                                       
}

It looks like the same filter instance is reused across documents, so the finished flag stays set after the first one. Makes sense.
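To make the reuse problem concrete, here is a minimal plain-Java sketch. It deliberately does not use Lucene: ConcatModel, indexAll and the resetClearsFlag switch are hypothetical names invented for illustration, and the model only mimics a consumer that resets one shared filter instance per document and then pulls tokens until incrementToken() returns false. In Lucene itself, per-stream state like this is conventionally cleared in an overridden reset() that also calls super.reset(), which is what the second variant imitates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class ConcatReuseDemo {

    /** Hypothetical stand-in for the concatenating filter; not a Lucene class. */
    static class ConcatModel {
        private final boolean resetClearsFlag; // true = clear 'finished' on reset
        private Iterator<String> input = Collections.emptyIterator();
        private boolean finished = false;
        private String term = "";

        ConcatModel(boolean resetClearsFlag) {
            this.resetClearsFlag = resetClearsFlag;
        }

        /** Called by the consumer before each new document's token stream. */
        void reset(List<String> tokens) {
            input = tokens.iterator();
            if (resetClearsFlag) {
                finished = false;
            }
        }

        /** Emits one token holding all input tokens joined by spaces. */
        boolean incrementToken() {
            if (finished) {
                return false; // stale 'true' here silently drops a document
            }
            StringBuilder buffer = new StringBuilder();
            while (input.hasNext()) {
                if (buffer.length() > 0) {
                    buffer.append(' ');
                }
                buffer.append(input.next());
            }
            term = buffer.toString();
            finished = true;
            return true;
        }

        String term() {
            return term;
        }
    }

    /** Feeds every document through the SAME filter instance. */
    static List<String> indexAll(ConcatModel filter, List<List<String>> docs) {
        List<String> indexed = new ArrayList<>();
        for (List<String> doc : docs) {
            filter.reset(doc);
            while (filter.incrementToken()) {
                indexed.add(filter.term());
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("solr", "lucene"),
                Arrays.asList("token", "filter"));
        // Flag never cleared: prints [solr lucene]
        System.out.println(indexAll(new ConcatModel(false), docs));
        // Flag cleared on reset: prints [solr lucene, token filter]
        System.out.println(indexAll(new ConcatModel(true), docs));
    }
}
```

Clearing the flag inside incrementToken(), as in the update above, works as long as the consumer always reads the stream to the end. Clearing it on reset does not depend on that.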

情泪▽动烟 2024-12-14 09:44:35

You should write a unit test for your filter. It would fail even though the analysis page appears to work. Apparently you forgot to add this line before returning false:

finished = false;