SOLR: Problems with NGramFilterFactory
I am running Solr as the search engine for an intranet with just over 40,000 docs. I keep it very simple by using the copyField directive to copy the title and keywords fields to the content field and indexing only that.
Until now we were using this config:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
That worked pretty well, but there were complaints that the wildcard had to be set manually. So I added the NGramFilterFactory as the last line in the analyzer:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30" />
</analyzer>
The problem now is: with the old config I used to find 7 docs with a certain keyword ('Sony'). Now there are only 2. I completely flushed the index and built it up from scratch. When I take that line out again and reindex the docs, it works as expected again. That leads me to my questions:
- Is the FilterFactory the right thing for me, or should it be the tokenizer factory? And if the tokenizer: can it run after the filters?
- I am adding the docs as XML in batches of 75 docs and doing a commit at the very end. Should there be more commits?
- There was another one that I have forgotten now .. grr
Thanks in advance!
Comments (2)
Just a wild guess:
What is the size (number of words) of your content field?
Now that you have NGramFilterFactory in your filter chain with a minGramSize of 3, a lot of tokens are going to be generated, each at a new position.
The maxFieldLength setting in solrconfig.xml limits the number of tokens to be indexed per field.
The default value is 10000 (which is already high), but it can be exceeded with large content and an n-gram filter in the filter chain.
Try increasing this value to a high number, reindex, and check whether the matches are found.
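As a rough illustration of that suggestion, the fragment below raises the limit. It is only a sketch and assumes the older solrconfig.xml layout (Solr 1.x/3.x), where the stock example config carries the setting in both indexDefaults and mainIndex. The value 2147483647 is an arbitrary "effectively unlimited" choice, not something taken from the question.

```xml
<!-- solrconfig.xml fragment; layout assumed from the Solr 1.x/3.x example config -->
<config>
  <indexDefaults>
    <!-- Upper bound on the number of tokens indexed per field.
         Tokens beyond this limit are not indexed. -->
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <!-- Assumption: the setting is repeated here, as in the stock example config;
         keep both values in sync. -->
    <maxFieldLength>2147483647</maxFieldLength>
  </mainIndex>
</config>
```

After changing the value, a full reindex is needed, because documents that were truncated at index time stay truncated until they are indexed again.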
I would highly recommend using the Field Analysis debugging tool. It is accessible via the Solr Admin site (click the [Analysis] link next to [Config]). This is a very powerful tool that shows how a text value is broken down into words and what the resulting tokens look like after they pass through each filter in the chain.
With this tool you can take one of the documents that is not being returned when you query for "Sony", paste the text to be indexed into the index field and sony into the query field, and see how Solr applies your filters and then matches the query against that field. You can then change your schema back to the original without the NGramFilterFactory and see how the document was originally broken down and matched, so you can compare how the NGramFilterFactory has affected indexing and querying.
The smaller result set could be caused by the minGramSize and maxGramSize values that you have specified for the NGramFilterFactory. Please refer to the NGramFilterFactory documentation on the Solr Wiki for more details on how these affect indexing.
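To make the index-versus-query comparison concrete, here is a minimal sketch of a field type with separate analyzers, where n-grams are produced only at index time. The question shows only the index analyzer, so the query analyzer, the field type name text_ngram, and the reduced maxGramSize of 15 are all assumptions made for illustration.

```xml
<!-- schema.xml fragment; "text_ngram" is a hypothetical name -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <!-- Assumed value: a smaller maxGramSize produces fewer tokens per word -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <!-- Same chain without the n-gram filter, so the query term stays whole
         and is matched against the grams stored in the index -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldType>
```

Pasting the same text into the Analysis page for a field of this type shows the index-side and query-side tokens side by side. Note that query terms shorter than minGramSize have no matching gram in the index.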