Solr: Problem with NGramFilterFactory

Posted 2024-12-10 05:24:43

I am running Solr as the search engine for an intranet with just over 40,000 documents. I keep it very simple: a copyField directive copies the title and keywords fields into the content field, and only that field is indexed.

Until now we were using this config:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />              
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

That worked pretty well, but there were complaints that wildcards had to be added manually. So I added the NGramFilterFactory as the last line in the analyzer:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />              
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30" />
</analyzer>

The problem now is that with the old config I used to find 7 docs for a certain keyword ('Sony'); now there are only 2. I completely flushed the index and rebuilt it from scratch. When I take that line out again and reindex the docs, it works as expected again. That leads me to my questions:

1. Is the FilterFactory the right thing for me, or should it be the tokenizer factory? And if it should be the tokenizer: can it run after the filters?
2. I am adding the docs as XML in batches of 75 docs and doing a commit at the very end. Should there be more commits?
3. There was another one that I forgot now .. grr

Thanks in advance!

水水月牙 2024-12-17 05:24:43

Just a wild guess -

What's the size (number of words) of your content field?
Now that you have NGramFilterFactory in your filter chain with a minGramSize of 3, a lot of tokens are going to be generated, each at a new position.
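To make "a lot of tokens" concrete, here is a rough count, assuming the filter emits every substring of each length between minGramSize and maxGramSize for every token it receives. The symbols n (token length), m (minGramSize), M (maxGramSize) and G (gram count) are introduced here only for illustration.

```latex
% Number of grams emitted for one token of length n,
% assuming every substring of each length k in [m, M] is emitted.
G(n) = \sum_{k=m}^{\min(M,\,n)} (n - k + 1), \qquad m = 3,\; M = 30

% "sony" (n = 4): k = 3 gives 2 grams, k = 4 gives 1 gram
G(4) = 2 + 1 = 3

% a 10-character token: k runs from 3 to 10
G(10) = 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36
```

Under that assumption, a field with about 280 ten-character tokens would already produce more than 10000 grams (280 × 36 = 10080).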

The maxFieldLength setting in solrconfig.xml limits the number of tokens indexed per field.
The default value is 10000 (which is still high), but it can be exceeded with large content and an n-gram filter in the filter chain.

<maxFieldLength>10000</maxFieldLength>

Try increasing this value to a high number, reindex, and check whether the matches are found.
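As a minimal sketch of such a change, the same element could be set to the largest 32-bit signed integer, which in practice removes the cap. Where the element sits inside solrconfig.xml depends on the Solr version, so this only shows the element itself; if it appears more than once in your file, each occurrence would presumably need the same change.

```xml
<!-- solrconfig.xml: same element as quoted above, with the cap raised. -->
<!-- 2147483647 is the largest 32-bit signed integer, i.e. effectively no limit. -->
<maxFieldLength>2147483647</maxFieldLength>
```

A full reindex is still needed afterwards, because tokens that were cut off under the old limit are not in the existing index.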

旧瑾黎汐 2024-12-17 05:24:43

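I would highly recommend using the Field Analysis debugging tool. It is accessible via the Solr admin site (click the [Analysis] link next to [Config]). This is a very powerful tool: it shows how a text value is broken down into words and displays the resulting tokens after they pass through each filter in the chain.

With this tool, you can take one of the documents that is not returned when you query for "Sony", paste the text being indexed into the index field and sony into the query field, and see how Solr applies your filters and then matches the query against the field. You can then change the schema back to the original one without the NGramFilterFactory and see how the document was originally broken down and matched, to compare the effect the NGramFilterFactory has on indexing and querying.

The smaller result set may be due to the minGramSize and maxGramSize values you specified in the NGramFilterFactory settings. Please refer to the NGramFilterFactory documentation on the Solr Wiki for more details on how these affect indexing.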
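When comparing the index and query sides in the analysis tool, it may help to have both analyzers written out explicitly. The question only shows the index analyzer, so the following is a sketch under assumptions, not the poster's actual schema: the field type name text_ngram is hypothetical, and the query analyzer simply mirrors the index chain without the n-gram filter, a common arrangement in which grams are generated at index time only.

```xml
<!-- Hypothetical field type; "text_ngram" is an illustrative name. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: the chain from the question, n-grams generated last. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30" />
  </analyzer>
  <!-- Query side (assumed): same chain, but no n-gram filter, so the
       query term "sony" is looked up as a single token. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>
```

With a setup like this, the indexed grams for "Sony" would include son, ony and sony, and a query for sony or a fragment such as son should match one of them without a manual wildcard. Pasting the same text into the analysis tool shows whether that is what actually happens in your installation.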
