PatternTokenizerFactory and stop words

Posted on 2024-11-16 18:11:38

A document field in Solr/Lucene called COLORS holds groups of words like this:

field1: blue/dark red/green
field2: blue/yellow/orange
[...]

I need to run a faceted search over that field to get all the colors and the count of each color.
First I tried the PatternTokenizerFactory, followed by a stopword list:

<analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="/" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords"
                enablePositionIncrements="true"
        />
</analyzer>
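
To make the intended behavior of this chain concrete, here is a minimal sketch in Python rather than Solr. It only imitates the four steps (split on "/", lowercase, trim, drop stopwords). The function name `analyze` and the sample stopword sets are hypothetical and not part of Solr. It assumes, as a stop filter does, that a stopword is compared against the whole token.

```python
import re

def analyze(value, stopwords):
    """Rough stand-in for the analyzer chain: split on "/", lowercase, trim, drop stopwords."""
    tokens = re.split(r"/", value)            # pattern tokenizer step, pattern="/"
    tokens = [t.lower() for t in tokens]      # lowercase step
    tokens = [t.strip() for t in tokens]      # trim step
    stop = {w.lower() for w in stopwords}     # ignoreCase="true"
    # A stopword only removes a token when it equals the entire token.
    return [t for t in tokens if t and t not in stop]

print(analyze("blue/dark red/green", {"green"}))  # ['blue', 'dark red']
print(analyze("blue/dark red/green", {"red"}))    # ['blue', 'dark red', 'green']
```

In this sketch a stopword entry such as "red" leaves "dark red" untouched, because the token produced by the "/" split is the full phrase. Whether that is what happens in my setup is exactly what I am unsure about.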

Unfortunately, the stopword list seems to be ignored: stopwords still show up in the faceted search results.

This SO question describes the same problem. Unfortunately, the posted solution doesn't work for me: I cannot use solr.StandardTokenizerFactory, because the standard tokenizer also splits tokens on whitespace. That means "dark red" becomes "dark" and "red", which is wrong.
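
The difference between the two tokenizers can be illustrated with a small Python sketch. This is an approximation only: the real standard tokenizer follows Unicode word-break rules, which `\w+` merely imitates for this simple input, and the function names are hypothetical.

```python
import re

def split_on_slash(value):
    """Imitates the pattern tokenizer with pattern="/": phrases stay whole."""
    return [t for t in re.split(r"/", value) if t]

def split_on_words(value):
    """Crude imitation of a word-based tokenizer: every word becomes its own token."""
    return re.findall(r"\w+", value)

value = "blue/dark red/green"
print(split_on_slash(value))  # ['blue', 'dark red', 'green']
print(split_on_words(value))  # ['blue', 'dark', 'red', 'green']
```

With word-based splitting, the facet for "dark red" would be counted as two separate colors, which is why that solution is not an option here.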

Is there any way to use the pattern tokenizer?

Thank you for any kind of help!
