Effect of filters on search results in Solr

Posted on 2024-11-17 22:38:49

When I query for "elegant" in Solr, I get results for "elegance" too.

I use these filters for index-time analysis:

WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory

And for query-time analysis:

WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory 

I want to know which filter is affecting my search results.

Comments (1)

死开点丶别碍眼 2024-11-24 22:38:49

EnglishPorterFilterFactory

That's the short answer ;)

A little more information:

"English Porter" refers to the English Porter stemming algorithm. A stemmer is a heuristic word-root builder, and according to this one, "elegant" and "elegance" share the same stem.

You can verify this with an online stemmer. Basically, you will see "eleg-ant" and "eleg-ance" both reduced to the same stem: eleg.
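To make the effect concrete, here is a deliberately simplified suffix-stripping sketch. It is not the Porter algorithm and does not use any Solr or Lucene class; the class name `ToySuffixStripper` and its two-entry suffix list are assumptions made up for illustration. It only shows why two different surface forms can end up as the same indexed term.

```java
import java.util.List;
import java.util.Locale;

public class ToySuffixStripper {
    // Hypothetical, tiny suffix list. The real Porter rules are far richer
    // and also check properties of the remaining stem before stripping.
    private static final List<String> SUFFIXES = List.of("ance", "ant");

    public static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Only strip when at least three characters of stem remain.
            if (lower.endsWith(suffix) && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(stem("elegant"));   // prints "eleg"
        System.out.println(stem("elegance"));  // prints "eleg"
    }
}
```

Because the same stemming filter runs at both index time and query time, a query for "elegant" is turned into the term "eleg", which also matches documents that originally contained "elegance".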

From Solr source:

       public void inform(ResourceLoader loader) {
            String wordFiles = args.get(PROTECTED_TOKENS);
            if (wordFiles != null) {
                try {

This is exactly where the protwords file comes into play:

                    File protectedWordFiles = new File(wordFiles);
                    if (protectedWordFiles.exists()) {
                        List<String> wlist = loader.getLines(wordFiles);
                        //This cast is safe in Lucene
                        protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
                    } else {
                        List<String> files = StrUtils
                                .splitFileNames(wordFiles);
                        for (String file : files) {
                            List<String> wlist = loader.getLines(file
                                    .trim());
                            if (protectedWords == null)
                                protectedWords = new CharArraySet(wlist,
                                        false);
                            else
                                protectedWords.addAll(wlist);
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }

That's the part that affects the stemming. There you can see the Snowball library being invoked:

        public EnglishPorterFilter create(TokenStream input) {
            return new EnglishPorterFilter(input, protectedWords);
        }

    }

    /**
     * English Porter2 filter that doesn't use reflection to
     * adapt lucene to the snowball stemmer code.
     */
    @Deprecated
    class EnglishPorterFilter extends SnowballPorterFilter {
        public EnglishPorterFilter(TokenStream source,
                CharArraySet protWords) {
            super (source, new org.tartarus.snowball.ext.EnglishStemmer(),
                    protWords);
        }
    }
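The role of the protected words loaded above can be sketched as follows. This is a standalone illustration under stated assumptions, not Solr's implementation: `ProtectedStemDemo` and `toyStem` are hypothetical names, the toy stemmer only strips two hard-coded suffixes, and a plain `Set<String>` stands in for the `CharArraySet` built from the protwords file. The idea it shows is that a token found in the protected set is passed through unchanged, while every other token is stemmed.

```java
import java.util.Locale;
import java.util.Set;

public class ProtectedStemDemo {
    // Toy stand-in for the real stemmer: strips two hard-coded suffixes only.
    public static String toyStem(String token) {
        for (String suffix : new String[] {"ance", "ant"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Protected tokens are returned untouched; everything else is stemmed.
    public static String stem(String token, Set<String> protectedWords) {
        // Lowercasing here mimics a LowerCaseFilter earlier in the chain.
        String lower = token.toLowerCase(Locale.ROOT);
        if (protectedWords.contains(lower)) {
            return lower;
        }
        return toyStem(lower);
    }

    public static void main(String[] args) {
        // Stand-in for the lines of a protwords file (hypothetical content).
        Set<String> protectedWords = Set.of("elegance");
        System.out.println(stem("elegant", protectedWords));   // prints "eleg"
        System.out.println(stem("elegance", protectedWords));  // prints "elegance"
    }
}
```

With "elegance" protected, the two words no longer collapse to the same term, so a query for "elegant" would stop matching documents that contain only "elegance". In a real setup the protected list has to be the same for the index and query analyzers, and the documents need to be reindexed after changing it.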