Effect of filters on search results in Solr

Posted on 2024-11-17 22:38:49

When I query for "elegant" in Solr, I also get results for "elegance".

I use these filters for index-time analysis:

WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory

and these for query-time analysis:

WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory

I want to know which filter is affecting my search results.

死开点丶别碍眼 2024-11-24 22:38:49

EnglishPorterFilterFactory

That's the short answer ;)

A little more information:

"English Porter" refers to the Porter stemming algorithm for English. According to the stemmer (a heuristic word-root builder), "elegant" and "elegance" have the same stem.

You can verify this online, e.g. here. Basically, you will see that "elegant" and "elegance" are both reduced to the same stem: eleg.
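To make the effect concrete, here is a minimal sketch of suffix stripping. It is a deliberately simplified toy and not the Porter2/Snowball algorithm that Solr actually runs; the class name `ToySuffixStemmer` and its two-entry suffix list are assumptions made purely for illustration. Because the stemmer sits in both the index and the query chain, both words end up as the same term.

```java
import java.util.List;
import java.util.Locale;

/** Toy suffix stripper for illustration only; NOT the real Porter2/Snowball stemmer. */
final class ToySuffixStemmer {

    // Hypothetical suffix list, chosen only to cover the two words from the question.
    private static final List<String> SUFFIXES = List.of("ance", "ant");

    /** Lowercases the word and strips the first matching suffix, if any. */
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so very short words are left alone.
            if (lower.endsWith(suffix) && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        String indexedTerm = stem("elegance"); // what the index chain would store
        String queryTerm = stem("elegant");    // what the query chain would look up
        System.out.println(indexedTerm + " / " + queryTerm
                + " / match=" + indexedTerm.equals(queryTerm));
    }
}
```

The point of the sketch is only that two different surface forms collapse to one indexed term, which is why a query for one finds documents containing the other.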

From Solr source:

       public void inform(ResourceLoader loader) {
            String wordFiles = args.get(PROTECTED_TOKENS);
            if (wordFiles != null) {
                try {

This is exactly where the protwords file comes into play:

                    File protectedWordFiles = new File(wordFiles);
                    if (protectedWordFiles.exists()) {
                        List<String> wlist = loader.getLines(wordFiles);
                        //This cast is safe in Lucene
                        protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
                    } else {
                        List<String> files = StrUtils
                                .splitFileNames(wordFiles);
                        for (String file : files) {
                            List<String> wlist = loader.getLines(file
                                    .trim());
                            if (protectedWords == null)
                                protectedWords = new CharArraySet(wlist,
                                        false);
                            else
                                protectedWords.addAll(wlist);
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }
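The practical consequence of the protected-word set can be sketched as follows: a token found in the set is passed through unchanged, and every other token is stemmed. This is a self-contained approximation, not Solr code. `ProtectedWordsSketch`, `mergeWordLists`, `toyStem`, and `analyze` are hypothetical names, a plain `HashSet` stands in for `CharArraySet`, and the stemmer is the same kind of toy suffix stripper rather than Snowball.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of protected words; all names here are hypothetical. */
final class ProtectedWordsSketch {

    /** Merges several word lists into one case-sensitive set, similar to the loop in inform(). */
    static Set<String> mergeWordLists(List<List<String>> wordLists) {
        Set<String> merged = new HashSet<>();
        for (List<String> words : wordLists) {
            merged.addAll(words);
        }
        return merged;
    }

    /** Toy stand-in for the stemmer: strips a trailing "ance" or "ant". */
    static String toyStem(String token) {
        for (String suffix : List.of("ance", "ant")) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    /** Protected tokens bypass stemming; everything else is stemmed. */
    static String analyze(String token, Set<String> protectedWords) {
        return protectedWords.contains(token) ? token : toyStem(token);
    }

    public static void main(String[] args) {
        // Two word lists, as if two protected-word files had been configured.
        Set<String> protectedWords =
                mergeWordLists(List.of(List.of("elegance"), List.of("maintenance")));
        System.out.println(analyze("elegance", protectedWords)); // protected, left as-is
        System.out.println(analyze("elegant", protectedWords));  // not protected, stemmed
    }
}
```

Under these assumptions, protecting "elegance" keeps it as its own term, so a query for "elegant" (stemmed to "eleg") would no longer match it.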

This next part is what actually affects the stemming; here you can see the Snowball library being invoked:

        public EnglishPorterFilter create(TokenStream input) {
            return new EnglishPorterFilter(input, protectedWords);
        }

    }

    /**
     * English Porter2 filter that doesn't use reflection to
     * adapt lucene to the snowball stemmer code.
     */
    @Deprecated
    class EnglishPorterFilter extends SnowballPorterFilter {
        public EnglishPorterFilter(TokenStream source,
                CharArraySet protWords) {
            super (source, new org.tartarus.snowball.ext.EnglishStemmer(),
                    protWords);
        }
    }
