Protected words in facet results?
I'm using Lucene with Solr to index some documents (news). Those documents also have a HEADLINE field.
Now I am trying to run a faceted search over the HEADLINE field to find the terms with the highest count.
All of this works without a problem, including a stopword list.
The HEADLINE field is a multi-valued field. I use the solr.StandardTokenizerFactory
to split those fields into single terms (I know this is not best practice, but it's the only way, and it works).
Sometimes the tokenizer splits terms that shouldn't be split, like 9/11
(which is split into 9 and 11). So I decided to use a "protwords" list, and "9/11" is part of that list. But nothing changed.
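For reference, protwords.txt is a plain-text file with one term per line (lines starting with # are comments), so in my case it simply contains:

# protwords.txt - terms that must not be modified by the analyzers
9/11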
Here is the relevant part of my schema.xml:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
Looking at the facet results, I see lots of documents dealing with "9/11" grouped (faceted) under "9" or "11", but never under "9/11".
Why doesn't this work?
Thank you.
The problem is that you cannot set protected words for any filter/tokenizer you like; only certain filters support that feature. Therefore, the StandardTokenizer ignores your protected words and splits 9/11 into '9' and '11' anyway. Using a WhitespaceTokenizer instead would ensure that 9/11 does not get split.
In addition, it does not look like the StopFilterFactory acknowledges protected words either (it just filters out stop words like 'to' or 'and'). The WordDelimiterFilterFactory, however, does use protected words, so you might experiment with that to see if it helps.
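As a rough sketch (untested, and the WordDelimiterFilterFactory attributes shown are just one plausible configuration, not something taken from your setup), the field type could be rewritten like this:

<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- keeps "9/11" as one token; only whitespace separates terms -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- this filter actually honors protected="...": terms listed in
         protwords.txt pass through unchanged, everything else is split
         on punctuation and letter/digit boundaries -->
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

A nice side effect: the WhitespaceTokenizer leaves punctuation attached to tokens (e.g. "attack,"), but the WordDelimiterFilterFactory then strips it from every term that is not in the protected list.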
The best way to see how your documents are analyzed is to use the built-in Solr administration utility, which shows how a field is broken down when it is indexed or queried.
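Assuming a default local install (and depending on your Solr version), the analysis page of that utility is reachable at something like:

http://localhost:8983/solr/admin/analysis.jsp

Paste a sample headline into your facet_headline field there and you can watch, stage by stage, what each tokenizer and filter does to it.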