Protected words in facet results?
I'm using Lucene with Solr to index some documents (news). Those documents also have a HEADLINE field.
Now I am trying to run a faceted search over the HEADLINE field to find the terms with the highest count.
All of this works without a problem, including a stopword list.
The HEADLINE field is a multi-valued field. I use the solr.StandardTokenizerFactory
to split those fields into single terms (I know this is not best practice, but it's the only way, and it works).
Sometimes the tokenizer splits terms that shouldn't be split, like 9/11
(which is split into 9 and 11). So I decided to use a "protwords" list, and "9/11" is part of that list. But nothing changed.
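For reference, protwords.txt is a plain-text file with one term per line (lines starting with # are comments), so in my case it simply contains:

# protwords.txt - terms that must not be modified by the analyzers
9/11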
Here is the relevant part of my schema.xml:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
Looking at the facet results, I see lots of documents dealing with "9/11" grouped (faceted) under "9" or "11", but never under "9/11".
Why doesn't this work?
Thank you.
The problem is that you cannot set protected words for any filter/tokenizer you like; only certain filters support that feature. Therefore, the StandardTokenizer ignores your protected words and splits 9/11 into '9' and '11' anyway. Using a WhitespaceTokenizer instead would ensure that 9/11 does not get split.
In addition, it does not look like the StopFilterFactory acknowledges protected words either (it just filters out stop words like 'to' or 'and'). The WordDelimiterFilterFactory, however, does use protected words, so you might experiment with that to see if it helps.
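As a rough sketch (untested, and the WordDelimiterFilterFactory attributes shown are just one plausible configuration, not something taken from your setup), the field type could be rewritten like this:

<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- keeps "9/11" as one token; only whitespace separates terms -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- this filter actually honors protected="...": terms listed in
         protwords.txt pass through unchanged, everything else is split
         on punctuation and letter/digit boundaries -->
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

A nice side effect: the WhitespaceTokenizer leaves punctuation attached to tokens (e.g. "attack,"), but the WordDelimiterFilterFactory then strips it from every term that is not in the protected list.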
The best way to see how your documents are analyzed is to use the built-in Solr administration utility, which shows how a field is broken down when it is indexed or queried.
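Assuming a default local install (and depending on your Solr version), the analysis page of that utility is reachable at something like:

http://localhost:8983/solr/admin/analysis.jsp

Paste a sample headline into your facet_headline field there and you can watch, stage by stage, what each tokenizer and filter does to it.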