Supporting EdgeNGram analysis and phrase search in Solr 3.4.0
I want to enable "startsWith" search for each term in a Solr query, while still being able to perform phrase searches (given in quotes).
For the prefix search, I first tried appending the suffix "*" to each term. This allows both prefix search and phrase search, but I don't like it because it is a wildcard search, and wildcard searches do not analyze the terms.
So I enabled EdgeNGramFilterFactory at index time only. The prefix search works fine, but exact phrase search no longer works.
Does anyone know how to keep phrase search working when EdgeNGram is enabled?
Thanks!
Here is the relevant part of schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="back" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I have also noticed that highlighting no longer behaves well when WordDelimiterFilterFactory is used.
Comments (3)
Phrase search does not work because EdgeNGram produces additional terms and, perhaps surprisingly, increments the term position for each chunk of the word. Phrases are expected to be exact, meaning that the distance between two consecutive terms must be 1 (no slop). With the chunks indexed, the text looks different. Imagine you have indexed the text "Hello World" using
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" side="front"/>
Then the indexed text would look like "he hel hell hello wo wor worl world", so you would find the phrase "hel hell" rather than "hello world". As an option, you could allow some distance between words by increasing the qs (query phrase slop) parameter of the dismax query parser.
But a 'not exact phrase' search may be unacceptable, because you would also match unexpected phrases such as 'hel hell'.
A better option is to use a separate field for the ngrams. The text is then indexed in two fields, and the ngrams do not break the original text.
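The separate-field approach could be sketched in schema.xml roughly as follows. This is only an illustration under assumptions: the type names text_exact and text_prefix and the field names title and title_prefix are hypothetical, and the filter chains are trimmed to the minimum needed to show the idea.

```xml
<!-- Hypothetical names throughout; adapt to your own schema. -->

<!-- Inside <types>: a plain type with no ngrams, so term positions stay intact for phrases -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Inside <types>: edge ngrams at index time only, for prefix matching -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Inside <fields>: the same text is indexed twice -->
<field name="title" type="text_exact" indexed="true" stored="true"/>
<field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>

<!-- At schema level: copy the original text into the ngram field -->
<copyField source="title" dest="title_prefix"/>
```

Quoted phrases would then be matched against the field without ngrams, while unquoted terms can hit the ngram field as prefixes.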
You can use two fields: one for prefix and suffix search and another for exact matches.
Now you can search in both fields and even apply different boosts, e.g. to rank matches in myfield_exactmatch higher.
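Searching both fields with different boosts might be configured along these lines in solrconfig.xml. Treat it as a sketch: only myfield_exactmatch comes from the answer above, while the handler name /prefixsearch, the field name myfield_ngram, and the boost values are assumptions chosen for illustration.

```xml
<!-- solrconfig.xml sketch; handler name, myfield_ngram and boosts are hypothetical -->
<requestHandler name="/prefixsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Search both fields, weighting exact matches higher -->
    <str name="qf">myfield_exactmatch^10 myfield_ngram</str>
    <!-- Apply the phrase boost only to the field without ngrams -->
    <str name="pf">myfield_exactmatch^20</str>
  </lst>
</requestHandler>
```

The same qf and pf values can also be passed as request parameters instead of handler defaults.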
Yet another option is to upgrade to Solr 3.6.0, where wildcards no longer prevent the query from being analyzed.
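As far as I know, Solr 3.6 applies a reduced "multiterm" analysis chain to wildcard terms, and a field type can declare that chain explicitly. A minimal sketch, assuming a hypothetical type name text_wild and a deliberately short chain:

```xml
<!-- Hypothetical type name; requires Solr 3.6 or later -->
<fieldType name="text_wild" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to wildcard and prefix terms such as hel*; keep the term as a single token -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a query such as Hel* would be lowercased before matching, which addresses the original objection to the "*" suffix approach without indexing ngrams.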