Supporting EdgeNGram analysis and phrase search in Solr 3.4.0

Posted on 2024-12-27 23:40:12

I want to enable "startsWith" search for each term in a Solr query while still being able to perform phrase searches (given in quotes).
For the prefix search, I first appended the suffix "*" to each term. This allows both prefix search and phrase search, but I don't like it because it is a wildcard search, and wildcard searches don't analyze the terms.

So I enabled EdgeNGramFilterFactory at index time only. The prefix search works fine, but exact phrase search no longer works.

Does anyone know how to enable phrase search while EdgeNGram is enabled?

Thanks!

Here is the relevant part of schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="back" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

I have also noticed that when using WordDelimiterFilterFactory, highlighting no longer works well.

装纯掩盖桑 2025-01-03 23:40:12

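Phrase search doesn't work because EdgeNGram generates additional terms and (surprisingly) increments the term position for each word chunk. A phrase is supposed to be exact, which means the distance (slop) between two consecutive terms is 1. With chunks, however, the indexed text looks different. Imagine you have indexed the text "Hello World". The indexed text will then look like "he hel hell hello wo wor worl world". You will find the phrase "hel hell" but not "hello world".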

As an option, you can allow some distance between words by increasing the qs parameter of the query parser (dismax).

But an "inexact phrase" search is probably not acceptable, because you will also find other, unexpected phrases such as "hel hell".
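
For reference, the qs option mentioned above could be set as a default on a dismax handler in solrconfig.xml. This is only a rough sketch: the handler name "/search", the field name "content" and the slop value 10 are assumptions for illustration and do not come from the question.

```xml
<!-- solrconfig.xml (sketch): handler name, field name and slop value are hypothetical -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- the n-grammed field from the question, assumed here to be called "content" -->
    <str name="qf">content</str>
    <!-- qs: slop applied to phrases the user types in quotes -->
    <int name="qs">10</int>
  </lst>
</requestHandler>
```

The slop needed depends on how many grams each word produces, so any fixed value is a guess, and it loosens every quoted phrase, not only the ones affected by the n-grams.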

The better option is to use a separate field for the ngrams. In this case the text is indexed in two fields, and the ngrams do not break the original text.

旧伤慢歌 2025-01-03 23:40:12

You can use two fields: one for prefix and suffix search and another one for exact match.

  <field indexed="true" name="myfield_edgy"        type="edgy"/>
  <field indexed="true" name="myfield_exactmatch"  type="exactmatch"/>
  <copyField source="myfield_exactmatch" dest="myfield_edgy"/>
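
The two field types referenced above are not shown in the answer. A minimal sketch of what they might look like follows; the analyzer chains are assumptions, and only the type names `edgy` and `exactmatch` come from the field definitions. Only front-edge grams are shown, so this covers prefix search but not suffix search.

```xml
<!-- schema.xml (sketch): analyzer chains are assumed, not taken from the answer -->
<fieldType name="exactmatch" class="solr.TextField" positionIncrementGap="100">
  <!-- same analysis at index and query time, no n-grams, so term positions stay adjacent -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="edgy" class="solr.TextField" positionIncrementGap="100">
  <!-- n-grams are generated at index time only -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <!-- the query term is left whole so that it matches one of the indexed grams -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```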

Now you can search both fields and even apply different boosts, e.g. to rank matches in myfield_exactmatch higher.
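
One way to apply such boosts, assuming the dismax query parser is used, is the qf parameter. In the sketch below the handler name "/search" and the boost value 10 are arbitrary assumptions; only the field names come from the definitions above.

```xml
<!-- solrconfig.xml (sketch): handler name and boost value are hypothetical -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- search both fields; matches in the exact field are weighted higher -->
    <str name="qf">myfield_exactmatch^10 myfield_edgy</str>
  </lst>
</requestHandler>
```

With this setup a quoted phrase can still match in myfield_exactmatch, where term positions are untouched by n-grams, while single-term prefixes match in myfield_edgy.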
