Solr: exact phrase queries with EdgeNGramFilterFactory

In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries?

For example, I'm looking for a field that, if it contains "contrat informatique", will be found if the user types:

  • contrat
  • informatique
  • contr
  • informa
  • "contrat informatique"
  • "contrat info"

Currently, I have something like this:

<fieldtype name="terms" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldtype>

...but it failed on phrase queries.

When I look at the schema analyzer in the Solr admin, I see that "contrat informatique" generates the following tokens:

[...] contr contra contrat in inf info infor inform [...]

So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are separated).

I'm pretty sure any kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.

Comments (4)

揽清风入怀 2024-12-14 04:25:26

Exact phrase search does not work because the query slop parameter defaults to 0.
When searching for a phrase like "Hello World", Solr looks for the terms at consecutive positions.
I wish EdgeNGramFilter had a parameter to control output positioning; this looks like an old issue.

By setting the qs parameter to some very high value (greater than the maximum distance between ngrams), you can get phrases back. This only partially solves the problem: phrases are allowed, but matches are not exact, and permutations are found as well.
So a search for "contrat informatique" would also match text like "...contract abandoned. Informatique...".
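
For illustration (my sketch, not from the original answer; the field name text is hypothetical): with the dismax/edismax parsers, qs is the slop applied to the user's explicit phrase queries, and with the standard lucene parser the slop can be attached to the phrase directly:

    q="contrat informatique"&defType=edismax&qf=text&qs=99
    q="contrat informatique"~99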

To support exact phrase queries, I ended up using separate fields for the ngrams.

Steps required:

Define separate field types to index regular values and grams:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="ngrams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Tell Solr to copy fields when indexing. You can define a separate ngram field for each source field:

<field name="contact_ngrams" type="ngrams" indexed="true" stored="false"/>
<field name="product_ngrams" type="ngrams" indexed="true" stored="false"/>
<copyField source="contact_text" dest="contact_ngrams"/>
<copyField source="product_text" dest="product_ngrams"/>

Or you can put all ngrams into one field:

<field name="heap_ngrams" type="ngrams" indexed="true" stored="false"/>
<copyField source="*_text" dest="heap_ngrams"/>

Note that you won't be able to apply separate boosts per field in this case.

And the last thing is to specify the ngram fields and boosts in the query.
One way is to configure your application.
Another way is to specify "appends" params in solrconfig.xml:

   <lst name="appends">
     <str name="qf">heap_ngrams</str>
   </lst>
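
For context, here is a minimal sketch of where such an appends section could sit in solrconfig.xml (the handler name and the defaults are my assumptions, not from the original answer):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">contact_text product_text</str>
  </lst>
  <!-- appended to every request, so the ngram field is always searched too -->
  <lst name="appends">
    <str name="qf">heap_ngrams</str>
  </lst>
</requestHandler>
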
闻呓 2024-12-14 04:25:26

Alas, as I could not manage to use a PositionFilter correctly as Jayendra Patil suggested (PositionFilter turns any query into an OR boolean query), I used a different approach.

Still with the EdgeNGramFilter, I made every keyword the user types mandatory, and disabled all phrase queries.

So if the user asks for "cont info", it is transformed into +cont +info. That's a bit more permissive than a true phrase query would be, but it manages to do what I want (and doesn't return results with only one of the two terms).

The only downside of this workaround is that the terms can be permuted in the results (so a document with "informatique contrat" will also be found), but that's not a big deal.
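
A minimal sketch of that client-side rewrite (my illustration, not code from the original answer; the class and method names are invented):

// Hypothetical helper: rewrite user input so every keyword is mandatory
// and no phrase query is ever issued, e.g. "cont info" -> +cont +info
public final class MandatoryTermsQuery {
    public static String rewrite(String userInput) {
        StringBuilder query = new StringBuilder();
        // strip the quotes so no phrase query survives, then split on whitespace
        for (String term : userInput.replace("\"", "").trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(term);
        }
        return query.toString();
    }
}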

半寸时光 2024-12-14 04:25:26

Here is what I was thinking:
for the ngrams to be phrase matched, the positions of the tokens generated for each word should be the same.
I checked the edge ngram filter and it increments the token positions, and I didn't find any parameter to prevent that.
There is a PositionFilter available, and it keeps every token at the same position as the one at the beginning.
So if the following configuration is used, all tokens are at the same position and the phrase query matches (tokens at the same position are matched as phrases).
I checked it through the analysis tool and the queries matched.

So you might want to try this hint:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <charFilter class="solr.MappingCharFilterFactory" 
            mapping="mapping-ISOLatin1Accent.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" 
            maxGramSize="15" side="front"/>
    <filter class="solr.PositionFilterFactory" />
</analyzer>
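
Only the index-side analyzer is shown above; presumably the query side would apply the same normalization without the ngram and position filters (my sketch, not part of the original answer):

<analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" 
            mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
</analyzer>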
榆西 2024-12-14 04:25:26

I've made a fix to EdgeNGramFilter so positions within a token are not incremented anymore:

// Imports as they would look on Lucene/Solr 3.x (my reconstruction; the
// original answer omitted them). TokenFilterFactory is an interface in
// Solr 3.x, so the factory extends BaseTokenFilterFactory instead.
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class CustomEdgeNGramTokenFilterFactory extends BaseTokenFilterFactory {
    private int maxGramSize = 0;
    private int minGramSize = 0;

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        String maxArg = args.get("maxGramSize");
        maxGramSize = (maxArg != null ? Integer.parseInt(maxArg)
                : EdgeNGramTokenFilter.DEFAULT_MAX_GRAM_SIZE);

        String minArg = args.get("minGramSize");
        minGramSize = (minArg != null ? Integer.parseInt(minArg)
                : EdgeNGramTokenFilter.DEFAULT_MIN_GRAM_SIZE);
    }

    @Override
    public CustomEdgeNGramTokenFilter create(TokenStream input) {
        return new CustomEdgeNGramTokenFilter(input, minGramSize, maxGramSize);
    }
}

public class CustomEdgeNGramTokenFilter extends TokenFilter {
    private final int minGram;
    private final int maxGram;
    private char[] curTermBuffer;
    private int curTermLength;
    private int curGramSize;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute positionIncrementAttribute = addAttribute(PositionIncrementAttribute.class);

    /**
     * Creates an EdgeNGramTokenFilter that generates n-grams in the sizes of the given range.
     *
     * @param input   {@link org.apache.lucene.analysis.TokenStream} holding the input to be tokenized
     * @param minGram the smallest n-gram to generate
     * @param maxGram the largest n-gram to generate
     */
    public CustomEdgeNGramTokenFilter(TokenStream input, int minGram, int maxGram) {
        super(input);

        if (minGram < 1) {
            throw new IllegalArgumentException("minGram must be greater than zero");
        }
        if (minGram > maxGram) {
            throw new IllegalArgumentException("minGram must not be greater than maxGram");
        }

        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        while (true) {
            // Only the first gram emitted for an input token keeps that token's
            // original position increment; every following gram gets 0, so all
            // grams of one word share the same position (which is what makes
            // phrase queries work).
            int positionIncrement = 0;
            if (curTermBuffer == null) {
                if (!input.incrementToken()) {
                    return false;
                }
                positionIncrement = positionIncrementAttribute.getPositionIncrement();
                curTermBuffer = termAtt.buffer().clone();
                curTermLength = termAtt.length();
                curGramSize = minGram;
            }
            if (curGramSize <= maxGram && curGramSize <= curTermLength) {
                // grab gramSize chars from the front of the current term
                offsetAtt.setOffset(0, curGramSize);
                positionIncrementAttribute.setPositionIncrement(positionIncrement);
                termAtt.copyBuffer(curTermBuffer, 0, curGramSize);
                curGramSize++;
                return true;
            }
            // current term is exhausted; move on to the next input token
            curTermBuffer = null;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        curTermBuffer = null;
    }
}
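
To wire the custom factory into the schema, it would then replace the stock filter by its fully qualified class name, for example (a sketch; the com.example package is hypothetical, and note that the init() above only reads minGramSize and maxGramSize):

<filter class="com.example.CustomEdgeNGramTokenFilterFactory"
        minGramSize="2" maxGramSize="15"/>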