Solr spellcheck problem

Posted on 2025-01-08 16:13:17

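I have a weird problem with Solr's spellcheck suggestions.

I search for a term like this (a product number, for example): 08p17a6

With this term, I find documents in my index.

I have enabled spellcheck=true, so besides the documents, Solr also returns a spellcheck suggestion in the XML response: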

<lst name="spellcheck">
    <lst name="suggestions">
        <lst name="p17a6">
            <int name="numFound">1</int>
            <int name="startOffset">2</int>
            <int name="endOffset">7</int>
            <arr name="suggestion">
                <str>08p17a6</str>
            </arr>
        </lst>
    </lst>
</lst>

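Solr strips the first two digits from my search term and gives me a suggestion based on "p17a6". I don't understand why it cuts off the first two digits for its suggestion.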
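The offsets in the response can be checked directly: characters 2 to 7 of the query are exactly "p17a6". The sketch below does that check and also shows one way such a token could arise, namely a tokenizing pattern that refuses to start a token on a digit. `TOKEN_PATTERN` and `spellcheck_tokens` are illustrative names. The pattern is modelled on what older Solr releases are reported to use in the default spellcheck query converter, which is an assumption and not confirmed for this setup.

```python
import re

# Assumption: a query-tokenizing pattern that will not start a token on a digit
# or on a "field:" prefix. It is not taken from the configuration in this post.
TOKEN_PATTERN = re.compile(r"(?:(?!(\w+:|\d+)))\w+")


def spellcheck_tokens(query):
    """Return (token, start_offset, end_offset) for each match in the query."""
    return [(m.group(0), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(query)]


query = "08p17a6"

# The response reports startOffset=2 and endOffset=7 for the token "p17a6".
print(query[2:7])                # prints p17a6
print(spellcheck_tokens(query))  # prints [('p17a6', 2, 7)]
```

If the token really is produced this way, the leading "08" is never handed to the spellchecker, which would match the `startOffset` of 2 in the response.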

Things get even weirder if I enable spellcheck.collate:

<lst name="spellcheck">
    <lst name="suggestions">
        <lst name="p17a6">
            <int name="numFound">1</int>
            <int name="startOffset">2</int>
            <int name="endOffset">7</int>
            <arr name="suggestion">
                <str>08p17a6</str>
            </arr>
        </lst>
        <str name="collation">0808p17a6</str>
    </lst>
</lst>

I need spellcheck.collate to get suggestions for queries with multiple search terms. But as you can see, the XML response suggests that I use "0808p17a6".
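The collation value is consistent with the suggestion being spliced into the original query at the reported offsets, so the untouched prefix "08" stays in front of the full suggestion. Here is a minimal sketch of that reading, assuming the collation is built purely from `startOffset` and `endOffset`. The function `collate` is a hypothetical helper, not Solr's API.

```python
def collate(query, corrections):
    """Splice each (start, end, suggestion) into the query by character offset."""
    result = query
    # Apply from right to left so earlier offsets stay valid after each splice.
    for start, end, suggestion in sorted(corrections, reverse=True):
        result = result[:start] + suggestion + result[end:]
    return result


# Values from the response above: token "p17a6" at offsets 2..7,
# suggestion "08p17a6".
print(collate("08p17a6", [(2, 7, "08p17a6")]))  # prints 0808p17a6
```

Under this assumption the doubled "08" is not a second suggestion. It is the part of the query that was outside the token, followed by a suggestion that already contains it.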

Does anyone know how this happens?

Edit:

Here is my schema configuration for the spellcheck:

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
<copyField source="title"    dest="spell" />
<copyField source="subTitle" dest="spell" />
<copyField source="content"  dest="spell" />

The source fields of the copyFields are configured like this:

<field name="title"       type="text"   indexed="true"  stored="true" termVectors="true" omitNorms="true" />
<field name="subTitle"    type="text"   indexed="true"  stored="true" termVectors="true" omitNorms="true" />
<field name="content"     type="text"   indexed="true"  stored="true" termVectors="true" />

This is the configuration for the analyzers:

        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"
        />
        <!-- best practice (currently) for synonyms is to add them by
            expansions during index time
        -->
        <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- Case insensitive stop word removal.
            add enablePositionIncrements=true in both the index and query
            analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>

        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"
        />
        <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

<!-- Setup simple analysis for spell checking -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>

        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />

        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

