Auto-completion with shingles and the termvector component

Posted 2024-10-16 18:14:34

One of the ways to go about Google-like auto-completion is to combine shingles and the termvector component in Solr 1.4.

First we generate all n-grams with the shingle filter and then use termvector to get the closest prediction for the user's term sequence (based on document frequency).

Schema:

<fieldType name="shingle_text_fivegram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Solr config:

<searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
        <bool name="terms">true</bool>
        <str name="terms.fl">shingleContent_fivegram</str>
    </lst>
    <arr name="components">
        <str>termsComponent</str>
    </arr>
</requestHandler>
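
For illustration, here is a minimal sketch of how a client might query the `/terms` handler configured above. The base URL, the helper name `build_terms_url`, and the default limit are assumptions, not part of the original setup; `terms.prefix`, `terms.limit` and `terms.sort` are standard TermsComponent parameters, while `terms` and `terms.fl` come from the handler defaults.

```python
from urllib.parse import urlencode


def build_terms_url(prefix, base_url="http://localhost:8983/solr", limit=10):
    """Build a request URL for the /terms handler (hypothetical helper).

    base_url is an assumed local Solr address; adjust it to your deployment.
    """
    params = {
        # The index analyzer lowercases tokens, so lowercase the prefix too.
        "terms.prefix": prefix.lower(),
        "terms.limit": limit,
        # Ask for the most frequent terms first.
        "terms.sort": "count",
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


if __name__ == "__main__":
    print(build_terms_url("India a"))
```

The returned terms are the indexed shingles that start with what the user has typed so far, ordered by how many documents contain them.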

With the above setup I need to drop stopwords anywhere on the edges of n-grams and keep them inside the n-gram sequence.

Let's say that from the sequence "india and china" I need the following shingles:

india
china
india and china

and skip the rest.

Is it doable in combination with other Solr components/filters?
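
To make the desired behavior concrete, the sketch below reproduces it outside Solr in plain Python. It is only a model of the requirement, not a Solr filter: the stopword set is a stand-in for `stopwords.txt`, the function names are hypothetical, and unigrams are included because the expected output above lists them.

```python
STOPWORDS = {"and", "the", "of"}  # assumed stand-in for stopwords.txt


def shingles(tokens, max_size=5):
    """Yield every contiguous run of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield tokens[start:start + size]


def edge_clean_shingles(text, max_size=5, stopwords=STOPWORDS):
    """Keep only shingles that neither start nor end with a stopword.

    Stopwords inside a shingle are preserved.
    """
    tokens = text.lower().split()
    kept = []
    for gram in shingles(tokens, max_size):
        if gram[0] in stopwords or gram[-1] in stopwords:
            continue
        kept.append(" ".join(gram))
    return kept


if __name__ == "__main__":
    print(edge_clean_shingles("india and china"))
    # ['india', 'china', 'india and china']
```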

Update: here is one possible solution in Lucene 4 (it should be possible to wire it into Solr):

"Couldn't you make a custom stop filter that only removed stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (capture/restoreState) but it seems doable?" -- Michael McCandless

from: http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html
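
The buffering idea in the quote can be sketched in Python as a streaming generator. This is not Lucene's implementation; `trim_edge_stopwords` is a hypothetical name, and the `pending` list only plays the role that captured token state would play in a real token filter.

```python
def trim_edge_stopwords(tokens, stopwords):
    """Drop stopwords at the start and end of a token stream, keep inner ones."""
    seen_content = False
    pending = []  # stopwords held back until a non-stopword follows them
    for token in tokens:
        if token in stopwords:
            if seen_content:
                # Might be an inner stopword; decide once we see what follows.
                pending.append(token)
            # Leading stopwords are dropped immediately.
        else:
            # A non-stopword arrived, so the buffered stopwords were inner ones.
            yield from pending
            pending.clear()
            seen_content = True
            yield token
    # Anything still buffered is a trailing stopword and is discarded.


if __name__ == "__main__":
    stream = ["the", "india", "and", "china", "of"]
    print(list(trim_edge_stopwords(stream, {"the", "and", "of"})))
    # ['india', 'and', 'china']
```

Applied to each shingle rather than to the whole input, the same check gives the edge-trimmed n-grams asked for in the question.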


Comments (2)

楠木可依 2024-10-23 18:14:34

The best way to do multi-word auto-complete in Solr 1.4 is with EdgeNGramFilterFactory, because you need to match the user's input as it is typed. So you need to match "i", "in", "ind" and so on to suggest India.
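
As a rough illustration of what edge n-grams look like, the sketch below generates the prefixes of a term in Python. The function name and default sizes are assumptions; `min_gram` and `max_gram` mirror the `minGramSize` and `maxGramSize` attributes of EdgeNGramFilterFactory.

```python
def edge_ngrams(term, min_gram=1, max_gram=15):
    """Return the lowercase prefixes of term with lengths min_gram..max_gram."""
    term = term.lower()
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


if __name__ == "__main__":
    print(edge_ngrams("India"))
    # ['i', 'in', 'ind', 'indi', 'india']
```

Indexing these prefixes lets a partially typed word such as "ind" match the document term directly, at the cost of a larger index.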

聽兲甴掵 2024-10-23 18:14:34

Use a separate query analyzer with the KeywordTokenizerFactory, thus (using your example):

        <analyzer type="index">
            <tokenizer class="solr.LowerCaseTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
            <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>