Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with query results

Posted 2024-11-17 04:18:49

I am currently using what I (mistakenly) thought would be a fairly straightforward implementation of Solr's NGramTokenizerFactory, but I'm getting strange results that are inconsistent between the admin analyzer and actual query results, and I'm hoping for some guidance.

I am trying to get user inputs to match my NGram (minGramSize=2, maxGramSize=2) index. My schema for indexing and query time is below, in which

  1. I strip all non-alphanumeric characters using PatternReplaceCharFilter.
  2. I tokenize with NGramTokenizerFactory.
  3. I lowercase using LowerCaseFilterFactory (which leaves non-letter tokens in place, so my numbers will remain).

Using the schema below, I would think that a search for "PCB-1260" (with a properly escaped dash) should match an indexed Ngram tokenized and lowercased value of "Arochlor-1260" (i.e., the bigrams for 1260 are "12 26 60" in both the indexed value and the queried value).
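
The expectation above can be checked outside Solr with a rough simulation of the three analysis steps. This is only a sketch in Python, not Solr's code: the function name `bigrams` is made up for illustration, and it assumes the char filter, tokenizer, and lowercase filter behave exactly as the numbered steps describe.

```python
import re

def bigrams(text):
    """Approximate the analyzer chain: strip, bigram-tokenize, lowercase."""
    # Step 1: drop every character that is not a letter or digit,
    # mirroring pattern="[^A-Za-z0-9]" with replacement="".
    stripped = re.sub(r"[^A-Za-z0-9]", "", text)
    # Step 2: emit every 2-character window (minGramSize=2, maxGramSize=2).
    grams = [stripped[i:i + 2] for i in range(len(stripped) - 1)]
    # Step 3: lowercase each gram; digits are unaffected.
    return [g.lower() for g in grams]

query_grams = bigrams("PCB-1260")
index_grams = bigrams("Arochlor-1260")
# Query bigrams that also occur in the indexed value.
shared = [g for g in query_grams if g in index_grams]
print(query_grams)  # ['pc', 'cb', 'b1', '12', '26', '60']
print(shared)       # ['12', '26', '60']
```

Only three of the six query bigrams occur in the indexed value, so whether the document matches also depends on whether the query requires all grams or just some of them, which this simulation does not model.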

Unfortunately, I get no results unless I delete the dash. [EDIT - even when I properly escape the dash and leave it in the query, I also get no results]. This seems odd because I'm doing a complete pattern replacement of all non-alphanumeric characters using PatternReplaceCharFilter, which I assumed would remove all whitespace and dashes.

The query analyzer in the admin page shows proper matching using the schema below - so I'm at a bit of a loss. Is there something fundamental about the PatternReplaceCharFilter or the NGramTokenizerFactory that I'm missing here?
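
One way to compare the admin analysis page with what the query parser actually builds is to ask Solr for the parsed query. The sketch below only assembles such a request URL in Python and does not contact a server. The host, port, core name `mycore`, and field name `name_ngram` are placeholders that do not come from the post; `debugQuery=true` is a standard Solr request parameter.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, field, user_input):
    """Build a select URL that asks Solr to report how it parsed the query."""
    # Backslash-escape the dash so the query parser treats it literally.
    escaped = user_input.replace("-", r"\-")
    params = {
        "q": f"{field}:{escaped}",
        "debugQuery": "true",  # adds a debug section to the response
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host, core, and field name; substitute real values.
url = debug_query_url("http://localhost:8983/solr/mycore", "name_ngram", "PCB-1260")
print(url)
```

The `parsedquery` entry in the debug section of the response shows the terms Solr actually searches for, which can then be compared with the token list from the analysis page.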

I've checked the code and other posts, but can't seem to figure this one out. After a week of banging my head against the wall, I submit this one to the authority of the stack....

<fieldType name="tokentext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Comments (1)

街角卖回忆 2024-11-24 04:18:49

So - something is definitely odd with PatternReplaceCharFilter failing to remove dashes at query time. Ultimately, I just did some pre-query processing of the user input in PHP with preg_replace before sending it to Solr, and - voilà! - it worked like a charm with the expected results. Puzzling that the PatternReplaceCharFilter wasn't behaving...

Here's the pre-query PHP code that I used to get rid of the dashes, in case anyone needs it.

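// Replace every dash in the raw user input with a space.
$pattern = '/([-])/';
$replacement = ' ';
$usrpar = preg_replace($pattern, $replacement, $raw_user_search_contents);
// Encode HTML special characters, including both quote styles.
$res = htmlentities($usrpar, ENT_QUOTES, 'utf-8');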

After that, I just passed $res to Solr...
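
For readers not using PHP, the same pre-query cleanup can be sketched in Python. This is an assumed equivalent, not the answerer's code: the function name `clean_user_query` is invented here, and `html.escape` with `quote=True` only approximates `htmlentities` with `ENT_QUOTES` (it escapes `&`, `<`, `>`, and both quote characters, not every named entity).

```python
import html
import re

def clean_user_query(raw):
    """Replace dashes with spaces, then HTML-escape the result."""
    # Same substitution as the PHP preg_replace call: dash -> space.
    no_dashes = re.sub(r"-", " ", raw)
    # Rough counterpart of htmlentities($usrpar, ENT_QUOTES, 'utf-8').
    return html.escape(no_dashes, quote=True)

print(clean_user_query("PCB-1260"))  # PCB 1260
```

Note that the dash becomes a space rather than disappearing, so the query reaches Solr as two whitespace-separated parts.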
