Tokenizing the result of the NGramFilterFactory in Solr (query analyzer)

Posted on 2025-01-04 11:14:04

I'm using the NGramFilterFactory for indexing and querying.

So if I search for "overflow", it creates a query like this:

mySearchField:"ov ve ... erflow overflo verflow overflow"

But if I misspell "overflow", e.g. as "owerflow", there are no matches because of the quotes around the query:

mySearchField:"ow we ... erflow owerflo werflow owerflow"

Is it possible to tokenize the result of the NGramFilterFactory, so that it creates a query like this:

mySearchField:"ow"
mySearchField:"we"
mySearchField:"erflow"
mySearchField:"owerflo"
mySearchField:"werflow"
mySearchField:"owerflow"

In this case Solr would also find results, because the token "erflow" exists in the index.

苦行僧 2025-01-11 11:14:04

You don't need to tokenize your query the way you described. First check that your schema.xml applies the NGramFilterFactory at both index time and query time.
Beyond that, the query parser you use makes the difference. With LuceneQParser you'd get the result you're looking for, but not with DisMax or eDisMax.
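
For reference, a field type with the n-gram filter in both analyzers might look like the sketch below. The name `ngramtext`, the tokenizer, and the gram sizes are assumptions for illustration, not values taken from the question.

```xml
<!-- Hypothetical field type: the name, tokenizer and gram sizes are assumed -->
<fieldType name="ngramtext" class="solr.TextField">
  <!-- Index time: split on whitespace, lowercase, emit all n-grams -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="8"/>
  </analyzer>
  <!-- Query time: same chain, so query n-grams line up with indexed ones -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="8"/>
  </analyzer>
</fieldType>
```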

I checked the query mySearchField:owerflow with eDisMax and debugQuery=on:

<str name="querystring">text:owerflow</str>
<str name="parsedquery">
+((text:o text:w text:e text:r text:f text:l text:o text:w text:ow text:we text:er text:rf text:fl text:lo text:ow text:owe text:wer text:erf text:rfl text:flo text:low text:ower text:werf text:erfl text:rflo text:flow text:owerf text:werfl text:erflo text:rflow text:owerfl text:werflo text:erflow text:owerflo text:werflow text:owerflow)~36)
</str>

If you look at the end of the generated query you'll see ~36, where 36 is the number of n-grams generated from your query. That ~36 means all 36 clauses must match, which is why you don't get any results. You can change it through the mm ("minimum should match") parameter.

If you change the request to mySearchField:owerflow&mm=1, or use any mm value lower than 25, you'll get the result you're looking for.
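
If passing mm on every request is inconvenient, it can also be set as a handler default in solrconfig.xml. The following is a minimal sketch, assuming a `/select` handler and the field name from the question; the value 1 is only an example and should be tuned to your data.

```xml
<!-- Sketch only: the handler name and the mm value are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use eDisMax so that mm is honored -->
    <str name="defType">edismax</str>
    <!-- Field to search when the query does not name one -->
    <str name="qf">mySearchField</str>
    <!-- Require at least one n-gram clause to match -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```

mm also accepts percentages (for example 50%), which may be easier to reason about than an absolute count when queries vary in length.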

The difference between this answer and yours is that with EdgeNGramFilterFactory an infix query like mySearchField:werflow doesn't return any result, while it does with NGramFilterFactory.

Anyway, if you're using the NGramFilterFactory for spelling correction, I'd strongly recommend having a look at the SpellCheckComponent as well, which is made exactly for that purpose.
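
As a rough illustration of that suggestion, a SpellCheckComponent definition in solrconfig.xml could look like the sketch below. The field `mySpellField` (a plain, non-n-gram copy of the text) and the field type `text_general` are hypothetical names, not part of the original setup.

```xml
<!-- Sketch only: mySpellField and text_general are hypothetical names -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- Field type whose query analyzer is applied to the input -->
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- Build suggestions from whole words, not from n-grams -->
    <str name="field">mySpellField</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

The component would still need to be attached to a request handler (for example through its last-components list) and enabled per request with spellcheck=true.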

蓝海似她心 2025-01-11 11:14:04

OK, I found a quick and easy way to solve the problem.

The fieldType has an optional attribute autoGeneratePhraseQueries (it defaults to true for schema versions below 1.4 and to false from 1.4 on). If I set autoGeneratePhraseQueries to false, everything works fine.

Explanation:

fieldType used in schema.xml:

<fieldType name="edgytext" class="solr.TextField" autoGeneratePhraseQueries="false">
  <!-- Index time: keep the whole value as one token, then emit its prefixes -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <!-- Query time: split on whitespace, then emit the prefixes of each word -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>

If you index the word "surprise", the following tokens are in the index: