Tokenizing the result of NGramFilterFactory in Solr (query analyzer)
I'm using the NGramFilterFactory for indexing and querying.
So if I search for "overflow", it creates a query like this:
mySearchField:"ov ve ... erflow overflo verflow overflow"
But if I misspell "overflow" as "owerflow", there are no matches because of the quotes around the query:
mySearchField:"ow we ... erflow owerflo werflow owerflow"
Is it possible to tokenize the result of the NGramFilterFactory so that it creates a query like this:
mySearchField:"ow"
mySearchField:"we"
mySearchField:"erflow"
mySearchField:"owerflo"
mySearchField:"werflow"
mySearchField:"owerflow"
In this case Solr would also find results, because the token "erflow" exists.
Comments (2)
You don't need to tokenize your query the way you describe. Check that your schema.xml applies the NGramFilterFactory at both index time and query time.
Then, the query parser you're using makes the difference. With LuceneQParser you'd get the result you're looking for, but not with DisMax and eDisMax.
I checked the query mySearchField:owerflow with eDisMax and debugQuery=on. If you look at the end of the generated query, you'll see ~36, where 36 is the number of n-grams generated from your query. You don't get any results because of that ~36, but you can change it through the mm parameter, which is the minimum-should-match setting.
If you change the query to mySearchField:owerflow&mm=1, or a value lower than 25, you'll have the result you're looking for.
The difference between this answer and yours is that with EdgeNGramFilterFactory an infix query like mySearchField:werflow doesn't return any result, while it does with NGramFilterFactory.
Anyway, if you're using the NGramFilterFactory for spelling correction, I'd strongly recommend having a look at the SpellCheckComponent as well, which was made exactly for that purpose.
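To make the mm suggestion from the answer above concrete, here is a minimal Python sketch of how the request parameters could be assembled. The host, port, and core name (localhost:8983, mycore) are placeholders that do not come from the original post. The point is only that mm is a separate request parameter next to q, not part of the query string itself.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_search_url(term, mm="1"):
    """Build a select URL that queries mySearchField with eDisMax and a given mm."""
    params = {
        "q": f"mySearchField:{term}",  # the fielded query from the answer
        "defType": "edismax",          # use the eDisMax query parser
        "mm": mm,                      # minimum number of clauses that should match
        "debugQuery": "on",            # include the parsed query in the response
    }
    # urlencode percent-escapes reserved characters such as ":" in the values.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_search_url("owerflow")
print(url)
```

With debugQuery=on, the parsed query in the response should show whether the ~N suffix changed as intended.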
OK, I found a quick and easy way to solve the problem.
The fieldType has an optional attribute autoGeneratePhraseQueries (Default=true). If I set autoGeneratePhraseQueries to false, everything works fine.
Explanation:
fieldType used in schema.xml:
If you are indexing the word "surprise", the following tokens are in the index:
If you search for "surpriese" (misspelled), Solr creates the following tokens (matching tokens are bold):
The real query that will be created looks like this:
But if you set autoGeneratePhraseQueries=true, the following query will be created:
This is a phrase query and does not match the indexed terms.
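The effect described in this answer can be illustrated with a small Python simulation. This is a simplified model, not Solr's actual analysis chain. The gram sizes (2 to 8) are an assumption, since the original fieldType is not shown here. The sketch also assumes that the individual grams are combined with OR when no phrase query is generated. For the phrase case it checks only a necessary condition: a phrase query cannot match if any of its terms is missing from the field.

```python
def ngrams(term, min_gram=2, max_gram=8):
    """Return all character n-grams of term with min_gram..max_gram characters."""
    return [
        term[start:start + size]
        for start in range(len(term))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(term)
    ]


# Grams stored at index time for the correctly spelled word.
indexed = set(ngrams("surprise"))

# Grams produced at query time for the misspelled word.
query = ngrams("surpriese")

# Query grams that also exist in the index.
shared = sorted(g for g in set(query) if g in indexed)

# Without phrase generation (assumed OR semantics): one shared gram is enough.
matches_as_terms = any(g in indexed for g in query)

# With phrase generation: every gram must be present (necessary condition only;
# real phrase queries also require matching positions).
phrase_possible = all(g in indexed for g in query)

print("shared grams:", shared)
print("match as separate terms:", matches_as_terms)
print("phrase match possible:", phrase_possible)
```

Because the misspelling introduces grams such as "ie" that were never indexed, the phrase form has no chance of matching, while the term form still finds the document through the grams both words have in common.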