Solr: Combining EdgeNGramFilterFactory and NGramFilterFactory
I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.
I am using NGramFilterFactory to perform a "contains"-style search with a minimum of 2 characters. I also want to search on the first letter, like a "startswith" search, using a front EdgeNGramFilterFactory.
I don't want to lower the NGramFilterFactory minimum to 1 character, as I don't want to index every single character.
Some help would be greatly appreciated.
Cheers
Comments (2)
定义由您决定。或者,如果您有更多问题,请告诉我,我可以进行编辑并进行澄清。You don't necessarily have to do all this in the same field. I would create a different fields using different custom types for each treatment so that you can apply the logic separately.
In the following:
text
contains the original tokens, minimally processed;text_ngram
uses the NGramFilter for your two-character-minimum tokenstext_first_letter
uses EdgeNGram for your one-character initial-letter tokensIf you're processing all
text
fields in this way, then you might be able to get away with using acopyField
to populate the fields. Otherwise, you can instruct your Solr client to send in the same field values for the three separate field types.When searching, include all of them in your searches with the
qf
parameter.Setting up
field
anddynamicField
definitions are left up to you. Or let me know if you have more questions and I can edit with clarifications.首先应用 EdgeNgramFilter,最小值 = 1,最大值 = 1000(我们希望包含整个原始标记)。示例:
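
To make the qf suggestion a little more concrete, here is a minimal Python sketch that builds the request parameters for a search across the three fields. The field names come from the answer. The helper name build_select_params and the use of defType=edismax are assumptions for illustration (qf is a parameter of the DisMax-family query parsers), so adjust them to your own setup.

```python
from urllib.parse import urlencode

# Field names taken from the answer above.
SEARCH_FIELDS = ["text", "text_ngram", "text_first_letter"]


def build_select_params(user_query):
    """Return URL-encoded parameters for a Solr select request.

    Hypothetical helper: it only assembles the query string and
    does not contact a Solr server.
    """
    params = {
        "q": user_query,
        # Assumption: an (e)dismax parser is used so that qf is honoured.
        "defType": "edismax",
        # qf takes a space-separated list of fields to search.
        "qf": " ".join(SEARCH_FIELDS),
    }
    return urlencode(params)


print(build_select_params("he"))
```

The resulting string can be appended to the select handler URL of whichever core or collection holds the three fields.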
Start by applying the EdgeNGramFilter with min = 1 and max = 1000 (we want the entire original token to be included). Example:
hello => 'h', 'he', 'hel', 'hell', 'hello'
Second, apply the NGramFilter with min = 2. (I will use 2 as the max in the example for simplicity.)
'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'
You will now have several identical tokens, because the NGramFilter has been applied to every "partial" token produced by the EdgeNGramFilter. Simply apply the RemoveDuplicatesTokenFilter to remove them.
'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'
Your field will now support a single-character "startsWith" query and a multi-character "contains" query.
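
The chain above can be checked with a small plain-Python simulation. This is only a sketch of the worked example, not of Lucene's actual filters, and the function names are made up for illustration. It makes two assumptions. First, a token shorter than the NGram minimum (here 'h') passes through unchanged, as the example shows. In a real Solr setup this may depend on the version and on options such as preserveOriginal, so it is worth verifying in the analysis screen. Second, duplicates are removed regardless of token position, whereas the real RemoveDuplicatesTokenFilter only drops duplicates that share a position.

```python
def edge_ngrams(token, min_gram=1, max_gram=1000):
    """Front-anchored prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def ngrams_keep_short(token, min_gram=2, max_gram=2):
    """All substrings of token with lengths min_gram..max_gram.

    ASSUMPTION: tokens shorter than min_gram are passed through
    unchanged, mirroring the worked example in the answer.
    """
    if len(token) < min_gram:
        return [token]
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def remove_duplicates(tokens):
    """Drop repeated tokens, keeping first-seen order (positions ignored)."""
    return list(dict.fromkeys(tokens))


def analyze(token):
    """EdgeNGram (1..1000), then NGram (2..2), then duplicate removal."""
    prefixes = edge_ngrams(token)
    grams = [g for prefix in prefixes for g in ngrams_keep_short(prefix)]
    return remove_duplicates(grams)


print(analyze("hello"))  # ['h', 'he', 'el', 'll', 'lo']
```

With these assumptions the output matches the final token list in the answer: the single-character prefix 'h' plus every two-character substring of 'hello'.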