Using multiple tokenizers in Solr
What I want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index.
I currently have a Solr schema that has been modified so that queries return results regardless of case. So, if I search for iPOd, I will see iPod returned. The code to do this is:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  ...
</fieldType>
I have found this code that will allow us to do a partial word match query, but I don't think I can have two tokenizers on one field.
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  ...
</fieldType>
So how can I apply this tokenizer to the field as well?
Or is there a way to merge them?
Or is there another way I can accomplish this task?
Comments (2)
Declare another fieldType (i.e. a different name) that has the NGram tokenizer, then declare a field that uses the fieldType with NGram and another field with the standard "text" fieldType. Use copyField to copy one to another. See Indexing same data in multiple fields.
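A minimal sketch of that setup, under some assumptions: the names `text_ngram`, `name`, and `name_ngram` are hypothetical and not from the question, the gram sizes are copied from the question's snippet, and the query-side analyzer is one plausible choice rather than a required one.

```xml
<!-- Hypothetical second field type; the existing "text" type stays as it is -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <!-- Split the indexed text into 3..15 character grams, then lowercase them -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Assumption: do not n-gram the query, only lowercase it -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names: one field per analysis strategy -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_ngram" type="text_ngram" indexed="true" stored="false"/>

<!-- Index the same source text into the n-gram field as well -->
<copyField source="name" dest="name_ngram"/>
```

A query can then target both fields, for example `name:ipod OR name_ngram:ipo`, so that whole-word matches and partial-word matches are both found.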
An alternative would be to apply the EdgeNGramFilterFactory to the existing field and stay with your current tokenizer (WhitespaceTokenizerFactory). This would keep your current schema unchanged, i.e. you would not need an additional field that uses another tokenizer (NGramTokenizerFactory). The filter is simply added to the analyzer chain of your existing field type.
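As a rough sketch of what such a field type might look like: the gram sizes and the position of the filter at the end of the chain are assumptions, not values given in the answer, and the rest of the chain is copied from the question.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <!-- Assumed sizes: emit the leading 3..15 characters of each token as extra terms -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <!-- query analyzer left as in the existing schema (elided in the question) -->
</fieldType>
```

Note that edge n-grams only cover the beginning of each token, so this matches word prefixes (`ipo` finds `ipod`) but not fragments from the middle of a word, which the NGramTokenizerFactory approach would also match. Because the grams are produced after stemming here, it is worth checking the result in the Solr analysis screen before relying on it.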