How to handle the AT&T token in a Solr index
I have an index in which a field contains the value AT&T. When I search on this field, I cannot put a literal & sign in the query, so it is encoded as AT%26T. Searching for AT%26T returns nothing.
Is there a way to use an analyzer or filters to index this kind of term?
NOTE: I have used the WordDelimiter filter with preserveOriginal=1, but that didn't work.
Comments (4)
You can try searching for
AT&T
Otherwise, use the analysis page in the Solr admin (admin/analysis) to find out what happens to the term AT&T at the index and query stages. With verbose output on, you can see exactly what each analyzer step does to your terms.
Another possible cause, besides those already mentioned, is unescaped special characters. You should escape everything on this list:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
Try putting a backslash before the ampersand.
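The two layers involved here, query-syntax escaping and URL encoding, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `escape_query_term` and `encode_q_param` are hypothetical helper names, not part of any Solr client, and escaping a lone `&` or `|` goes slightly beyond the list above, which only names `&&` and `||`. The `%26` form is just how `&` travels inside a URL; the server normally decodes it back to `&` before the query parser sees it.

```python
from urllib.parse import quote, unquote

# Characters taken from the list above. A lone '&' or '|' is included as an
# assumption, so that "&&" and "||" are covered character by character.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_term(term: str) -> str:
    """Put a backslash in front of every special character in the term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def encode_q_param(term: str) -> str:
    """Escape the term for the query parser, then percent-encode it for the URL."""
    return "q=" + quote(escape_query_term(term), safe="")


print(escape_query_term("AT&T"))   # AT\&T
print(encode_q_param("AT&T"))      # q=AT%5C%26T
print(unquote("AT%26T"))           # AT&T  (what the server sees after decoding)
```

If the client library already percent-encodes parameters, pass it the escaped term only; encoding twice would send the literal text `AT%26T` as the search term.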
You need to tune WordDelimiter a bit further. See the adjustments I made for Jetwick to search for hashtags à la #java:
https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/JetwickFilterFactory.java#L49
Background: AT&T is normally tokenized as AT and T, because '&' is removed as it is neither a digit nor a letter. With the class above you can have the '&' sign handled as a digit, and anything containing '&' will then be tokenized as 'AT&T' (and, I think, also as 'AT' and 'T'), but only if preserveOriginal=1. Alternatively you can handle '&' as a letter, but then I think it won't split into 'AT' and 'T', since every position of the string is detected as a letter.
BTW: you'll need to reindex and apply the same analyzer/tokenizer to the query string too!
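To make the splitting behaviour described above concrete, here is a toy model in Python. It is not Solr's WordDelimiter implementation: the function name `word_delimiter` and its parameters `preserve_original` and `extra_word_chars` are made up for illustration, and the real filter applies further rules (case changes, letter/digit transitions) that are left out here.

```python
import re


def word_delimiter(token, preserve_original=False, extra_word_chars=""):
    """Toy model of delimiter splitting.

    Every character that is not a letter, a digit, or listed in
    extra_word_chars is treated as a delimiter and dropped.
    """
    pattern = "[A-Za-z0-9" + re.escape(extra_word_chars) + "]+"
    parts = re.findall(pattern, token)
    tokens = []
    # Keep the unsplit token as well, unless splitting changed nothing.
    if preserve_original and parts != [token]:
        tokens.append(token)
    tokens.extend(parts)
    return tokens


print(word_delimiter("AT&T"))                          # ['AT', 'T']
print(word_delimiter("AT&T", preserve_original=True))  # ['AT&T', 'AT', 'T']
print(word_delimiter("AT&T", extra_word_chars="&"))    # ['AT&T']
```

The last call mirrors the "handle '&' as a letter" variant: the token is no longer split, so only the full form is searchable.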
Maybe you can try catenateWords="1", so that AT&T will also be indexed as ATT.
Also make sure your analyzer appears under both:
<analyzer type="index">
and
<analyzer type="query">
If you only have the plain <analyzer> tag, then the analyzer will be used at both query and index time.
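The effect of catenating word parts, and of running the same analysis at index and query time, can be illustrated with a small Python sketch. It is a simplified stand-in rather than Solr code: `analyze` and `matches` are hypothetical names, and a match is modelled simply as the two token sets sharing at least one token.

```python
import re


def analyze(text, catenate_words=True):
    """Toy analyzer: split on non-alphanumerics, optionally add the joined form."""
    tokens = set()
    for raw in text.split():
        parts = re.findall(r"[A-Za-z0-9]+", raw)
        tokens.update(parts)
        if catenate_words and len(parts) > 1:
            tokens.add("".join(parts))  # e.g. AT + T -> ATT
    return tokens


def matches(indexed_text, query_text, index_analyzer=analyze, query_analyzer=analyze):
    """A document matches if the two analyzers produce at least one common token."""
    return bool(index_analyzer(indexed_text) & query_analyzer(query_text))


# Same analysis on both sides: the catenated form ATT is in the index.
print(matches("AT&T", "ATT"))  # True

# Index built without catenation: the query token ATT has nothing to hit.
no_catenation = lambda text: analyze(text, catenate_words=False)
print(matches("AT&T", "ATT", index_analyzer=no_catenation))  # False
```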