How to handle the AT&T token in a Solr index

Posted on 2024-12-03 03:21:54

I have an index in which a field contains the value AT&T. When I search on this field, I cannot put a literal & in the query, so it is URL-encoded as AT%26T. Searching for AT%26T returns nothing. Is there a way to use analyzers or filters to index this kind of term?

NOTE: I have used the WordDelimiter filter with preserveOriginal=1, but that didn't work.

Comments (4)

汹涌人海 2024-12-10 03:21:54

You can try searching for AT&T.

Otherwise, you can check on the admin/analysis page what happens to the term AT&T at the query and index stages. With verbose output on, you can see exactly what the analyzers do with your terms.

乙白 2024-12-10 03:21:54

Besides the reasons others have given, another possible cause is unescaped special characters. You should escape every character in this list:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

Try putting a backslash before the ampersand.

怪我入戏太深 2024-12-10 03:21:54

You need to tune WordDelimiter a bit further. See the adjustments I made for Jetwick to search for hashtags such as #java:

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/JetwickFilterFactory.java#L49

The background: AT&T is normally tokenized as AT and T, because '&' is removed since it is neither a digit nor a letter. With the class above you can have '&' handled as a digit, so that anything containing '&' is tokenized as 'AT&T' (and, I think, also as 'AT' and 'T'), but only if preserveOriginal=1. Alternatively you can handle '&' as a letter, but then I think it won't split into 'AT' and 'T', because every position of the string is detected as a letter.

BTW: you'll need to reindex and apply the same analyzer/tokenizer to the query string too!
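
For readers who would rather not write a custom filter factory, a similar effect can probably be had with the types attribute of solr.WordDelimiterFilterFactory, assuming a Solr version that supports it (3.1 or later, as far as I know). This is only a sketch and not the Jetwick code: the file name wdfftypes.txt is a placeholder, and the whitespace tokenizer is an assumption chosen so that AT&T reaches the filter as one token.

```xml
<!-- Sketch only: goes inside an <analyzer> element of a fieldType in schema.xml. -->
<!-- Assumed tokenizer: splits on whitespace, so "AT&T" arrives at the filter intact. -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<!-- "types" points at a character-type mapping file (placeholder name) in the conf directory. -->
<!-- That plain-text file would hold a single line:   & => ALPHA   -->
<!-- With "&" treated as a letter, "AT&T" should no longer be split at the ampersand. -->
<filter class="solr.WordDelimiterFilterFactory"
        types="wdfftypes.txt"
        generateWordParts="1"
        preserveOriginal="1"/>
```

Mapping '&' to DIGIT instead of ALPHA would correspond to the first variant described above, where the parts are still split and preserveOriginal="1" keeps the full AT&T token. Either way, the admin/analysis page is the place to confirm what the chain actually emits.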

哆兒滾 2024-12-10 03:21:54

Maybe you can try catenateWords="1", so that AT&T will also be indexed as ATT.
Also make sure your analyzer appears under both:

<analyzer type="query"> // defines how the query is parsed and split into tokens before searching

and

<analyzer type="index"> // defines how the field is indexed

If you have only a plain <analyzer> tag, then that analyzer is used at both query and index time.
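
Put together, a field type with separate index and query analyzers might look roughly like the sketch below. The field type name text_company and the choice of tokenizer and lowercase filter are assumptions for illustration, not something taken from the question; the filter attributes are the standard ones of solr.WordDelimiterFilterFactory.

```xml
<!-- Sketch only: "text_company" is a hypothetical name; adjust to your schema.xml. -->
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">

  <!-- Index time: keep the original token, its parts, and the joined form. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

  <!-- Query time: same chain, but without catenation, so a query for ATT -->
  <!-- is matched against the joined token produced at index time. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

</fieldType>
```

After changing the field type, the documents have to be reindexed before the new tokens become searchable.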
