Zend Lucene fails all searches with special characters
If anyone knows a simple answer to this, I won't have to wade through creating an extra index with escaped strings and crying my eyes out while littering my pretty code.
Basically, the Lucene search we have running cannot handle any non-letter characters. Spaces, percent signs, dots, dashes, slashes, you name it. This is highly infuriating, because I cannot search for items containing these characters, whether I escape them or not.
I have two options: kill these characters in a separate index and strip them from the names I'm searching for, or stop goddamn searching.
Comments (3)
You can escape special characters using '/'. Lucene treats some characters as special, and you will have to escape those characters to make it work.
If you want to search for "2+3", the query should be "2/+3".
Use QueryParser.escape(String s) to escape the query string.
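To make the escaping advice concrete, here is a minimal Java sketch of what an escape helper of this kind does: it puts a backslash in front of every character the classic Lucene query syntax treats as an operator. The class name EscapeSketch and the method escapeQuery are hypothetical names made up for illustration; this is not the library's own code, and the character list is an assumption based on the 2.9-era query syntax, so check it against the version you actually run.

```java
public class EscapeSketch {
    // Assumed operator characters of the classic Lucene query syntax.
    // The two-character operators && and || are covered by escaping
    // every '&' and '|' individually.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical helper: prefixes each special character with a backslash.
    public static String escapeQuery(String raw) {
        StringBuilder out = new StringBuilder(raw.length() * 2);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeQuery("2+3"));     // prints 2\+3
        System.out.println(escapeQuery("abc-def")); // prints abc\-def
        System.out.println(escapeQuery("$5"));      // prints $5 ($ is not in the list)
    }
}
```

Note that escaping only stops the query parser from reading a character as an operator. It does not change how the analyzer splits text into tokens at index time.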
According to http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#-
the escape character is the backslash (\), not the forward slash (/).
And to answer Ankit, $ doesn't seem to have to be escaped since it's not a special character.
Escaping the dash as suggested by Ralph doesn't make a difference for me (Zend Lucene). You'd think that when a word 'abc-def' is indexed and you search for 'abc-def' you'll somehow find that word, regardless of whether the dash is ignored at the indexing step or not. Same input should have same result. The word seems to be indexed as two separate tokens 'abc' and 'def'. Yet searching for 'abc-def' gives no results when 'abc def' does.
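The two-token behaviour described here can be illustrated with a small Java sketch of a letters-only, lowercasing tokenizer. It rests on the assumption that the analyzer in use keeps only runs of letters, which would match the symptoms in this thread; LetterTokenizerSketch and tokenize are hypothetical names, not Zend or Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerSketch {
    // Hypothetical tokenizer: every run of letters becomes one lowercase
    // token; any other character (dash, space, digit, %) is a separator.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc-def")); // prints [abc, def]
        System.out.println(tokenize("abc def")); // prints [abc, def]
        System.out.println(tokenize("2+3"));     // prints []
    }
}
```

Under that assumption, 'abc-def' and 'abc def' produce the same tokens at index time, so the index itself never contains the dash. The sketch does not model the query parser: whether the query 'abc-def' matches depends on how the parser treats the dash before the text ever reaches the analyzer.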