Approximate search with OpenLDAP
I am trying to write a search that queries our directory server running OpenLDAP.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez, it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use, because while the results contain both Perez and Pérez occurrences, I also get some results that apparently have nothing to do with the query).
In Spanish this happens quite a lot. Be it laziness or whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents, because it's assumed all these searches work with both options (I guess since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database to remove all accents and stripping them from the query, can you think of another solution?
Comments (2)
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued, including cn, you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least still have the original value to go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
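A minimal sketch of the "second value" idea in Python, assuming the converted value is produced by Unicode decomposition (NFD) and dropping the combining marks. The helper names `strip_accents` and `cn_values` and the `norm:` prefix are hypothetical, chosen only for illustration; none of them come from OpenLDAP.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def cn_values(name, prefix="norm:"):
    """Values to store on the multi-valued cn attribute.

    The original spelling is kept as the first value; the second value is
    an accent-free copy carrying a (hypothetical) marker prefix.
    """
    return [name, prefix + strip_accents(name)]
```

Note that this kind of stripping also turns ñ into n, which may or may not be what you want for Spanish names.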
Then modify your query to use an or expression that matches either the original or the converted value.
And you would include a valuesReturnFilter with the request so that the converted values are filtered out of the results.
See RFC 3876 (http://www.networksorcery.com/enp/rfc/rfc3876.txt) for details. The method for adding a request control varies by what platform/library you are using to access the directory.
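One possible shape for the or expression, sketched in Python and assuming the converted value was stored with a hypothetical `norm:` prefix. `escape_filter_value`, `strip_accents` and `build_name_filter` are illustrative helpers, not part of any LDAP library; the escaping follows the backslash-plus-two-hex-digits rule from RFC 4515. The valuesReturnFilter control itself is not shown, since how it is attached depends on the client library.

```python
import unicodedata

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value):
    """Replace filter metacharacters with their backslash-hex escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def strip_accents(text):
    """Return text with combining marks removed (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_name_filter(term, prefix="norm:"):
    """OR filter: the term as typed, or its accent-free copy behind the prefix."""
    typed = escape_filter_value(term)
    base = escape_filter_value(strip_accents(term))
    marker = escape_filter_value(prefix)
    return f"(|(cn=*{typed}*)(cn={marker}*{base}*))"
```

For the term Pérez this builds `(|(cn=*Pérez*)(cn=norm:*Perez*))`, and for Perez it builds `(|(cn=*Perez*)(cn=norm:*Perez*))`, so both spellings reach the same converted value.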
Search filters ("queries") are specified by RFC 2254 (since superseded by RFC 4515).
Encoding:
RFC 2254 actually requires filter components to be (indirectly defined) OCTET STRINGs, i.e. strings of 8-bit bytes: AttributeValue is an OCTET STRING, MatchingRuleId and AttributeDescription are LDAPString, and LDAPString is an OCTET STRING.
The standard on escaping: use "\<ASCII HEX NUMBER>" (a backslash followed by two hexadecimal digits) to replace special characters (https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
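A rough Python sketch of that suggestion: escape the characters that would change the filter's meaning, then replace every run of non-ASCII characters with a wildcard. The function names and the choice of the cn attribute are assumptions made for illustration only.

```python
import re

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value):
    """Replace filter metacharacters with their backslash-hex escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def wildcard_non_ascii(term):
    """Escape the ASCII parts and join them with '*' where non-ASCII runs were."""
    parts = re.split(r"[^\x00-\x7f]+", term)
    return "*".join(escape_filter_value(part) for part in parts)


def build_wildcard_filter(term):
    """Substring filter on cn with non-ASCII characters widened to wildcards."""
    pattern = "*" + wildcard_non_ascii(term) + "*"
    # Collapse adjacent wildcards; literal asterisks were already escaped as \2a.
    pattern = re.sub(r"\*+", "*", pattern)
    return f"(cn={pattern})"
```

With this sketch, Pérez becomes `(cn=*P*rez*)`, which matches both Pérez and Perez but also anything else that fits the pattern. A term typed without the accent (Perez) is left unchanged, so on its own this does not find Pérez.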