FAST ESP character normalization
I'm running a search application on a FAST ESP server, and I have a problem with character normalization.
What I want is to search for 'wurth' and get a hit in 'würth'.
I've tried configuring the following in esp/etc/tokenizer/tokenization.xml:
<normalizationlist name="German to Norwegian">
  <normalization description="German u with diaeresis, to Norwegian u">
    <input>x75</input>
    <output>xFC</output>
    <output>x75</output>
  </normalization>
</normalizationlist>
But of course, this translates every u to ü, which is useless.
How do I configure this the right way?
The solution is to normalize every "special character" to the same "normal character":
ö -> o
ø -> o
å -> a
ä -> a
æ -> a
This is a bit time-consuming, but it works!
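
In the tokenization.xml format from the question, that folding could look roughly like the sketch below. The list name and descriptions are mine; the hex values are the standard Latin-1/Unicode code points for each character. Note that the direction is the reverse of the attempt in the question: the special character is the <input> and the plain character is the <output>, so only ü gets rewritten, not every u.

<normalizationlist name="Fold special characters">
  <normalization description="o with diaeresis (ö) to o">
    <input>xF6</input>
    <output>x6F</output>
  </normalization>
  <normalization description="o with stroke (ø) to o">
    <input>xF8</input>
    <output>x6F</output>
  </normalization>
  <normalization description="a with ring (å) to a">
    <input>xE5</input>
    <output>x61</output>
  </normalization>
  <normalization description="a with diaeresis (ä) to a">
    <input>xE4</input>
    <output>x61</output>
  </normalization>
  <normalization description="ae ligature (æ) to a">
    <input>xE6</input>
    <output>x61</output>
  </normalization>
  <normalization description="u with diaeresis (ü) to u">
    <input>xFC</input>
    <output>x75</output>
  </normalization>
</normalizationlist>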
Read the Advanced Linguistics Guide. It contains a chapter on Character Normalization. When you follow the steps from the guide, all special characters will be treated as normal characters, so searching for über will give the same results as searching for uber.
You can also install custom dictionaries, available from MS support, which provide a dictionary for each language. So if you install the German one, the search engine will understand what you are trying to search for through the "did you mean" feature; you can enable it for search queries once the dictionary is installed. Also, don't forget to set up the search schema correctly, with the proper character encoding for multi-language support. If the documents in the collection are not indexed with the proper character encoding, any effort you make at the tokenization and query ends is useless.
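
On that last point, a minimal illustration (the document structure here is hypothetical, not the actual FAST ESP feeding format): the encoding a fed document declares must match its actual byte encoding, because if Latin-1 bytes are indexed as UTF-8 (or vice versa), characters like ü arrive corrupted and no normalization rule at the tokenization or query end can repair them.

<?xml version="1.0" encoding="UTF-8"?>
<!-- the declared encoding must match the actual bytes of the file -->
<document>
  <title>Würth</title>
</document>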