Solr, special characters, and Latin-to-Cyrillic character conversion
I am trying to set up a search engine using Solr (or Lucene) whose text may be Latin with special characters (for example Ö or Ç) or Cyrillic (for example Б, б, Ж, ж).
I am trying to find a way to let users search for words containing these characters even when their keyboards lack the corresponding keys.
Example would be (making up words here, hopefully won't offend anyone):
- "BÖÖK" would be found when searching for "book"
- "ЖRAY" would be found when searching for XRAY
- "ЖRAY" would also be found when searching for ZRAY, ZHRAY, or žray (see GOST 16876-71 for information on transliterating Cyrillic to Latin characters).
So, how should I go about this? Some theories I have are:
- allow multiple text fields to be stored for each original string: one in original form, one after a first pass of transliteration (which, for example, would convert Ö to just O, and Ж to ž but also to X), and a third after a second pass (from ž to z or zh) -> means I will be storing a LOT of data...
- store in Solr as is, and let Solr do the magic -> don't know how well this will work... can't see anything in Solr to do this
- Magic bullet I have not found yet...
Any ideas? Anyone tried this before?
Comments (2)
Take a look at Solr's Analyzers, Tokenizers, and Token Filters, which give a good introduction to the type of manipulation you're looking for.
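To make the idea of a token filter concrete, here is a rough Python sketch (not Solr code) of a filter that expands one token into several transliterated variants at index time, so a single field can match ZRAY, ZHRAY, žray, or XRAY. The mapping table `CYRILLIC_VARIANTS` and the function `expand_token` are hypothetical names invented for this illustration, and the table is deliberately tiny; a real one would follow GOST 16876-71 or a similar scheme. In Solr itself, a mapping-based char filter such as MappingCharFilterFactory might play a similar role, but check the documentation for your version.

```python
from itertools import product

# Hypothetical, partial mapping for illustration only.
# "x" is included because the question wants the look-alike XRAY to match too.
CYRILLIC_VARIANTS = {
    "ж": ["ž", "zh", "z", "x"],
    "б": ["b"],
}


def expand_token(token):
    """Return the lowercased token plus every Latin variant from the table."""
    token = token.lower()
    # For each character, list its possible replacements (or itself if unmapped).
    choices = [CYRILLIC_VARIANTS.get(ch, [ch]) for ch in token]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.add(token)  # keep the original Cyrillic form searchable
    return variants


# Terms that would be indexed for the made-up word from the question.
index_terms = expand_token("ЖRAY")

# Queries only need lowercasing to hit one of the indexed variants.
for query in ["ZRAY", "ZHRAY", "žray", "XRAY", "ЖRAY"]:
    print(query, query.lower() in index_terms)
```

Expanding at index time keeps the query side simple, at the cost of more indexed terms; a word with several mapped characters multiplies the number of variants, so a real table would need to be kept small or applied selectively.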
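You need to use an accent filter in both your index-time and query-time text analysis; it converts accented characters to their unaccented ASCII equivalents.
You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory, depending on the Solr version you are using. ISOLatin1AccentFilterFactory is the older of the two and was later deprecated in favor of ASCIIFoldingFilterFactory.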
e.g.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
So:
"BÖÖK" would be converted and indexed as "book" in Solr (assuming the analysis chain also lowercases tokens, since accent folding alone yields "BOOK").
This would let users search for either book or BÖÖK and still get the document back.
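As a rough illustration of what accent folding plus lowercasing does to a token, here is a small Python sketch. It is not how Solr implements its filters; it approximates the effect with Unicode decomposition, and the function names `fold_accents` and `analyze` are made up for this example. The key point is that the same analysis runs on both the indexed text and the query.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters, drop combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def analyze(text):
    """Toy analysis chain: whitespace tokenizer followed by the folding filter."""
    return [fold_accents(token) for token in text.split()]


# The indexed form and the query form end up identical.
print(analyze("BÖÖK"))
print(analyze("BÖÖK") == analyze("book"))
```

Note that this approach only removes diacritics from characters that decompose into a base letter plus a mark. A Cyrillic letter such as Ж is left as it is (just lowercased), so transliteration to Latin needs a separate mapping step.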