Multilingual search with Lucene
I am building a multilingual search and will use Lucene to do it.
I already have the translated content; each document will exist in 3 or 4 languages.
For indexing and search, I see four possible strategies for each document:
- Each language is indexed in a separate index/directory.
- Each language is indexed as a separate document, but in the same index.
- Each language is indexed in a separate field, but in the same document.
- All languages are indexed in the same field of a single document.
I have not tested any of these yet. Could anyone with experience tell me which is the better way to do multilingual search?
Thanks!
Comments (2)
Although the question was asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider when evaluating the different approaches:
If (1.) & (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages get mixed together (regardless of whether you index your multilingual content as one document or as multiple documents). It may be useful to know that adding "n" language-specific fields does not result in an index "n" times larger, although it does come with some overhead.
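To make the "mixed up" statistics concrete, here is a minimal plain-Java sketch with no Lucene dependency. It counts in how many documents the token "die" appears (an English verb, but a very common German article), first with one shared field and then with one field per language. The class name, the field names `body` and `body_<lang>`, and the sample sentences are all made up for illustration, and the whitespace tokenizer is a deliberate simplification of what a real analyzer does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only: a toy document-frequency counter, not Lucene code.
class TermStatsDemo {

    // A piece of text tagged with its language code.
    static final class Doc {
        final String lang;
        final String text;
        Doc(String lang, String text) { this.lang = lang; this.text = text; }
    }

    // Hypothetical sample corpus: three English and three German documents.
    static final List<Doc> DOCS = Arrays.asList(
        new Doc("en", "the hero will die"),
        new Doc("en", "a quiet day"),
        new Doc("en", "rain again today"),
        new Doc("de", "die katze spielt"),
        new Doc("de", "die sonne scheint"),
        new Doc("de", "die nacht ist kalt"));

    // Returns, per field, the number of documents containing the term.
    // sharedField = true  : every document goes into the single field "body"
    // sharedField = false : each document goes into "body_<lang>"
    static Map<String, Integer> docFreq(List<Doc> docs, String term, boolean sharedField) {
        Map<String, Integer> df = new TreeMap<>();
        for (Doc d : docs) {
            String field = sharedField ? "body" : "body_" + d.lang;
            // Naive analysis: lowercase, split on whitespace, de-duplicate.
            Set<String> tokens = new HashSet<>(
                Arrays.asList(d.text.toLowerCase(Locale.ROOT).split("\\s+")));
            if (tokens.contains(term)) {
                df.merge(field, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        System.out.println(docFreq(DOCS, "die", true));   // {body=4}
        System.out.println(docFreq(DOCS, "die", false));  // {body_de=3, body_en=1}
    }
}
```

With the shared field, "die" looks like a common term (4 of 6 documents), so an English query for it would be weighted as if it were nearly a stop word. With per-language fields it is rare in English (1 of 3) and ubiquitous in German (3 of 3), which is the behaviour a frequency-based ranking function expects.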
Single Field (Strategies 2 & 4)
Multiple Fields (Strategy 3)
Multiple Indices (Strategy 1)
Regardless of whether you use a single field or multiple fields, if you index your content as multiple documents, your solution may need to collapse results that match in the "wrong" language. One approach is to add a language field and filter on it.
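The two ideas in that paragraph, filtering on a language field and collapsing several language versions of the same content into one result, can be sketched in plain Java as follows. This is only an illustration of the logic applied to a list of hits; the `Hit` class, the `contentId` key that ties translations together, and the sample data are assumptions, not part of any Lucene API. In Lucene itself the language would typically be indexed as an untokenized field (for example a `StringField`) and applied as a non-scoring filter clause at query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: post-processing of search hits, not Lucene code.
class LanguageCollapseDemo {

    // One search hit: which content it belongs to, its language, its score.
    static final class Hit {
        final String contentId;
        final String lang;
        final double score;
        Hit(String contentId, String lang, double score) {
            this.contentId = contentId;
            this.lang = lang;
            this.score = score;
        }
    }

    // Hypothetical hits, ordered by descending score.
    static final List<Hit> SAMPLE = Arrays.asList(
        new Hit("doc1", "de", 2.0),
        new Hit("doc1", "en", 1.5),
        new Hit("doc2", "fr", 1.2),
        new Hit("doc2", "de", 0.9));

    // Strict variant: keep only hits in the requested language.
    static List<Hit> filterByLanguage(List<Hit> hits, String lang) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (h.lang.equals(lang)) {
                out.add(h);
            }
        }
        return out;
    }

    // Lenient variant: one hit per contentId, preferring the user's
    // language and falling back to the highest-scoring other version.
    static List<Hit> collapse(List<Hit> hits, String preferredLang) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            Hit current = best.get(h.contentId);
            if (current == null || isBetter(h, current, preferredLang)) {
                best.put(h.contentId, h);
            }
        }
        return new ArrayList<>(best.values());
    }

    private static boolean isBetter(Hit a, Hit b, String preferredLang) {
        boolean aPreferred = a.lang.equals(preferredLang);
        boolean bPreferred = b.lang.equals(preferredLang);
        if (aPreferred != bPreferred) {
            return aPreferred;
        }
        return a.score > b.score;
    }

    public static void main(String[] args) {
        for (Hit h : collapse(SAMPLE, "en")) {
            System.out.println(h.contentId + " " + h.lang);
        }
    }
}
```

For a user searching in English, the strict filter returns only the English version of doc1, while the collapsing variant returns doc1 in English plus doc2 in French, because doc2 has no English match. Which behaviour is right depends on whether showing a result in another language is better than showing nothing.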
Recommendation: The approach you choose depends on your project's requirements. Whenever possible, I would opt for a multiple-fields or multiple-indices approach.
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably be the best way if there are no overlapping or shared fields between the languages at all.
3) would be the way to go if several fields need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.
I would not recommend 2): it makes your search queries more complex and forces Lucene to consider more documents.
4) will make your search queries very complex, unless you want users to be able to search in any language without selecting it first.