Multilingual search with Lucene

Posted on 2024-12-02 12:04:09

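I am building a multilingual search and plan to use Lucene for it.

The content is already translated; each document exists in 3 or 4 languages.

For indexing and searching, I see four possible strategies for each document/content:

1. Each language is indexed in a separate index/directory.
2. Each language is indexed as a separate document, but in the same index.
3. Each language is indexed in a separate field, but in the same document.
4. All languages are indexed in the same field of a single document.

I have not tested each approach yet. Can someone with experience tell me which one works better for multilingual search?

Thanks!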


戴着白色围巾的女孩 2024-12-09 12:04:09

Although the question was asked a couple of years ago, it's still a great question.

There are a few aspects to consider when evaluating the different approaches:

1. Are language-specific analyzers used at indexing time?
2. Is the query language always known (e.g. user-selectable)?
3. Does the query language always match one of the "content" languages?
4. Should only content matching the query language be returned?
5. Is relevancy important?

If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages get mixed together (regardless of whether you index your multilingual content as one document or as multiple documents). It may also help to know that adding "n" language-specific fields does not make the index "n" times larger, although, for obvious reasons, it does add some overhead.
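To make the "mixed up" term statistics concrete, here is a minimal, library-free sketch. It does not use Lucene; the class name, the sample texts, and the field names `content_en` / `content_de` are all made up for illustration, and whitespace tokenization stands in for a real analyzer. It counts in how many documents the term "die" occurs, once for a shared field and once for language-specific fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MixedFieldStats {

    // Toy stand-in for a shared field: English and German texts in one term dictionary.
    static final List<String> SHARED = List.of(
            "the die is cast",         // English document
            "die Katze trinkt Milch",  // German document
            "die Sonne scheint",       // German document
            "roll the dice");          // English document

    // The same four texts split into hypothetical language-specific fields.
    static final Map<String, List<String>> PER_LANGUAGE = Map.of(
            "content_en", List.of("the die is cast", "roll the dice"),
            "content_de", List.of("die Katze trinkt Milch", "die Sonne scheint"));

    // Document frequency: number of field values (one per document) containing the term.
    static long docFreq(List<String> fieldValues, String term) {
        return fieldValues.stream()
                .filter(text -> Arrays.asList(
                        text.toLowerCase(Locale.ROOT).split("\\s+")).contains(term))
                .count();
    }

    public static void main(String[] args) {
        System.out.println("shared field: " + docFreq(SHARED, "die"));
        System.out.println("content_en:   " + docFreq(PER_LANGUAGE.get("content_en"), "die"));
        System.out.println("content_de:   " + docFreq(PER_LANGUAGE.get("content_de"), "die"));
    }
}
```

In the shared field, "die" appears in 3 of 4 documents, mostly because of the German article. In the English-only field it appears in 1 of 2. Any scoring that relies on how common a term is would therefore weight the English word "die" differently depending on how much German content happens to share the field.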

Single Field (Strategies 2 & 4)

+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)

Multiple Fields (Strategy 3)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
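
The "more fields to query" cost can be kept manageable with a small helper that decides which fields a query should target. The following is only a sketch under assumptions: the `content_<lang>` naming scheme, the language list, and the class itself are hypothetical and not part of Lucene. It queries a single field when the query language is known and supported, and falls back to all language fields otherwise.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class LanguageFields {

    // Hypothetical set of content languages in the index.
    static final List<String> LANGUAGES = List.of("en", "de", "fr");

    // Assumed naming convention: one text field per language.
    static String fieldFor(String lang) {
        return "content_" + lang;
    }

    // Known, supported language: query only its field.
    // Unknown or unsupported language: query every language field.
    static List<String> fieldsToQuery(Optional<String> queryLanguage) {
        return queryLanguage
                .filter(LANGUAGES::contains)
                .map(lang -> List.of(fieldFor(lang)))
                .orElseGet(() -> LANGUAGES.stream()
                        .map(LanguageFields::fieldFor)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(fieldsToQuery(Optional.of("de")));
        System.out.println(fieldsToQuery(Optional.empty()));
    }
}
```

In a Lucene application, the resulting field names could be passed on to a multi-field query, for example via `MultiFieldQueryParser`, with each field analyzed by its own language-specific analyzer.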

Multiple Indices (Strategy 1)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index

Regardless of whether you use a single field or multiple fields, if you index your content as multiple documents, your solution may need to handle result collapsing for matches in the "wrong" language. One approach is to add a language field and filter on it.
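
The language-field idea can be sketched without any search library. Everything here is an assumption for illustration: the `Hit` class, its `contentId` and `language` fields, and the sample data are hypothetical. `filterByLanguage` is the plain filter described above; `collapse` is one possible alternative that keeps a single hit per piece of content and prefers the query language when a translation matched as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LanguageCollapse {

    // Hypothetical search hit: one indexed document per (content, language) pair.
    static final class Hit {
        final String contentId;
        final String language;
        final double score;

        Hit(String contentId, String language, double score) {
            this.contentId = contentId;
            this.language = language;
            this.score = score;
        }
    }

    // Sample hits, already ordered by descending score.
    static final List<Hit> SAMPLE = List.of(
            new Hit("doc1", "de", 2.0),
            new Hit("doc1", "en", 1.5),
            new Hit("doc2", "fr", 1.0));

    // Plain filter on the language field: drop every hit in another language.
    static List<Hit> filterByLanguage(List<Hit> hits, String language) {
        return hits.stream()
                .filter(h -> h.language.equals(language))
                .collect(Collectors.toList());
    }

    // Collapse to one hit per contentId, keeping rank order and
    // swapping in the preferred-language version when one exists.
    static List<Hit> collapse(List<Hit> rankedHits, String preferredLanguage) {
        Map<String, Hit> byContent = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            Hit kept = byContent.get(hit.contentId);
            boolean upgrade = kept != null
                    && !kept.language.equals(preferredLanguage)
                    && hit.language.equals(preferredLanguage);
            if (kept == null || upgrade) {
                byContent.put(hit.contentId, hit); // re-putting a key keeps its position
            }
        }
        return new ArrayList<>(byContent.values());
    }

    public static void main(String[] args) {
        for (Hit h : collapse(SAMPLE, "en")) {
            System.out.println(h.contentId + " " + h.language);
        }
    }
}
```

Filtering is the simpler choice when only content in the query language should be returned; collapsing is worth considering when a match in another language is still better than no result at all.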

Recommendation: The approach/strategy you choose depends on your project's requirements. Whenever possible, I would opt for a multiple-fields or multiple-indices approach.
