Multilingual search with Lucene

Posted on 2024-12-02 12:04:09

I am building a multilingual search and plan to use Lucene for it.

The content is already translated; each document exists in 3 or 4 languages.

For indexing and searching, I see four possible strategies for each document:

  1. Each language is indexed in a separate index/directory.
  2. Each language is indexed as a separate document, all in the same index.
  3. Each language is indexed in a separate Field of the same document.
  4. All languages are indexed in the same Field of one document.

I have not tested any of these yet. Could anyone with experience tell me which is the better way to do multilingual search?

Thanks!

Comments (2)

戴着白色围巾的女孩 2024-12-09 12:04:09

Although the question was asked a couple of years ago, it's still a great question.

Several aspects matter when evaluating the different approaches:

  1. Are language-specific analyzers used at indexing time?
  2. Is the query language always known (e.g. user selectable)?
  3. Does the query language always match one of the "content" languages?
  4. Should only content matching the query language be returned?
  5. Is relevancy important?

If (1) and (5) apply to your project, you should not consider any strategy that reuses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages get mixed together (regardless of whether you index your multilingual content as one document or as several). Note that adding "n" language-specific fields does not make the index "n" times larger, although it does add some overhead.
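To make the "mixed term frequencies" point concrete, here is a small toy sketch in plain Java. It does not use Lucene; the class name `SharedFieldStats`, the sample sentences, and the simple `1 + ln(N / (df + 1))` weight are all assumptions chosen for illustration. It compares how rare the word "die" looks when English text has its own field versus when English and German share one field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model (not Lucene): shows how sharing one field across languages
// changes the statistics used for relevance scoring.
class SharedFieldStats {

    // Document frequency: number of texts that contain the term.
    static int docFreq(List<String> texts, String term) {
        int n = 0;
        for (String text : texts) {
            String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
            if (Arrays.asList(tokens).contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Illustrative inverse document frequency: 1 + ln(N / (df + 1)).
    // Assumed formula for this sketch; real similarity classes differ in detail.
    static double idf(List<String> texts, String term) {
        return 1.0 + Math.log((double) texts.size() / (docFreq(texts, term) + 1));
    }

    public static void main(String[] args) {
        List<String> english = List.of(
                "the plants die without water",
                "water the plants daily",
                "plants need light");
        List<String> german = List.of(
                "die pflanzen brauchen wasser",
                "die katze trinkt wasser",
                "die sonne scheint");

        // Strategy 2/4: both languages end up in the same field.
        List<String> shared = new ArrayList<>(english);
        shared.addAll(german);

        System.out.println("idf(die), English-only field: " + idf(english, "die"));
        System.out.println("idf(die), shared field:       " + idf(shared, "die"));
    }
}
```

In the shared field the English verb "die" inherits the document frequency of the German article "die", so it is weighted as if it were a common word. With one field (or index) per language the statistics stay separate.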

Single Field (Strategies 2 & 4)

+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)

Multiple Fields (Strategy 3)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
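The multiple-fields idea can be sketched with a toy inverted index in plain Java. This is not Lucene code: the class `MultiFieldIndex` and the `content_<lang>` field naming convention are hypothetical, and tokenization is just whitespace splitting. The point is only that each language keeps its own postings, so a query can be restricted to one language by choosing the field.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (not Lucene): one document, one field per language.
class MultiFieldIndex {

    // field name -> term -> ids of documents containing the term in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Hypothetical naming convention, e.g. "content_en", "content_de".
    static String fieldFor(String lang) {
        return "content_" + lang;
    }

    // Index all translations of one document, each into its own language field.
    void add(int docId, Map<String, String> textByLang) {
        textByLang.forEach((lang, text) -> {
            Map<String, Set<Integer>> field =
                    postings.computeIfAbsent(fieldFor(lang), f -> new HashMap<>());
            for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                field.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        });
    }

    // Look up a term in the field of the given query language only.
    Set<Integer> search(String lang, String term) {
        return postings.getOrDefault(fieldFor(lang), Map.of())
                .getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        MultiFieldIndex index = new MultiFieldIndex();
        index.add(1, Map.of("en", "die hard", "de", "stirb langsam"));
        index.add(2, Map.of("en", "the cat", "de", "die katze"));

        System.out.println("en:die -> " + index.search("en", "die"));
        System.out.println("de:die -> " + index.search("de", "die"));
    }
}
```

In a real Lucene setup the same layout would usually be paired with a language-specific analyzer per field; that part is left out here.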

Multiple Indices (Strategy 1)

+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index

Regardless of whether you use a single field or multiple fields, if you index each translation as a separate document, your solution may need to collapse results that match in the "wrong" language. One approach is to add a language field and filter on it.

Recommendation: the approach you choose depends on your project's requirements. Whenever possible, I would opt for the multiple-fields or multiple-indices approach.
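As a rough illustration of the two options mentioned here (filtering on a language field, or collapsing translations of the same content into one result), consider the following plain-Java sketch. It works on an already retrieved hit list rather than on a Lucene index, and the names `LanguageFilter`, `Hit`, and `contentId` are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (not Lucene): post-processing hits when every translation
// is indexed as its own document with an extra language field.
class LanguageFilter {

    // One hit per indexed translation; contentId links translations of the same content.
    static final class Hit {
        final String contentId;
        final String lang;
        final double score;

        Hit(String contentId, String lang, double score) {
            this.contentId = contentId;
            this.lang = lang;
            this.score = score;
        }
    }

    // Option A: keep only hits whose language field equals the query language.
    static List<Hit> filterByLanguage(List<Hit> hits, String lang) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (h.lang.equals(lang)) {
                out.add(h);
            }
        }
        return out;
    }

    // Option B: collapse to the best-scoring hit per contentId, sorted by score descending.
    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.contentId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> out = new ArrayList<>(best.values());
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("a", "en", 2.0),
                new Hit("a", "de", 3.0),
                new Hit("b", "en", 1.0));

        System.out.println("English-only hits: " + filterByLanguage(hits, "en").size());
        System.out.println("Collapsed hits:    " + collapse(hits).size());
    }
}
```

Filtering is the simpler choice when the query language is known. Collapsing is useful when a match in any language should surface the content once.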

最后的乘客 2024-12-09 12:04:09

In short, it depends on your needs, but I would go with option 3 or 1.

1) would probably be the best way if there is no overlap (no shared fields) between the languages at all.

3) would be the way to go if several fields need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.

I would not recommend 2): it makes your search queries more complex and forces Lucene to consider more documents.

4) will make your search queries very complex, unless you want users to be able to search in any language without selecting it first.
