Lucene multilingual text fields

Posted on 2024-10-24 19:00:51

I have looked at this question, "Indexing multilingual words in Lucene", and it confirmed some of my suspicions.

I have an entity with a number of fields I wish to index. One of these fields can be in any one of several languages, and I need to use a different analyzer for each language.

Would it be better to implement this as different fields in the same index, or as a separate index for each language?

I am guessing that the trade-off is between the overhead of running multiple indexes and the suckiness of cluttering up a single index.

Any advice appreciated.

Comments (1)

月亮邮递员 2024-10-31 19:00:51

One additional idea that you didn't mention: you can make each language a non-stored, non-indexed field. Then you can copy all the (analyzed) data to a single stored+indexed field, and it will behave as though you're searching a single field. (This is analogous to Solr's "copy fields"; I'm not sure how hard it would be to do in Hibernate.)
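To make the catch-all idea concrete, here is a minimal sketch in plain Java. It is a toy model, not the Lucene or Solr API: the per-language "analyzers" are hypothetical stand-ins (lower-casing plus a tiny stop-word list), and a single map plays the role of the one indexed field that every language's analyzed tokens are copied into.

```java
import java.util.*;
import java.util.function.Function;

class CatchAllFieldSketch {
    // Hypothetical toy analyzer: lower-case, split on non-letters, drop stop words.
    static Function<String, List<String>> analyzer(Set<String> stopWords) {
        return text -> {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!t.isEmpty() && !stopWords.contains(t)) tokens.add(t);
            }
            return tokens;
        };
    }

    // One analyzer per language (assumed languages and stop words, for illustration only).
    static final Map<String, Function<String, List<String>>> ANALYZERS = Map.of(
        "english", analyzer(Set.of("the", "a", "of")),
        "dutch", analyzer(Set.of("de", "het", "een")));

    // The single catch-all field: term -> ids of documents containing it.
    final Map<String, Set<Integer>> allField = new HashMap<>();

    // Analyze the text with its own language's analyzer, then copy the tokens
    // into the shared field.
    void add(int docId, String language, String text) {
        for (String term : ANALYZERS.get(language).apply(text)) {
            allField.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search the shared field without knowing which language a document was in.
    Set<Integer> search(String term) {
        return allField.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        CatchAllFieldSketch index = new CatchAllFieldSketch();
        index.add(1, "english", "The history of the bicycle");
        index.add(2, "dutch", "De geschiedenis van de fiets");
        System.out.println(index.search("bicycle")); // [1]
        System.out.println(index.search("fiets"));   // [2]
        System.out.println(index.search("de"));      // [] ("de" is a Dutch stop word here)
    }
}
```

One thing the sketch glosses over: at query time you no longer know which language's analyzer to apply to the search term, so the query-side analysis has to be something that is compatible with all of them (here, just lower-casing).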

If you keep them in separate indexes, you should note that you won't be able to search across languages easily (or, arguably, at all). So if you want to allow queries like "english:foo dutch:foo", you'll need them in the same index.
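As a rough illustration of why a query like "english:foo dutch:foo" wants both fields in one index, here is a toy model in plain Java (again, not the Lucene API). Each language is a separate field with its own posting map and its own analysis, and a fielded query is evaluated as an OR over its clauses. The plural-stripping rule for the "english" field is a made-up stand-in for a real stemmer; in Lucene itself, I believe `PerFieldAnalyzerWrapper` is the usual way to map field names to analyzers.

```java
import java.util.*;

class PerLanguageFieldsSketch {
    // field name -> (term -> doc ids): one posting map per language field,
    // all living in the same "index" object.
    final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Hypothetical per-field analysis: lower-case everywhere, plus a crude
    // trailing-"s" strip for the "english" field only.
    static List<String> analyze(String field, String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (t.isEmpty()) continue;
            if (field.equals("english") && t.length() > 3 && t.endsWith("s")) {
                t = t.substring(0, t.length() - 1);
            }
            tokens.add(t);
        }
        return tokens;
    }

    void add(int docId, String field, String text) {
        Map<String, Set<Integer>> postings =
            fields.computeIfAbsent(field, k -> new HashMap<>());
        for (String term : analyze(field, text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Evaluates a query such as "english:foo dutch:foo" as an OR over its
    // field:term clauses, analyzing each term with that field's rules.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String clause : query.trim().split("\\s+")) {
            String[] parts = clause.split(":", 2);
            if (parts.length < 2) continue; // skip clauses without a field prefix
            Map<String, Set<Integer>> postings =
                fields.getOrDefault(parts[0], Map.of());
            for (String term : analyze(parts[0], parts[1])) {
                hits.addAll(postings.getOrDefault(term, Set.of()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PerLanguageFieldsSketch index = new PerLanguageFieldsSketch();
        index.add(1, "english", "Two pianos");
        index.add(2, "dutch", "Een piano");
        index.add(3, "dutch", "Een fiets");
        System.out.println(index.search("english:piano dutch:piano")); // [1, 2]
        System.out.println(index.search("dutch:pianos"));              // []
    }
}
```

With separate indexes, the two clauses would have to be run against two different indexes and the results merged by hand, which is the difficulty the paragraph above is pointing at.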

From a performance standpoint, it would depend on how much data is shared. If the documents are disjoint (i.e. no document has two languages in it) then there probably won't be that much of a difference between having it in one index vs. two. The more data they share, the more memory Lucene will duplicate, so it will become better to have one index. My guess is that this is only an issue if you have a lot of stored data, but YMMV.
