Solr language-detecting update processor for denormalized mixed-language documents
I have a database of things, with each thing being able to have several names in different languages. This is currently normalized to a thing has-many names schema:
things
------
id
...
names
-----
id
thing_id
language
name
I am indexing this using Solr and am trying to figure out the best way to denormalize this into a Lucene schema. This one works okay:
<fields>
  <field name="id" type="uuid" indexed="true" stored="true" required="true" />
  ...
  <field name="name_eng" type="text_eng" indexed="true" stored="true" />
  <field name="name_jpn" type="text_cjk" indexed="true" stored="true" />
  <field name="name_kor" type="text_cjk" indexed="true" stored="true" />
</fields>
The problem is that I need to specify a field and field type for each supported language individually, and there may be a lot. Since I also use the SQL DataImportHandler, it means I have to duplicate a lot of code to specify SQL queries to import these from the database into this schema. Further, the language
field of the names is not always correct since it's based on user input.
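To make the duplication concrete, a DataImportHandler data-config.xml for this layout would need one near-identical child entity per supported language. The sketch below uses the table and column names from the schema above; the entity structure, driver, and connection URL are assumptions for illustration:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/things_db"/>
  <document>
    <entity name="thing" query="SELECT id FROM things">
      <!-- One near-identical child entity per supported language -->
      <entity name="name_eng"
              query="SELECT name AS name_eng FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'eng'"/>
      <entity name="name_jpn"
              query="SELECT name AS name_jpn FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'jpn'"/>
      <entity name="name_kor"
              query="SELECT name AS name_kor FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'kor'"/>
      <!-- ...repeated for every additional language -->
    </entity>
  </document>
</dataConfig>
```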
I was looking at the language detection capabilities Solr offers, which look very good. But they only seem to work on documents as a whole, which in this case won't help a lot I guess. Is there a way to specify a single multiValued
field in the schema in which I can store names, whose language will be automatically detected and indexed accordingly? Or other ways in which the language detection facilities could make my life easier here?
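For reference, Solr's language detection is configured as an update request processor chain in solrconfig.xml. A minimal sketch, using the field names from the schema above, might look like the following; note that the langid processor classifies each listed field as a whole and emits ISO 639-1 codes (en, ja, ko), so the mapped target fields would be name_en rather than name_eng, and it does not detect a separate language per value of a multiValued field:

```xml
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Field(s) to run detection on -->
    <str name="langid.fl">name</str>
    <!-- Where to store the detected language code -->
    <str name="langid.langField">language</str>
    <!-- Rename the field to e.g. name_en / name_ja based on the detected code -->
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">en,ja,ko</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```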
You could probably write a transformer that would do that on the index side, but the query side would not get the same analysis chain, so that wouldn't work.
What does the text for these "things" look like?
If it is less than about 200 characters, language ID will not work very well. Think of it as "language guessing", with a statistical approach. With small amounts of data, guesses are bad. Is "mobile" English or Danish? Both, really. "Die" is English and German, and so on. For a good guess, a thousand characters would be helpful.
Does the text have trademarked names? "LaserJet" and "Linux" are the same in all languages and rarely inflected, so linguistic processing just doesn't do anything. Maybe you can get by without language-specific stemming.
Finally, you might consider n-grams instead of linguistic processing. It is a completely different model from language-sensitive matching, but it might work better for this. In a sense, it is doing the same sort of statistical pattern matching as language ID, but at query time instead of at index time. It will take short sequences of patterns from the query and look for those in the text. It takes more time and space, but it is worth a try.
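If you want to experiment with that, an n-gram analysis chain in schema.xml could be sketched roughly like this; the fieldType name is made up, while the tokenizer and filter classes are standard Solr factories, with gram sizes chosen only as an example:

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Break each token into overlapping 2- and 3-character grams -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>

<!-- A single language-agnostic, multiValued name field using that type -->
<field name="name" type="text_ngram" indexed="true" stored="true" multiValued="true"/>
```

With this, all names go into one field regardless of language, and matching happens on shared character grams rather than language-specific stems, at the cost of a larger index.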