Solr language-detecting update processor for denormalized mixed-language documents
I have a database of things, with each thing being able to have several names in different languages. This is currently normalized to a thing has-many names schema:
things
------
id
...
names
-----
id
thing_id
language
name
I am indexing this using Solr and am trying to figure out the best way to denormalize this into a Lucene schema. This one works okay:
<fields>
  <field name="id" type="uuid" indexed="true" stored="true" required="true" />
  ...
  <field name="name_eng" type="text_eng" indexed="true" stored="true" />
  <field name="name_jpn" type="text_cjk" indexed="true" stored="true" />
  <field name="name_kor" type="text_cjk" indexed="true" stored="true" />
</fields>
The problem is that I need to specify a field and field type for each supported language individually, and there may be a lot. Since I also use the SQL DataImportHandler, it means I have to duplicate a lot of code to specify SQL queries to import these from the database into this schema. Further, the language
field of the names is not always correct since it's based on user input.
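To make the duplication concrete, a DataImportHandler data-config.xml for this layout would need one near-identical child entity per supported language. The sketch below uses the table and column names from the schema above; the entity structure, driver, and connection URL are assumptions for illustration:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/things_db"/>
  <document>
    <entity name="thing" query="SELECT id FROM things">
      <!-- One near-identical child entity per supported language -->
      <entity name="name_eng"
              query="SELECT name AS name_eng FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'eng'"/>
      <entity name="name_jpn"
              query="SELECT name AS name_jpn FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'jpn'"/>
      <entity name="name_kor"
              query="SELECT name AS name_kor FROM names
                     WHERE thing_id = '${thing.id}' AND language = 'kor'"/>
      <!-- ...repeated for every additional language -->
    </entity>
  </document>
</dataConfig>
```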
I was looking at the language detection capabilities Solr offers, which look very good. But they only seem to work on documents as a whole, which in this case won't help a lot I guess. Is there a way to specify a single multiValued
field in the schema in which I can store names, whose language will be automatically detected and indexed accordingly? Or other ways in which the language detection facilities could make my life easier here?
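For reference, Solr's language detection is configured as an update request processor chain in solrconfig.xml. A minimal sketch, using the field names from the schema above, might look like the following; note that the langid processor classifies each listed field as a whole and emits ISO 639-1 codes (en, ja, ko), so the mapped target fields would be name_en rather than name_eng, and it does not detect a separate language per value of a multiValued field:

```xml
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Field(s) to run detection on -->
    <str name="langid.fl">name</str>
    <!-- Where to store the detected language code -->
    <str name="langid.langField">language</str>
    <!-- Rename the field to e.g. name_en / name_ja based on the detected code -->
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">en,ja,ko</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```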
You could probably write a transformer that would do that on the index side, but the query side would not get the same analysis chain, so that wouldn't work.
What does the text for these "things" look like?
If it is less than about 200 characters, language ID will not work very well. Think of it as "language guessing", with a statistical approach. With small amounts of data, guesses are bad. Is "mobile" English or Danish? Both, really. "Die" is English and German, and so on. For a good guess, a thousand characters would be helpful.
Does the text have trademarked names? "LaserJet" and "Linux" are the same in all languages and rarely inflected, so linguistic processing just doesn't do anything. Maybe you can get by without language-specific stemming.
Finally, you might consider n-grams instead of linguistic processing. It is a completely different model from language-sensitive matching, but it might work better for this. In a sense, it is doing the same sort of statistical pattern matching as language ID, but at query time instead of at index time. It will take short sequences of patterns from the query and look for those in the text. It takes more time and space, but it is worth a try.
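If you want to experiment with that, an n-gram analysis chain in schema.xml could be sketched roughly like this; the fieldType name is made up, while the tokenizer and filter classes are standard Solr factories, with gram sizes chosen only as an example:

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Break each token into overlapping 2- and 3-character grams -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>

<!-- A single language-agnostic, multiValued name field using that type -->
<field name="name" type="text_ngram" indexed="true" stored="true" multiValued="true"/>
```

With this, all names go into one field regardless of language, and matching happens on shared character grams rather than language-specific stems, at the cost of a larger index.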