Solr: multilingual indexing and multiValued fields with DIH?

Posted on 2024-10-01 19:01:40

I have a MySQL table:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    tags CHAR(30),
    text TEXT,
    PRIMARY KEY (id)
);

I have 2 questions about Solr DIH:

1) The language_code field indicates which language the text field is written in. Depending on the language, I want to index text into a different Solr field.

# pseudo code

if language_code == "en":
    index "text" to Solr field "text_en"
elif language_code == "fr":
    index "text" to Solr field "text_fr"
elif language_code == "zh":
    index "text" to Solr field "text_zh"
...

Can DIH handle a use case like this? How do I configure it to do so?

2) The tags field needs to be indexed into a Solr multiValued field. The values are stored in a single comma-separated string. For example, if tags contains the string "blue, green, yellow", I want to index the three values "blue", "green", and "yellow" into the multiValued field.

How do I do that with DIH?

Thanks.

Comments (2)

愁杀 2024-10-08 19:01:40

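First, your schema needs to accept the per-language fields, for example with a dynamic field like the one below. Note that a field type named "string" is normally not tokenized, so use an analyzed, language-appropriate type if you need full-text search: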

<dynamicField name="text_*" type="string" indexed="true" stored="true" />

Then, in your DIH config, attach a script transformer to the entity, something like this:

<entity name="document" dataSource="ds1" transformer="script:ftextLang" query="SELECT * FROM documents" />

Define the script just below the dataSource element:

<script><![CDATA[
  // Copy the row's text into a language-specific field such as text_en or text_fr.
  function ftextLang(row) {
    var name = row.get('language_code');
    var value = row.get('text');
    row.put('text_' + name, value);
    return row;
  }
]]></script>
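
The script above covers question 1 only, and it would write a field called "text_null" for rows that have no language code. Below is a minimal sketch, not a definitive implementation, of a more defensive ftextLang plus a helper for the comma-separated tags from question 2. FakeRow and splitTags are hypothetical names introduced here; FakeRow only imitates the get/put calls of the row map so the functions can be tried outside Solr. It is an assumption that, inside DIH, the split values would need to be copied into a Java collection such as java.util.ArrayList before being put back into the row. DIH's RegexTransformer also offers a splitBy attribute that may handle the split declaratively; check the documentation for your Solr version.

```javascript
// Stand-in for the map-like row that DIH passes to a script transformer.
// Hypothetical helper, only needed to try the functions outside Solr.
function FakeRow(data) {
  this.data = data;
}
FakeRow.prototype.get = function (key) {
  return this.data.hasOwnProperty(key) ? this.data[key] : null;
};
FakeRow.prototype.put = function (key, value) {
  this.data[key] = value;
};

// Question 1: copy "text" into text_<language_code>. Rows with a missing
// language code or missing text are returned unchanged.
function ftextLang(row) {
  var code = row.get('language_code');
  var value = row.get('text');
  if (code === null || code === undefined || value === null || value === undefined) {
    return row;
  }
  // Trim and lowercase so that " EN " and "en" end up in the same field.
  var suffix = String(code).replace(/^\s+|\s+$/g, '').toLowerCase();
  if (suffix.length > 0) {
    row.put('text_' + suffix, value);
  }
  return row;
}

// Question 2: split "blue, green, yellow" into trimmed, non-empty values.
function splitTags(raw) {
  var out = [];
  if (raw === null || raw === undefined) {
    return out;
  }
  var parts = String(raw).split(',');
  for (var i = 0; i < parts.length; i++) {
    var tag = parts[i].replace(/^\s+|\s+$/g, '');
    if (tag.length > 0) {
      out.push(tag);
    }
  }
  return out;
}

var row = new FakeRow({ language_code: 'FR', text: 'Bonjour', tags: 'blue, green, yellow' });
ftextLang(row);
console.log(row.get('text_fr'));          // Bonjour
console.log(splitTags(row.get('tags')));  // [ 'blue', 'green', 'yellow' ]
```

The loops and regex-based trimming avoid newer JavaScript features on purpose, since the script engine bundled with an older JVM may not support them.
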
情话已封尘 2024-10-08 19:01:40

I'm sorry I don't have a direct answer to your DIH question, though I'd be interested to know it.

I did notice your two-letter language code and suggest a five-letter slot instead. Some languages have nontrivial dialect differences, for example Simplified Chinese vs. Traditional Chinese. For morphological analysis, the SmartCN filter can handle zh-cn, but not zh-tw, etc.

Portuguese and Spanish are also languages where we've been warned against mixing all dialects together, although the differences are less drastic, and both would still be searchable.

Of course you may have already known this, and just didn't add it to the question to keep it simple. It's just a subject very fresh in my mind.
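
If five-letter codes such as zh-cn were stored, the field name could still be derived from the code. The following is only a sketch under assumptions: fieldForLocale is a hypothetical name, the "text_" prefix follows the dynamicField pattern shown earlier in this thread, and the hyphen is replaced with an underscore because Solr field names are best kept to letters, digits, and underscores. The language_code column would also need to be wider than CHAR(2).

```javascript
// Map a locale code such as "zh-CN" or "pt_BR" to a field name such as
// "text_zh_cn". Returns null for anything that does not look like
// "xx" or "xx-yy", so the caller can decide how to handle unknown codes.
function fieldForLocale(code) {
  if (code === null || code === undefined) {
    return null;
  }
  var normalized = String(code)
    .replace(/^\s+|\s+$/g, '')
    .toLowerCase()
    .replace(/-/g, '_');
  if (!/^[a-z]{2}(_[a-z]{2})?$/.test(normalized)) {
    return null;
  }
  return 'text_' + normalized;
}

console.log(fieldForLocale('zh-CN')); // text_zh_cn
console.log(fieldForLocale('en'));    // text_en
console.log(fieldForLocale('??'));    // null
```

Keeping zh_cn and zh_tw in separate fields lets each one use its own analyzer, which is the point raised above.
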
