Solr letter conversion / umlaut support

Posted on 2024-12-10 05:58:10

I'm using Solr 3.x with a focus on German text, which works well.
Searching for umlauts (öäüß) also works well.

The problem is:
I received some archived text from the late 80s, when most computers and software did not support anything beyond ASCII, and in particular no German umlauts.
For this reason an alternative notation was used:

ae instead of ä
oe instead of ö
ue instead of ü
ss instead of ß

That means the name Müller was saved as Mueller.
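
To make the notation concrete, here is a minimal Java sketch of the replacement table above. The class and method names (`UmlautNotation`, `toArchiveNotation`) are made up for illustration, it covers only the four lowercase characters listed, and it is not something Solr provides; it only shows how the archive spelling relates to the original one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UmlautNotation {
    // Replacement table copied from the list above (hypothetical helper, lowercase forms only).
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u00E4", "ae"); // ä
        REPLACEMENTS.put("\u00F6", "oe"); // ö
        REPLACEMENTS.put("\u00FC", "ue"); // ü
        REPLACEMENTS.put("\u00DF", "ss"); // ß
    }

    /** Rewrites umlauts and ß into the ASCII-only notation used in the archive. */
    public static String toArchiveNotation(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // "M\u00FCller" is Müller, written with an escape to avoid source-encoding issues.
        System.out.println(toArchiveNotation("M\u00FCller")); // prints "Mueller"
    }
}
```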

Back to Solr: I now need to find documents which contain ue, even if the user searched for ü.

Example: If I search for all text messages from a person called Müller,
Solr has to find text containing Mueller as well as Müller.

How can I handle this?

Is this the right feature for that? --> http://wiki.apache.org/solr/UnicodeCollation (I'm not sure whether I understand the documentation completely)

By the way, changing the source text by "search and replace" (all oe to ö) is not an option.

Comments (2)

您的好友蓝忘机已上羡 2024-12-17 05:58:10

As Paige Cook already pointed out, you have already found the relevant documentation, but since not every Solr user knows Java, I decided to write my own answer with a little more detail.

The first step is to add the filter to your field definition:

<!-- The name "collatedCUSTOM" is only a placeholder; pick your own. -->
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- BEGIN OF IMPORTANT PART -->
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
    <!-- END OF IMPORTANT PART -->
  </analyzer>
</fieldType>

The next step is to create the necessary customRules.dat file:

You have to write a tiny Java program in order to follow the documentation. Unfortunately, for non-Java programmers this is a little difficult, since the code snippet there only shows the important parts. It also uses a third-party library that is not distributed with the JDK (Apache Commons IO).

Here's the full Java 7 code necessary to write a customRules.dat without the use of external libraries:

import java.io.*;
import java.text.*;
import java.util.*;

public class RulesWriter {
    public static void main(String[] args) throws Exception {
        // Start from the JDK's default German collation rules.
        RuleBasedCollator baseCollator = (RuleBasedCollator)
                Collator.getInstance(new Locale("de", "DE"));

        // DIN 5007-2 tailorings: place each umlaut (base letter + combining
        // diaeresis U+0308) at ae/oe/ue with only a tertiary difference.
        String DIN5007_2_tailorings =
          "& ae , a\u0308 & AE , A\u0308"+
          "& oe , o\u0308 & OE , O\u0308"+
          "& ue , u\u0308 & UE , U\u0308";

        RuleBasedCollator tailoredCollator = new RuleBasedCollator(
                baseCollator.getRules() + DIN5007_2_tailorings);
        String tailoredRules = tailoredCollator.getRules();

        // Write the rules as UTF-8; try-with-resources (Java 7) closes the file.
        try (Writer fw = new OutputStreamWriter(
                new FileOutputStream("c:/customRules.dat"), "UTF-8")) {
            fw.write(tailoredRules);
        }
    }
}

Disclaimer: The above code compiles and creates a customRules.dat file, but I didn't actually test the created file with Solr.
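
Since the generated file is untested, one way to sanity-check the rules outside Solr might be to build the same tailored collator in memory and compare the two spellings at primary strength. The following is only a sketch: the class name `CollationCheck` and its methods are made up, it assumes that `strength="primary"` in the filter corresponds to `Collator.PRIMARY` in `java.text`, and I have not verified what it prints for Mueller versus Müller.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollationCheck {
    /** Builds the German collator with the same tailorings as RulesWriter (hypothetical helper). */
    static RuleBasedCollator tailoredCollator() {
        RuleBasedCollator base = (RuleBasedCollator)
                Collator.getInstance(new Locale("de", "DE"));
        String tailorings =
            "& ae , a\u0308 & AE , A\u0308" +
            "& oe , o\u0308 & OE , O\u0308" +
            "& ue , u\u0308 & UE , U\u0308";
        try {
            RuleBasedCollator tailored =
                    new RuleBasedCollator(base.getRules() + tailorings);
            // Primary strength ignores tertiary differences such as the ones tailored above.
            tailored.setStrength(Collator.PRIMARY);
            return tailored;
        } catch (ParseException e) {
            throw new IllegalStateException("Collation rules could not be parsed", e);
        }
    }

    /** Returns true if the collator treats both strings as equal at primary strength. */
    static boolean equalAtPrimaryStrength(String a, String b) {
        return tailoredCollator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "M\u00FCller" is Müller; the result is printed rather than assumed.
        System.out.println("Mueller == M\u00FCller at primary strength: "
                + equalAtPrimaryStrength("Mueller", "M\u00FCller"));
    }
}
```

If this prints `false`, the tailoring rules would need another look before indexing anything with them.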

烧了回忆取暖 2024-12-17 05:58:10

From my reading of the Unicode Collation link you provided, this is exactly the right feature, as it shows how to solve the very issue you are having.

It looks like you will need to write a little Java to generate the appropriate customRules.dat file.
