Solr letter substitution / umlaut support

Posted on 2024-12-10 05:58:10


I'm using Solr 3.x with a focus on German text, which works well.
Searching for umlauts (öäüß) also works well.

The problem is:
I received some archived text from the late 80s, when most computers and software supported nothing beyond ASCII; in particular, German umlauts were not supported.
For this, an alternative notation was used:

ae instead of ä
oe instead of ö
ue instead of ü
ss instead of ß

That means, the name Müller was saved as Mueller.

Back to Solr: I now need to find documents that contain ue, even when the user searches for ü.

Example: if I want to search for all text messages from the person called Müller,
Solr has to find texts containing Mueller as well as Müller.

How can I handle this?

Is this the appropriate feature? --> http://wiki.apache.org/solr/UnicodeCollation (I'm not sure I understand the documentation completely.)

By the way, it's not an option to change the source text with a "search and replace" of all oe to ö.
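To illustrate why a blanket search-and-replace would corrupt the data, here is a minimal sketch; Goethe is just one example of a name that legitimately contains "oe":

```java
public class NaiveReplace {
    public static void main(String[] args) {
        // Blind substitution corrupts names that really contain "oe":
        // "Goethe" becomes "Göthe", which is wrong.
        System.out.println("Goethe".replace("oe", "\u00F6"));
    }
}
```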


Answers (2)

您的好友蓝忘机已上羡 2024-12-17 05:58:10


As Paige Cook pointed out, you have already found the relevant documentation, but since not every Solr user knows Java, I decided to write my own answer with a little more detail.

The first step is to add the filter to your field definition:

<fieldType>
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- BEGIN OF IMPORTANT PART -->
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
    <!-- END OF IMPORTANT PART -->
  </analyzer>
</fieldType>
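Note that the `<fieldType>` above is trimmed to the important part; to actually load, the declaration also needs `name` and `class` attributes, and a field that uses the type. A sketch, where the names `text_de_phonebook` and `name_sort` are placeholders of my own and `customRules.dat` is expected in the core's `conf` directory:

```xml
<fieldType name="text_de_phonebook" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"/>
  </analyzer>
</fieldType>
<field name="name_sort" type="text_de_phonebook" indexed="true" stored="false"/>
```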

The next step is to create the necessary customRules.dat file:

You have to create a tiny Java program in order to follow the documentation. Unfortunately, for non-Java programmers this is a little difficult, since the code snippet only shows the important parts. Also, it uses a third-party library (Apache Commons IO) that is not distributed with the JDK.

Here's the full Java 7 code needed to write a customRules.dat without the use of external libraries:

import java.io.*;
import java.text.*;
import java.util.*;

public class RulesWriter {
    public static void main(String[] args) throws Exception {
        RuleBasedCollator baseCollator = (RuleBasedCollator) 
                Collator.getInstance(new Locale("de", "DE"));

        // DIN 5007-2 ("phonebook") tailorings: ä/ö/ü sort like ae/oe/ue
        String DIN5007_2_tailorings =
          "& ae , a\u0308 & AE , A\u0308"+
          "& oe , o\u0308 & OE , O\u0308"+
          "& ue , u\u0308 & UE , U\u0308";

        RuleBasedCollator tailoredCollator = new RuleBasedCollator(
                baseCollator.getRules() + DIN5007_2_tailorings);
        String tailoredRules = tailoredCollator.getRules();

        Writer fw = new OutputStreamWriter(
                new FileOutputStream("c:/customRules.dat"), "UTF-8");
        fw.write(tailoredRules);
        fw.flush();
        fw.close();
    }
}

Disclaimer: The above code compiles and creates a customRules.dat file, but I didn't actually test the created file with Solr.
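The tailorings themselves can at least be checked at the Java level, without Solr. A small sketch using the same rules: with canonical decomposition enabled and strength set to PRIMARY (matching `strength="primary"` in the filter config), the tailored collator should treat Müller and Mueller as equal, so `compare` returns 0:

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollatorCheck {
    public static void main(String[] args) throws Exception {
        RuleBasedCollator base = (RuleBasedCollator)
                Collator.getInstance(new Locale("de", "DE"));

        // Same DIN 5007-2 tailorings as written into customRules.dat
        String tailorings =
            "& ae , a\u0308 & AE , A\u0308" +
            "& oe , o\u0308 & OE , O\u0308" +
            "& ue , u\u0308 & UE , U\u0308";

        RuleBasedCollator tailored =
                new RuleBasedCollator(base.getRules() + tailorings);
        // Decompose precomposed ü (U+00FC) into u + combining diaeresis
        // so it matches the u\u0308 form used in the rules.
        tailored.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // PRIMARY strength ignores the "," (lower-level) differences,
        // so ü and ue become equal.
        tailored.setStrength(Collator.PRIMARY);

        // Expect 0 (equal at primary strength)
        System.out.println(tailored.compare("M\u00FCller", "Mueller"));
    }
}
```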

烧了回忆取暖 2024-12-17 05:58:10


From my interpretation of the Unicode Collation link you provided, this is definitely the right feature, as it shows how to solve exactly the issue you are having.

It looks like you will need to write a little Java to generate the appropriate customRules.dat file.
