Where can I find a set of specific collation rules for string equality comparison?

Posted on 2024-12-20 03:36:13


We all know that using String's equals() method for equality comparison of natural-language text will fail miserably. Instead, one should use a Collator, like this:

import java.text.Collator;
import java.util.Locale;

// The user interface locale has to be detected somehow; it is hard-coded here
Locale uiLocale = Locale.forLanguageTag("da-DK");
// Set up the collator: SECONDARY strength ignores case differences
Collator collator = Collator.getInstance(uiLocale);
collator.setStrength(Collator.SECONDARY);
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
// Strings for equality testing
String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover graekenland støtte";
boolean result = collator.equals(test1, test2);

Now, this code works, that is, result is true, unless uiLocale is set to Danish. In that case it yields false. I understand why this happens: the equals method is implemented like this:

return compare(s1, s2) == Collator.EQUAL;

This method calls the one that is used for sorting and checks whether it reports the strings as equal. It does not, because the Danish-specific collation rules require æ to be sorted after ae (if I understand the result of the compare method correctly). However, these strings are really the same: at this strength, both case differences and such compatibility characters (that's what they are called) should be treated as equal.

To fix this, one would use a RuleBasedCollator with a specific set of rules that works for the equality case.
Finally, the question is: does anyone know where I can get such specific rules (not only for Danish, but for other languages as well), so that compatibility characters, ligatures, etc. are treated as equal? (The CLDR charts do not seem to contain them, or I failed to find them.)

Or maybe I am trying to do something stupid here, and I should simply use the UCA for equality comparison (any code sample, please)?


灯角 2024-12-27 03:36:13


I can't find any existing Collator for Danish; the built-in one for the Danish locale is supposed to be correct. I am not sure that your assumption that ae should be sorted with æ holds, specifically because of certain foreign words in Danish, for example "aerofobi" (I am not a Danish speaker, though I do speak Swedish).

But if you want to sort them together, there seem to be two ways to do it, depending on the context you're in. In certain contexts, just replacing the characters might be appropriate:

String str = "USA lover graekenland støtte";
String sortStr = str.replace("ae", "æ");
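A minimal sketch of that replacement approach applied to equality rather than sorting. The class AeFolding and its methods foldKey and sameAfterFolding are hypothetical names, not part of any library; the sketch assumes it is acceptable to lowercase both strings and to rewrite every "ae" as "æ", which also folds foreign words such as "aerobics".

```java
import java.text.Normalizer;
import java.util.Locale;

class AeFolding {
    // Hypothetical helper: builds a comparison key for one string.
    static String foldKey(String s) {
        // Compose sequences such as a + U+030A into a single code point first.
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        // Ignore case, then treat the digraph "ae" as the letter æ (U+00E6).
        return nfc.toLowerCase(Locale.ROOT).replace("ae", "\u00E6");
    }

    // Two strings count as equal when their folded keys match.
    static boolean sameAfterFolding(String a, String b) {
        return foldKey(a).equals(foldKey(b));
    }

    public static void main(String[] args) {
        String test1 = "USA lover Gr\u00E6kenland st\u00F8tte";
        String test2 = "USA lover graekenland st\u00F8tte";
        System.out.println(sameAfterFolding(test1, test2));            // prints "true"
        // The same folding also equates a foreign word with its æ spelling.
        System.out.println(sameAfterFolding("aerobics", "\u00E6robics")); // prints "true"
    }
}
```

Whether the second result is desirable depends on the application; for search-style matching it may be fine, for dictionary ordering it is not.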

The other, perhaps better, option is the one you mentioned: using a RuleBasedCollator. Adapting the example from the javadocs, this is fairly simple:

String danish = "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
                "< j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
                "< s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z" +
                "< \u00E6 = ae," +       // Latin letter ae
                "  \u00C6 = AE " +       // Latin letter AE
                "< \u00F8, \u00D8" +     // Latin letter o & O with stroke
                "< \u00E5 = a\u030A," +  // Latin letter a with ring above
                "  \u00C5 = A\u030A;" +  // Latin letter A with ring above
                "  aa, AA";
// The constructor throws java.text.ParseException if the rules are malformed
RuleBasedCollator danishCollator = new RuleBasedCollator(danish);

You can then use it like this:

String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover Graekenland støtte";         // note capital 'G'
boolean result = danishCollator.equals(test1, test2);  // true

If you believe that the default collator is incorrect, you may wish to report a bug. (There have been similar bugs before.)

Update: I checked this against a printed Danish-language encyclopedia. There are indeed words that begin with 'ae' (primarily words from foreign languages; "aerobics", for example) which are not sorted with (and are therefore not equal to) words beginning with 'æ'. So although I see why you would want to treat them as equal in many circumstances, strictly speaking they are not.
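For the compatibility characters and ligatures mentioned in the question, a locale-independent alternative is Unicode compatibility normalization. Below is a minimal sketch using java.text.Normalizer; the class CompatEquality and its method names are hypothetical, and lowercasing with Locale.ROOT is only an approximation of full Unicode case folding. Note that æ (U+00E6) has no Unicode decomposition, so normalization alone leaves it distinct from "ae", which matches the encyclopedia finding above.

```java
import java.text.Normalizer;
import java.util.Locale;

class CompatEquality {
    // Hypothetical helper: NFKC maps compatibility characters (ligatures,
    // fullwidth forms, ...) to their plain equivalents; then ignore case.
    static String compatKey(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }

    static boolean equalsIgnoringCompat(String a, String b) {
        return compatKey(a).equals(compatKey(b));
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, a genuine compatibility character.
        System.out.println(equalsIgnoringCompat("\uFB01le", "FILE"));              // prints "true"
        // æ is a letter in its own right and is not decomposed by NFKC.
        System.out.println(equalsIgnoringCompat("Gr\u00E6kenland", "Graekenland")); // prints "false"
    }
}
```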


隔纱相望 2024-12-27 03:36:13


One way to get the rules for a specific locale is to use the getRules function. However, on Android this function returns an empty string.

    // Needs java.text.Collator, java.text.RuleBasedCollator, java.util.Locale and java.io.*
    RuleBasedCollator collTemp = (RuleBasedCollator) Collator
            .getInstance(Locale.US);
    String usRules = collTemp.getRules();

    // Save the rules in a file; try-with-resources closes the writer even on failure
    String rulesPath = "C:\\projects\\droid\\rules.txt";
    try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(rulesPath), "UTF-16"))) {
        out.write(usRules);
    }

These rules are the same ones used by the compare function:

if (collTemp.compare(target, str) < 0)

Note: I tried to pass the rule string from my JDK desktop app to the Android RuleBasedCollator constructor, but I get U_INVALID_FORMAT_ERROR (on Android only). So I am still trying to figure out how to get the US rules on Android.
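A slightly more defensive sketch of the rule extraction above, assuming a desktop JDK. The class RulesDump and the method rulesOf are hypothetical names; the sketch avoids the unchecked cast and reports an empty result (as described for Android) instead of writing an empty file.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Optional;

class RulesDump {
    // Hypothetical helper: returns the rule string if the platform exposes one.
    static Optional<String> rulesOf(Collator collator) {
        if (collator instanceof RuleBasedCollator) {
            String rules = ((RuleBasedCollator) collator).getRules();
            // Some platforms return an empty string here; treat that as "no rules".
            return rules.isEmpty() ? Optional.empty() : Optional.of(rules);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> rules = rulesOf(Collator.getInstance(Locale.US));
        System.out.println(rules
                .map(r -> "rule string length: " + r.length())
                .orElse("no rules exposed on this platform"));
    }
}
```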
