Does Java support non-BMP Unicode characters (i.e. code points > 0xFFFF) in its regular expression library?

Posted 2024-10-25 09:23:55

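I am currently using Java 6 (moving to Java 7 is not an option for me), and I am trying to use the java.util.regex package to do pattern matching on strings that contain Unicode characters.

I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching against these characters. java.util.regex.Pattern still only allows hexadecimal escapes with 4 digits (e.g. \uFFFF).

Does anyone know if I'm missing an API here?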

沩ん囻菔务 2024-11-01 09:23:55

I've never done pattern matching with supplementary characters, but I think it's as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex is clever enough to interpret those two numbers (Java chars) as a single character in patterns and strings (though Java will still see them as two chars, as elements of the string).

Two links:

Java Unicode encoding

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

From the last link (referring to Java 5):

The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units.

Note also that, if you are using UTF-8 as the encoding for your source files, you can also write the characters directly (see the section "Representing Supplementary Characters in Source Files" in the last link).

For example:

    String pat1 = ".*\uD840\uDC00{2}.*";
    String s1  = "HI \uD840\uDC00\uD840\uDC00 BYE";
    System.out.println(s1.matches(pat1) + " len=" + s1.length());

    String pat2 = ".*\u0040\u0041{2}.*";
    String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
    System.out.println(s2.matches(pat2) + " len=" + s2.length());

This, compiled with Java 6, prints

true len=11
false len=11

which agrees with the above. In the first case, we have a single code point, represented as a pair of surrogate Java chars (two 16-bit chars, one supplementary Unicode character), and the {2} quantifier applies to the pair (= code point). In the second, we have two distinct BMP characters and the quantifier applies only to the last one; hence, no match.

Notice, however, that the string length is the same (because Java measures the string length counting Java characters, not Unicode code points).
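
As a small sketch of the difference between char count and code point count, the following builds the same supplementary character from its code point number instead of writing the surrogate pair by hand. The class name SupplementaryDemo and the helpers fromCodePoint and containsTwice are made up for this illustration; Character.toChars and String.codePointCount are standard methods available since Java 5. The sketch assumes the code point is not a regex metacharacter.

```java
final class SupplementaryDemo {
    // Hypothetical helper: UTF-16 form of a code point
    // (one char for a BMP character, a surrogate pair otherwise).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Hypothetical helper: true if s contains the code point twice in a row.
    // Assumes the code point is not a regex metacharacter.
    static boolean containsTwice(String s, int codePoint) {
        return s.matches(".*" + fromCodePoint(codePoint) + "{2}.*");
    }

    public static void main(String[] args) {
        String ch = fromCodePoint(0x20000);   // the pair D840 DC00 used above
        String s = "HI " + ch + ch + " BYE";

        System.out.println(containsTwice(s, 0x20000));        // true
        System.out.println(s.length());                       // 11 (Java chars)
        System.out.println(s.codePointCount(0, s.length()));  // 9 (code points)
    }
}
```

Computing the pair with Character.toChars avoids working out the surrogate values manually, and codePointCount gives the length a reader would expect when supplementary characters are present.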

拍不死你 2024-11-01 09:23:55

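The simplest solution is to use UTF-8 encoding for your source code. Then just put the characters in directly. You should never have to specify separate code units in any program.

There remains a problem with character classes, however, because Java's exposed UTF-16 internal encoding messes them up. You can't use full Unicode until JDK 7, and even then you have to use the indirect \x{HHHHH} notation to specify logical code points. You still won't be able to put literal code points in a character class, but you can use \x{H..H} to get around that.

It's not perfect, but it's a lot better than it was. UTF-16 is always a compromise. Systems that use UTF-8 or UTF-32 internally have no such restrictions, and they never require you to specify code units as something distinct from code points.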
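
A minimal sketch of the \x{H..H} notation mentioned above, assuming Java 7 or later (it is not available on Java 6). The class name HexCodePointDemo, the field names, and the chosen range U+20000 to U+2A6DF are picked only for this illustration.

```java
import java.util.regex.Pattern;

final class HexCodePointDemo {
    // Requires Java 7+: \x{...} names a code point directly, no surrogates needed.
    static final Pattern TWICE = Pattern.compile(".*\\x{20000}{2}.*");

    // The same notation inside a character class, here as a range of code points.
    static final Pattern IN_RANGE = Pattern.compile("[\\x{20000}-\\x{2A6DF}]+");

    public static void main(String[] args) {
        String s = "HI \uD840\uDC00\uD840\uDC00 BYE";

        System.out.println(TWICE.matcher(s).matches());                            // true
        System.out.println(IN_RANGE.matcher("\uD840\uDC00\uD840\uDC01").matches()); // true
        System.out.println(IN_RANGE.matcher("A").matches());                       // false
    }
}
```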
