Does Java support non-BMP Unicode characters (i.e. code points > 0xFFFF) in its regex library?

Posted 2024-10-25 09:23:55

I'm currently using Java 6 (I don't have the option of moving to Java 7) and I'm trying to use the java.util.regex package to do pattern matching of strings that contain Unicode characters.

I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching with these characters. java.util.regex.Pattern still only allows hexadecimal escapes with 4 digits (e.g. \uFFFF).

Does anyone know if I'm missing an API here?

Comments (2)

沩ん囻菔务 2024-11-01 09:23:55

I've never done pattern matching with supplementary characters, but I think it's as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair), \unnnn\ummmm. java.util.regex should be clever enough to interpret those two numbers (Java chars) as a single character in patterns and strings (though Java will still see them as two chars, as elements of the string).
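If you start from a numeric code point rather than a hand-computed surrogate pair, something like the following might work. It is only a sketch: the class and method names (SupplementaryPatternSketch, literal, containsTwice) are made up for illustration, and it relies only on Character.toChars and Pattern.quote, both of which exist since Java 5.

```java
import java.util.regex.Pattern;

class SupplementaryPatternSketch {
    // Hypothetical helper: returns a quoted regex fragment matching exactly this code point.
    static String literal(int codePoint) {
        // Character.toChars yields one char for a BMP code point,
        // or a surrogate pair (two chars) for a supplementary one.
        return Pattern.quote(new String(Character.toChars(codePoint)));
    }

    // Hypothetical helper: true if the code point occurs twice in a row somewhere in input.
    static boolean containsTwice(String input, int codePoint) {
        // The non-capturing group makes {2} apply to the whole code point.
        Pattern p = Pattern.compile(".*(?:" + literal(codePoint) + "){2}.*");
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        int cp = 0x20000; // U+20000, i.e. the pair \uD840\uDC00
        String one = new String(Character.toChars(cp));
        System.out.println(containsTwice("HI " + one + one + " BYE", cp));
    }
}
```

The explicit group is a defensive choice: the quantifier covers the whole pair regardless of how the engine parses the literal.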

Two links:

Java Unicode encoding

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

From the last link (referring to Java 5):

"The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units."

Note also that, if you are using UTF8 as your encoding (for your source files), you can also write them directly (see section "Representing Supplementary Characters in Source Files" in the last link).

For example:

    // U+20000 written as a surrogate pair; {2} applies to the whole pair
    String pat1 = ".*\uD840\uDC00{2}.*";
    String s1  = "HI \uD840\uDC00\uD840\uDC00 BYE";
    System.out.println(s1.matches(pat1) + " len=" + s1.length());

    // Two separate BMP characters ('@' and 'A'); {2} applies only to 'A'
    String pat2 = ".*\u0040\u0041{2}.*";
    String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
    System.out.println(s2.matches(pat2) + " len=" + s2.length());

This, compiled with Java 6, prints

true len=11
false len=11

which agrees with the above. In the first case, we have a single code point, represented as a pair of surrogate Java chars (two 16-bit chars, one supplementary Unicode character), and the {2} quantifier applies to the pair (= code point). In the second, we have two distinct BMP characters, and the quantifier applies only to the last one; hence, no match.

Notice, however, that the string length is the same (because Java measures the string length counting Java characters, not Unicode code points).
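If you need the number of code points rather than the number of Java chars, String.codePointCount (available since Java 5) should give it. A minimal sketch, with a made-up class name and helper names:

```java
class CodePointLengthSketch {
    // Number of UTF-16 code units (Java chars), which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "HI \uD840\uDC00\uD840\uDC00 BYE"; // two supplementary characters
        String s2 = "HI \u0040\u0041\u0040\u0041 BYE"; // four BMP characters
        System.out.println(codeUnits(s1) + " chars, " + codePoints(s1) + " code points"); // 11 chars, 9 code points
        System.out.println(codeUnits(s2) + " chars, " + codePoints(s2) + " code points"); // 11 chars, 11 code points
    }
}
```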

拍不死你 2024-11-01 09:23:55

The easiest solution is to use a UTF-8 encoding for your source code. Then just put the characters in directly. You should never ever ever have to specify separate code units in any program.

There is still an issue with character classes, however, because Java’s lamely exposed UTF-16 internal encoding messes them up. You can’t use full Unicode until JDK7, where even then you will have to specify logical code points using an indirect \x{HHHHH} notation. You still won’t be able to have any literal code point in a charclass, but you can dodge it with \x{H..H}.

Imperfect, but it’s a lot better than it was. UTF-16 is always a compromise. Systems that use UTF-8 or UTF-32 internally don’t have these restrictions. They also never make you specify code units that aren’t identical to code points.
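For reference, here is roughly what the \x{H..H} notation mentioned above looks like inside a character class. This needs Java 7 or later, so it does not help on Java 6. The class name, the helper and the choice of range (CJK Unified Ideographs Extension B, U+20000 to U+2A6DF) are only illustrative assumptions.

```java
import java.util.regex.Pattern;

class SupplementaryCharClassSketch {
    // Java 7+ only: \x{h...h} names a code point directly, so a class range
    // can span supplementary characters without spelling out surrogates.
    static final Pattern CJK_EXT_B = Pattern.compile("[\\x{20000}-\\x{2A6DF}]+");

    // Hypothetical helper: true if every code point of s lies in the range above.
    static boolean isAllExtB(String s) {
        return CJK_EXT_B.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllExtB("\uD840\uDC00\uD840\uDC01")); // U+20000 U+20001
        System.out.println(isAllExtB("A"));
    }
}
```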
