Does Java support non-BMP Unicode characters (i.e. code points > 0xFFFF) in its regular expression library?
I'm currently using Java 6 (I don't have the option of moving to Java 7) and I'm trying to use the java.util.regex package to do pattern matching of strings that contain Unicode characters.
I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching with these characters. java.util.regex.Pattern still only allows hexadecimal escapes with 4 digits (e.g. \uFFFF).
Does anyone know if I'm missing an API here?
I've never done pattern matching with supplementary characters, but I think it's as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex is clever enough to interpret those two numbers (Java chars) as a single character in patterns and strings (though Java will still see them as two chars, as elements of the string).
Two links:
Java Unicode encoding
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
From the last link (referring to Java 5):
Note also that, if you are using UTF8 as your encoding (for your source files), you can also write them directly (see section "Representing Supplementary Characters in Source Files" in the last link).
For example:
This, compiled with Java 6, prints
which agrees with the above. In the first case, we have a single code point, represented as a pair of surrogate Java chars (two 16-bit chars, one supplementary Unicode character), and the {2} quantifier applies to the pair (= one code point). In the second, we have two distinct BMP characters, and the quantifier applies only to the last one; hence, no match. Notice, however, that the string length is the same in both cases, because Java measures string length in Java chars (UTF-16 code units), not in Unicode code points.
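The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code the answer was written against: the class name, the choice of U+20000 (surrogate pair D840 DC00) as the supplementary character, and the use of '@' and 'A' as the BMP pair are all assumptions made for the example.

```java
public class SurrogatePairQuantifierDemo {
    // One supplementary code point (U+20000, chosen for illustration),
    // written as a surrogate pair and followed by the {2} quantifier.
    static final String SUPPLEMENTARY_PATTERN = ".*\uD840\uDC00{2}.*";
    static final String SUPPLEMENTARY_INPUT = "HI \uD840\uDC00\uD840\uDC00 BYE";

    // Two unrelated BMP characters ('@' and 'A'); here {2} binds only to 'A'.
    static final String BMP_PATTERN = ".*\u0040\u0041{2}.*";
    static final String BMP_INPUT = "HI \u0040\u0041\u0040\u0041 BYE";

    // Describes whether the input matches, plus its length in chars and in code points.
    static String describe(String input, String pattern) {
        return input.matches(pattern)
                + " length=" + input.length()
                + " codePoints=" + input.codePointCount(0, input.length());
    }

    public static void main(String[] args) {
        // Expected: true length=11 codePoints=9
        System.out.println(describe(SUPPLEMENTARY_INPUT, SUPPLEMENTARY_PATTERN));
        // Expected: false length=11 codePoints=11
        System.out.println(describe(BMP_INPUT, BMP_PATTERN));
    }
}
```

Both strings report the same length() because that method counts UTF-16 code units; codePointCount is the call that counts code points, which is why only the first string's two numbers differ.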
The easiest solution is to use a UTF-8 encoding for your source code. Then just put the characters in directly. You should never ever ever have to specify separate code units in any program.
There is still an issue with character classes, however, because Java’s lamely exposed UTF-16 internal encoding messes them up. You can’t use full Unicode until JDK7, and even then you have to specify logical code points using the indirect \x{HHHHH} notation. You still won’t be able to put a literal code point in a charclass, but you can dodge that with \x{H..H}.
Imperfect, but it’s a lot better than it was. UTF-16 is always a compromise. Systems that use UTF-8 or UTF-32 internally don’t have these restrictions. They also never make you specify code units that aren’t identical to code points.
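As a rough sketch of the \x{H..H} notation mentioned above, the following uses it inside a character class. It needs Java 7 or later, so it does not help on Java 6. The class name, the helper fromCodePoints, and the range U+20000 to U+2A6DF are assumptions chosen for illustration only.

```java
import java.util.regex.Pattern;

public class CodePointClassDemo {
    // Java 7+ only: \x{h...h} names a code point directly, here as the
    // endpoints of a character-class range (an illustrative range).
    static final Pattern SUPPLEMENTARY_RANGE =
            Pattern.compile("[\\x{20000}-\\x{2A6DF}]+");

    // Hypothetical helper: builds a String from code points, so that no
    // surrogate code units have to be written out by hand.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String supplementary = fromCodePoints(0x20000, 0x20001);
        // Both code points fall inside the range, so the whole string matches.
        System.out.println(SUPPLEMENTARY_RANGE.matcher(supplementary).matches()); // true
        // Plain BMP letters are outside the range.
        System.out.println(SUPPLEMENTARY_RANGE.matcher("AB").matches()); // false
    }
}
```

StringBuilder.appendCodePoint has been available since Java 5, so the helper itself also works on Java 6; only the \x{...} pattern syntax requires Java 7.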