Trouble understanding a regular expression for the valid XML character set
As the W3C describes, the set of valid characters in XML is limited.
Invalid characters can be recognized with the following regular expression:
/*
* From xml spec valid chars:
* #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
*/
Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-x10FFFF]");
But I don't understand why the expression isn't the following, with a backslash before the last x10FFFF:
Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-\\x10FFFF]");
The error message is:
java.util.regex.PatternSyntaxException: Illegal character range near index 49
[^\x09\x0A\x0D\x20-\xD7EF\xE000-\xFFFD\x10000-\x10FFFF]
Comments (1)
Simple answer: not every Unicode code point can be expressed as a char in Java. A code point is identified by a 21-bit number, but a char is only 16 bits wide. Code points from U+10000 upward are therefore encoded as two chars: a high surrogate followed by a low surrogate. Strings and regular expressions work on chars, not on code points, so you have to translate them yourself.
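To make the point above concrete, here is a minimal sketch of one way to write the pattern so that it compiles. It relies on two details from the java.util.regex.Pattern documentation: the \xhh escape takes exactly two hex digits, so \x10000 is read as \x10 followed by the literal characters 000 (which would explain the illegal range reported near index 49), and Java 7 or later accepts \x{h...h} for a full code point, so the surrogate translation does not have to be done by hand. The class name XmlChars and the method stripInvalid are made up for this example, and the upper bound D7FF follows the spec comment in the question rather than the D7EF in the posted code.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class XmlChars {

    // Matches any code point outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // BMP values use the four-digit u escape; the supplementary range uses
    // the braced x escape, which needs Java 7 or later.
    static final Pattern INVALID_XML_CHAR = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes every character that the pattern above flags as invalid.
    static String stripInvalid(String s) {
        return INVALID_XML_CHAR.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so it is stored as two chars.
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println("chars: " + smiley.length());
        System.out.println("code points: " + smiley.codePointCount(0, smiley.length()));

        // A control character is dropped, the supplementary character is kept.
        String dirty = "a" + (char) 0x01 + "b" + smiley;
        System.out.println(stripInvalid(dirty).equals("ab" + smiley));
    }
}
```

With this pattern a valid surrogate pair is treated as one code point and kept, while a lone surrogate falls outside every listed range and is removed.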