Java, JavaCC: How to parse characters outside the BMP?
I am referring to the XML 1.1 spec.
Look at the definition of NameStartChar:
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
If I interpret this correctly, the last range (#x10000-#xEFFFF) goes beyond the UTF-16 range of Java's char type. So it must be UTF-32, right? So, I need to check pairs of chars against this range, instead of single chars, right?
My questions are:
- How do I check for such character ranges using standard Java methods?
- How is it possible to define such ranges in JavaCC?
  - JavaCC complains about \u10000 and \uEFFFF
Thank you!
NOTE: Don't worry, I am not trying to write my own XML parser.
EDIT: I am writing a parser that checks whether text input from miscellaneous (non-XML) text formats matches valid XML names.
Comments (2)
Have a look at Character.toCodePoint(char, char), which will convert a surrogate pair into a full-range code point. String.codePointAt may well be useful to you, too. There's a lot of other surrogate support within Character and String. To know exactly which methods to call, we'd need to know the exact details of your situation.
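As a rough illustration of the code-point approach, here is a minimal sketch that checks the first code point of a string against the NameStartChar ranges quoted in the question. The class and method names (XmlNameStart, isNameStartChar, startsWithNameStartChar) are made up for this example; only String.codePointAt, Character.toChars, Character.toCodePoint, Character.highSurrogate and Character.lowSurrogate are standard Java. It assumes the input is already a Java String, so supplementary characters arrive as surrogate pairs.

```java
final class XmlNameStart {

    // Inclusive code point ranges copied from the NameStartChar production
    // quoted in the question (XML 1.1).
    private static final int[][] RANGES = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Hypothetical helper: true if the code point falls in any range above.
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: codePointAt(0) combines a leading surrogate pair
    // into one code point, so no manual pairing of chars is needed here.
    static boolean startsWithNameStartChar(String s) {
        return !s.isEmpty() && isNameStartChar(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // U+10000 is stored as two chars (a surrogate pair) in a String.
        String supplementary = new String(Character.toChars(0x10000)) + "abc";
        System.out.println(supplementary.length());
        System.out.println(startsWithNameStartChar(supplementary));

        // Combining a surrogate pair by hand gives the same code point.
        System.out.println(
            Integer.toHexString(Character.toCodePoint('\uD800', '\uDC00')));

        // Surrogates of the upper bound U+EFFFF, useful when a tool can only
        // match individual char values.
        System.out.println(
            Integer.toHexString(Character.highSurrogate(0xEFFFF)) + " "
            + Integer.toHexString(Character.lowSurrogate(0xEFFFF)));
    }
}
```

On the JavaCC side, the complaint is probably because a backslash-u escape takes exactly four hex digits, so five-digit values cannot be written that way. One possible workaround, untested here, is to express the last range as a high surrogate in D800-DB7F followed by a low surrogate in DC00-DFFF, which are the bounds the last two calls above compute for U+10000 through U+EFFFF.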
I've found http://www.fileformat.info/info/unicode/char/10000/index.htm to be a handy site for learning about Unicode characters.
For example, u+10000 and u+10FFFF are