Is 2C10 a valid UTF-8 character?
I'm running some XML through a SAX parser and have noticed that the parser fails on certain characters in the data content.
The XML is supposed to be UTF-8 encoded, and the SAX parser is set to process that encoding.
After narrowing down the problematic strings and looking at the XML file in a hex editor, I can see, for example, that the bytes 2C10 cause a problem. If I change them to C2A2 (an example character given on Wikipedia), the SAX parser works. So is 2C10 not a valid UTF-8 character?
Comments (2)
U+2C10 is GLAGOLITIC CAPITAL LETTER NASHI. Here are its properties:

The UTF-8 page on Wikipedia implies that the bytes 2C 10 are to be interpreted as "," (a comma, byte 2C) followed by the control code DLE (data link escape, apparently; byte 10). A control character appearing outside of a CDATA block in XML would seem inappropriate!
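
To make the difference between the code point U+2C10 and the byte sequence 2C 10 concrete, here is a minimal Python sketch. The helper `is_xml10_char` is a hypothetical name, not part of any library; it is written on the assumption that the document is XML 1.0, whose `Char` production allows only tab, line feed and carriage return below U+0020. Under that production U+0010 is not permitted anywhere in a document, which would explain the parser error.

```python
import unicodedata


def is_xml10_char(ch: str) -> bool:
    """Hypothetical helper: True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The two bytes seen in the hex editor. Both are below 0x80, so they decode
# as two separate one-byte (ASCII-range) characters, not as one character.
raw = bytes.fromhex("2C10")
text = raw.decode("utf-8")
print([hex(ord(c)) for c in text])        # ['0x2c', '0x10']
print([is_xml10_char(c) for c in text])   # [True, False]

# The code point U+2C10 is a different thing: it needs three bytes in UTF-8.
nashi = "\u2c10"
print(unicodedata.name(nashi))            # GLAGOLITIC CAPITAL LETTER NASHI
print(nashi.encode("utf-8").hex())        # e2b090

# C2 A2 is a well-formed two-byte sequence for U+00A2 (the cent sign),
# which is an allowed XML character, so the parser accepts it.
cent = bytes.fromhex("C2A2").decode("utf-8")
print(hex(ord(cent)), is_xml10_char(cent))  # 0xa2 True
```

So the bytes 2C 10 are valid UTF-8, but they spell a comma plus DLE rather than the Glagolitic letter. If the letter was intended, the file would need to contain E2 B0 90 instead.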