XmlDocument mis-reads UTF-8 "e-acute" character
I'm reading an XML document that contains the é (e acute) character. The document has been saved as UTF-8, and I have confirmed with a binary file reader that the character is UTF-8 (it is c3+a9). However, after processing, the character becomes a three-byte jumble (c3+83+c2).
My guess is that .NET has tried to convert the character(s) to UTF-16 (this is my best guess) or has split the character into one one-byte character and one double-byte UTF-8 character.
I'm loading the document like this:
XmlDocument document = new XmlDocument();
document.Load("z:\\source.xml");
How should I be loading this? Should I be reading this through a UTF-8-encoded stream?
[Edit]
I forgot to mention the document I'm loading is declaring itself as UTF-8.
<?xml version="1.0" encoding="utf-8"?>
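One way to narrow this down is to compare the plain Load call with a load through an explicit UTF-8 reader, and to look at the resulting characters rather than at bytes. The following is a minimal sketch, not the code from the question: the temp file, the <root> element, and the names LoadCheck, WriteSample and LoadText are all made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class LoadCheck
{
    // Writes a small UTF-8 file (no BOM) as a hypothetical stand-in for z:\source.xml.
    public static string WriteSample()
    {
        string path = Path.Combine(Path.GetTempPath(), "source-sketch.xml");
        string xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root>\u00E9</root>";
        File.WriteAllBytes(path, new UTF8Encoding(false).GetBytes(xml));
        return path;
    }

    // Loads the file either by path (the parser detects the encoding itself)
    // or through a StreamReader that decodes the bytes as UTF-8 up front.
    public static string LoadText(string path, bool explicitUtf8)
    {
        var document = new XmlDocument();
        if (explicitUtf8)
        {
            using (var reader = new StreamReader(path, Encoding.UTF8))
            {
                document.Load(reader);
            }
        }
        else
        {
            document.Load(path);
        }
        return document.DocumentElement.InnerText;
    }

    static void Main()
    {
        string path = WriteSample();
        foreach (bool explicitUtf8 in new[] { false, true })
        {
            // Report the character count and the first code point, not raw bytes.
            string text = LoadText(path, explicitUtf8);
            Console.WriteLine("explicit={0} length={1} first=U+{2:X4}",
                explicitUtf8, text.Length, (int)text[0]);
        }
    }
}
```

If both lines report a length of 1 and U+00E9, the document was decoded correctly either way, and the jumble is more likely introduced later, for example when the text is written back out.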
Comments (1)
é is encoded in UTF-8 as C3 A9. Those two bytes are interpreted in the Windows-1252 codepage (a.k.a. the ANSI codepage, or Encoding.Default in .NET Framework) as Ã©. Re-encoding those two characters in UTF-8 gives C3 83 C2 A9, which matches the first three bytes of your "three-byte jumble". It appears that some code somewhere is performing a Windows-1252 bytes -> System.String chars -> UTF-8 bytes conversion.
I've never seen .NET use the wrong encoding when it's explicitly specified in the XML declaration (XmlDocument.Load should "just work"), so I would suspect that there is a bug in your code.
How are you determining that it's loading incorrectly? Once it's loaded in .NET, you would see strings, not bytes, so it seems odd to me that you're reporting an incorrect byte sequence, not an incorrect sequence of characters.
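The byte arithmetic above can be reproduced in a few lines. This is only a sketch of the suspected round trip, not the asker's code. It uses ISO-8859-1 as a stand-in for Windows-1252, an assumption made because ISO-8859-1 is available without registering extra code pages and maps the bytes C3 and A9 to the same characters; the names MojibakeSketch and Reencode are hypothetical.

```csharp
using System;
using System.Text;

class MojibakeSketch
{
    // Decodes UTF-8 bytes with a single-byte code page, then re-encodes the
    // resulting string as UTF-8: the suspected bytes -> chars -> bytes mistake.
    public static byte[] Reencode(byte[] utf8Bytes)
    {
        // Assumption: ISO-8859-1 stands in for Windows-1252; both map
        // 0xC3 to U+00C3 and 0xA9 to U+00A9.
        Encoding singleByte = Encoding.GetEncoding("iso-8859-1");
        string misread = singleByte.GetString(utf8Bytes); // two chars instead of one
        return Encoding.UTF8.GetBytes(misread);
    }

    static void Main()
    {
        byte[] original = { 0xC3, 0xA9 }; // é encoded as UTF-8
        Console.WriteLine(BitConverter.ToString(original));           // C3-A9
        Console.WriteLine(BitConverter.ToString(Reencode(original))); // C3-83-C2-A9
    }
}
```

The second line shows four bytes, the first three of which are the c3, 83, c2 reported in the question.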