XmlDocument misreads the UTF-8 "e-acute" character

Posted 2024-12-21 00:38:22

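I'm reading an XML document that contains the é (e-acute) character. The document has been saved as UTF-8, and I have confirmed with a binary file reader that the character is stored as UTF-8 (it is c3+a9). However, after processing, the character becomes a three-byte jumble (c3+83+c2).

My best guess is that .NET has either tried to convert the character to UTF-16 or split it into a one-byte character and a two-byte UTF-8 character.

I'm loading the document like this: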

XmlDocument document = new XmlDocument();
document.Load("z:\\source.xml");

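How should I be loading this? Should I be reading it through a UTF-8-encoded stream?

[Edit]

I forgot to mention that the document I'm loading declares itself as UTF-8.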

<?xml version="1.0" encoding="utf-8"?>



七月上 2024-12-28 00:38:22

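é is encoded in UTF-8 as C3 A9. Interpreted in the Windows-1252 code page (a.k.a. the ANSI code page, or Encoding.Default in .NET), those two bytes read as Ã©. Re-encoding those two characters as UTF-8 gives C3 83 C2 A9, whose first three bytes match your "three-byte jumble". It looks as though some code somewhere is performing a Windows-1252 bytes -> System.String chars -> UTF-8 bytes conversion.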
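
To make that byte arithmetic concrete, here is a minimal sketch of the suspected round trip. The class and method names are made up for illustration, and ISO-8859-1 is used as a stand-in for Windows-1252 because it is available without registering extra code pages; the two encodings agree on the bytes C3 and A9.

```csharp
using System;
using System.Text;

public static class DoubleEncodingDemo
{
    // Hypothetical helper: decodes UTF-8 bytes with the wrong single-byte
    // encoding, then re-encodes the resulting string as UTF-8.
    public static byte[] MisdecodeAndReencode(byte[] utf8Bytes)
    {
        // Assumption: ISO-8859-1 stands in for Windows-1252 here.
        // Both map 0xC3 to 'Ã' (U+00C3) and 0xA9 to '©' (U+00A9).
        Encoding singleByte = Encoding.GetEncoding("iso-8859-1");
        string misread = singleByte.GetString(utf8Bytes);
        return Encoding.UTF8.GetBytes(misread);
    }

    public static void Main()
    {
        byte[] original = Encoding.UTF8.GetBytes("\u00E9");
        byte[] mangled = MisdecodeAndReencode(original);

        Console.WriteLine(BitConverter.ToString(original)); // C3-A9
        Console.WriteLine(BitConverter.ToString(mangled));  // C3-83-C2-A9
    }
}
```

If the bytes you see in your output start with C3 83 C2, this kind of decode-with-the-wrong-encoding step is the likely culprit, wherever it happens in your pipeline.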

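I've never seen .NET use the wrong encoding when one is explicitly specified in the XML declaration (XmlDocument.Load should "just work"), so I suspect there is a bug in your own code.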
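
One way to check this for yourself is the sketch below. It writes a small UTF-8 file with an explicit declaration, loads it with XmlDocument.Load, and prints what ends up in memory. The file name, the root element, and the helper name are assumptions chosen for the example, not anything taken from your project.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class LoadCheck
{
    // Hypothetical helper: loads the file and returns the root element's text.
    public static string LoadRootText(string path)
    {
        var document = new XmlDocument();
        // Load(string) lets the XML reader pick the encoding from the
        // byte order mark or the XML declaration.
        document.Load(path);
        return document.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // Assumed sample file; written as raw UTF-8 bytes without a BOM.
        string path = Path.Combine(Path.GetTempPath(), "e-acute-sample.xml");
        string xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root>\u00E9</root>";
        File.WriteAllBytes(path, new UTF8Encoding(false).GetBytes(xml));

        string text = LoadRootText(path);
        Console.WriteLine(text.Length);                    // 1
        Console.WriteLine(((int)text[0]).ToString("X4"));  // 00E9
    }
}
```

A single char with value 00E9 means the load step decoded the file correctly, and the corruption has to be introduced later, most likely where the text is written back out or converted to bytes.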

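How are you determining that it has been loaded incorrectly? Once the document is loaded into .NET you are looking at strings, not bytes, so it seems odd to me that you are reporting a wrong byte sequence rather than a wrong character sequence.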
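
If it helps, a rough way to look at the character sequence instead of the bytes is to dump the UTF-16 code units of the string you get back. The helper name below is hypothetical and this is only a sketch of the idea.

```csharp
using System;
using System.Linq;

public static class StringInspector
{
    // Hypothetical helper: formats each UTF-16 code unit of a string as U+XXXX.
    public static string DumpCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // A correctly decoded e-acute is a single code unit.
        Console.WriteLine(DumpCodeUnits("\u00E9"));        // U+00E9

        // The mis-decoded form is two characters, 'Ã' followed by '©'.
        Console.WriteLine(DumpCodeUnits("\u00C3\u00A9"));  // U+00C3 U+00A9
    }
}
```

Seeing U+00E9 in the loaded string points at the output side of your code; seeing U+00C3 U+00A9 would mean the text was already decoded with the wrong encoding before or during loading.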