XmlDocument misreads the UTF-8 "e-acute" character

Posted on 2024-12-21 00:38:22

I'm reading an XML document that contains the é (e acute) character. The document has been saved as UTF-8 and I have confirmed that the character is UTF-8 with a binary file reader (it is c3+a9). However, after processing, the character becomes a three-byte jumble (c3+83+c2).

My guess is that .NET has tried to convert the character(s) to UTF-16 (this is my best guess) or has split the character into one one-byte character and one double-byte UTF-8 character.

I'm loading the document like this:

// Requires: using System.Xml;
XmlDocument document = new XmlDocument();
document.Load("z:\\source.xml");

How should I be loading this? Should I be reading this through a UTF-8-encoded stream?


[Edit]

I forgot to mention the document I'm loading is declaring itself as UTF-8.

<?xml version="1.0" encoding="utf-8"?>


Comments (1)

七月上 2024-12-28 00:38:22

é is encoded in UTF-8 as C3 A9. Interpreted in the Windows-1252 code page (the ANSI code page on Western-locale Windows, which is what Encoding.Default returns on .NET Framework), those two bytes read as the two characters Ã©. Re-encoding those two characters in UTF-8 gives C3 83 C2 A9, which matches the first three bytes of your "three-byte jumble". It appears that some code somewhere is performing a Windows-1252 bytes -> System.String chars -> UTF-8 bytes conversion.
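
To make the byte arithmetic above concrete, here is a minimal sketch of the suspected double-encoding. It is an illustration, not your actual code path. One assumption to flag: it decodes with ISO-8859-1 rather than Windows-1252, because ISO-8859-1 is available on every .NET runtime without registering extra code pages, and the two code pages agree on both bytes involved (C3 and A9). It uses top-level statements, so it assumes C# 9 or later; on older compilers, wrap it in a Main method.

```csharp
using System;
using System.Text;

// "é" (U+00E9) encoded as UTF-8 is the two bytes C3 A9.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00E9");
Console.WriteLine(BitConverter.ToString(utf8Bytes));      // C3-A9

// The wrong step: decode those bytes with a single-byte code page.
// ISO-8859-1 stands in for Windows-1252 here (assumption, see above).
Encoding singleByte = Encoding.GetEncoding("iso-8859-1");
string misread = singleByte.GetString(utf8Bytes);         // two chars: U+00C3, U+00A9

// Re-encoding the misread string as UTF-8 turns two bytes into four.
byte[] doubleEncoded = Encoding.UTF8.GetBytes(misread);
Console.WriteLine(BitConverter.ToString(doubleEncoded));  // C3-83-C2-A9
```

If the bytes you are seeing start with C3 83 C2, the thing to look for is a place where bytes are turned into a string, or a string into bytes, without UTF-8 being specified.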

I've never seen .NET use the wrong encoding when it's explicitly specified in the XML declaration (XmlDocument.Load should "just work"), so I would suspect that there is a bug in your code.

How are you determining that it's loading incorrectly? Once it's loaded in .NET, you would see strings, not bytes, so it seems odd to me that you're reporting an incorrect byte sequence, not an incorrect sequence of characters.
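
One way to check what was actually loaded is to look at the characters rather than at bytes. The following is a rough sketch under a few assumptions: the temp file name and the root element are made up for the test, and the file stands in for your z:\source.xml. It writes a file that is known to contain C3 A9, loads it with XmlDocument.Load, and prints the code point of each character in the text. Like the sketch above, it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical temp file standing in for z:\source.xml.
string path = Path.Combine(Path.GetTempPath(), "e-acute-check.xml");

// Write the raw bytes by hand so the file is known to contain C3 A9.
byte[] prefix = Encoding.ASCII.GetBytes("<?xml version=\"1.0\" encoding=\"utf-8\"?><root>");
byte[] eAcute = { 0xC3, 0xA9 };
byte[] suffix = Encoding.ASCII.GetBytes("</root>");
using (FileStream fs = File.Create(path))
{
    fs.Write(prefix, 0, prefix.Length);
    fs.Write(eAcute, 0, eAcute.Length);
    fs.Write(suffix, 0, suffix.Length);
}

XmlDocument document = new XmlDocument();
document.Load(path);

// Inspect characters, not bytes: a correct load yields one char, U+00E9.
string text = document.DocumentElement.InnerText;
foreach (char c in text)
{
    Console.WriteLine("U+{0:X4}", (int)c);
}
```

A correct load prints a single line, U+00E9. If the same check on your real document also shows U+00E9, the load is fine and the corruption is happening later, most likely wherever the text is written back out or displayed.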
