XmlDocument.Load fails, LoadXml works:

Posted on 2024-12-05 16:46:38

In answering this question, I came across a situation that I don't understand. The OP was trying to load XML from the following location: http://www.google.com/ig/api?weather=12414&hl=it

The obvious solution is:

string m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
XmlDocument myXmlDocument = new XmlDocument();
myXmlDocument.Load(m_strFilePath); //Load NOT LoadXml

However, this fails with:

XmlException : Invalid character in the given encoding. Line 1, position 499.

It seems to be choking on the à of Umidità.

OTOH, the following works fine:

var m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
string xmlStr;
using(var wc = new WebClient())
{
    xmlStr = wc.DownloadString(m_strFilePath);
}
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlStr);

I'm baffled by this. Can anyone explain why the former fails, but the latter works fine?

Notably, the XML declaration of the document omits an encoding.

娜些时光，永不杰束 2024-12-12 16:46:38

WebClient uses the encoding information in the HTTP response headers to determine the correct encoding (in this case ISO-8859-1, a single-byte encoding that extends ASCII, so every character occupies exactly 8 bits).
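To make the header-based lookup concrete, here is a minimal sketch of picking an encoding from a Content-Type value. It is not WebClient's actual implementation; the class name, the method name, and the sample header string are assumptions for illustration only.

```csharp
using System.Net.Mime;
using System.Text;

public static class CharsetSketch
{
    // Hypothetical helper: returns the encoding named by the charset
    // parameter of a Content-Type header value, or the supplied fallback
    // when the header is empty or carries no charset parameter.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }

        // ContentType parses values such as "text/xml; charset=ISO-8859-1".
        string charset = new ContentType(contentType).CharSet;

        return string.IsNullOrEmpty(charset)
            ? fallback
            : Encoding.GetEncoding(charset);
    }
}
```

With an assumed header of "text/xml; charset=ISO-8859-1" this yields the ISO-8859-1 encoding; with a bare "text/xml" it yields whatever fallback the caller passed in.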

It looks like XmlDocument.Load doesn't use this information, and because the encoding is also missing from the XML declaration, it has to guess at an encoding and gets it wrong. Some digging around leads me to believe that it chooses UTF-8.
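If that diagnosis is right, one way around it is to decode the bytes yourself and hand XmlDocument a TextReader, so it never has to guess. The following is a sketch under that assumption; the class and method names are made up for illustration, and the caller is expected to know the right encoding (for example from the response headers).

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlEncodingSketch
{
    // Hypothetical helper: decodes the stream with the encoding chosen by
    // the caller and loads the resulting characters into an XmlDocument.
    public static XmlDocument LoadWithEncoding(Stream stream, Encoding encoding)
    {
        var doc = new XmlDocument();

        // StreamReader performs the byte-to-character decoding, so
        // XmlDocument.Load(TextReader) receives text, not raw bytes.
        using (var reader = new StreamReader(stream, encoding))
        {
            doc.Load(reader);
        }

        return doc;
    }
}
```

For the URL in the question, the stream could come from WebClient.OpenRead and the encoding from Encoding.GetEncoding("ISO-8859-1"). This is essentially what the DownloadString plus LoadXml version does in two steps.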

If we want to get really technical, the character it chokes on is "à", which is the single byte 0xE0 in the ISO-8859-1 encoding. On its own, that byte is not a valid character in UTF-8. Its binary representation is:

11100000

If you dig around in the UTF-8 Wikipedia article, you can see that a byte starting with 1110 marks the start of a code point (i.e. character) encoded in a total of 3 bytes, in the following format:

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
1110xxxx    10xxxxxx    10xxxxxx

But if we look back at the document, the next two characters are ":" and a space, which are 0x3A and 0x20 in ISO-8859-1. This means what we actually end up with is:

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
11100000    00111010    00100000

Neither the 2nd nor the 3rd byte of the sequence has 10 as its two most significant bits (which would mark a continuation byte), so this sequence makes no sense in UTF-8.
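The byte-level argument can be checked with a short sketch. The class and method names here are illustrative, not part of any library; the only framework pieces used are UTF8Encoding with throwOnInvalidBytes enabled and the ISO-8859-1 encoding.

```csharp
using System.Text;

public static class Utf8Check
{
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Returns true when the bytes form well-formed UTF-8. The strict
    // decoder throws on malformed input instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes the same bytes as ISO-8859-1, where every byte is one character.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

Feeding it the three bytes from the table, 0xE0 0x3A 0x20, IsValidUtf8 reports false because 0x3A and 0x20 are not continuation bytes, while DecodeLatin1 returns "à: ". The UTF-8 encoding of "à" is the two bytes 0xC3 0xA0, which the strict decoder accepts.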
