XmlDocument.Load fails, LoadXml works:
In answering this question, I came across a situation that I don't understand. The OP was trying to load XML from the following location: http://www.google.com/ig/api?weather=12414&hl=it
The obvious solution is:
string m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
XmlDocument myXmlDocument = new XmlDocument();
myXmlDocument.Load(m_strFilePath); //Load NOT LoadXml
However, this fails with:
XmlException : Invalid character in the given encoding. Line 1, position 499.
It seems to be choking on the à of Umidità.
OTOH, the following works fine:
var m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
string xmlStr;
using (var wc = new WebClient())
{
    xmlStr = wc.DownloadString(m_strFilePath);
}
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlStr);
I'm baffled by this. Can anyone explain why the former fails, but the latter works fine?
Notably, the xml declaration of the document omits an encoding.
WebClient uses the encoding information in the headers of the HTTP response to determine the correct encoding (in this case ISO-8859-1, a single-byte superset of ASCII in which every character occupies exactly 8 bits).

It looks like XmlDocument.Load doesn't use this information, and as the encoding is also missing from the XML declaration, it has to guess at an encoding and gets it wrong. Some digging around leads me to believe that it chooses UTF-8.

If we want to get really technical, the character it chokes on is "à", which is the single byte 0xE0 in ISO-8859-1, but that byte is not a valid character on its own in UTF-8. Specifically, the binary representation of this byte is:

11100000

If you have a dig around in the UTF-8 Wikipedia article, you can see that this bit pattern marks the start of a code point (i.e. character) consisting of a total of 3 bytes in the following format:

1110xxxx 10xxxxxx 10xxxxxx

But if we look back at the document, the next two characters are ": ", which are 0x3A and 0x20 in ISO-8859-1. This means what we actually end up with is:

11100000 00111010 00100000

Neither the 2nd nor the 3rd byte of the sequence has 10 as its two most significant bits (which would indicate a continuation byte), so this sequence makes no sense in UTF-8.
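The byte-level argument can be checked offline with a small sketch. It is only an illustration under stated assumptions: the class EncodingDemo, its members, and the sample humidity element are hypothetical names made up for this example, and the XML bytes are built in memory instead of being fetched from the URL in the question. The first helper asks a strict UTF-8 decoder whether the bytes 0xE0 0x3A 0x20 are well formed. The second shows one possible workaround: handing XmlDocument.Load a TextReader that already decodes with an explicitly chosen encoding, so the parser does not have to guess.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper class, not part of the original question or answer.
public static class EncodingDemo
{
    // The bytes for "à: " as ISO-8859-1 encodes them.
    public static readonly byte[] Sample = { 0xE0, 0x3A, 0x20 };

    // Returns true only if the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes) = true makes the decoder
        // throw instead of silently substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Loads XML through a reader that decodes with the given encoding,
    // so the XML parser receives characters and skips encoding detection.
    public static XmlDocument LoadWithEncoding(byte[] xmlBytes, Encoding encoding)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = new StreamReader(stream, encoding))
        {
            doc.Load(reader);
        }
        return doc;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // 0xE0 announces a 3-byte sequence, but 0x3A and 0x20 are not
        // continuation bytes, so strict UTF-8 decoding rejects the input.
        Console.WriteLine(IsValidUtf8(Sample));

        // The same bytes decode cleanly as ISO-8859-1.
        Console.WriteLine(latin1.GetString(Sample));

        // Made-up sample document with no encoding in its declaration.
        byte[] xml = latin1.GetBytes("<humidity data=\"Umidità: 50%\"/>");
        XmlDocument doc = LoadWithEncoding(xml, latin1);
        Console.WriteLine(doc.DocumentElement.GetAttribute("data"));
    }
}
```

For a real HTTP response, the same idea would presumably apply to the response stream, with the encoding taken from the Content-Type header. That part is not shown here because it depends on the server still being available.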
The string Umidità, as the inner text of a node, must be wrapped in <![CDATA[Umidità]]>. This won't give any error in XmlDocument.Load.