Strange encoding in Atom feed XML
I am creating an OPDS reader application in C# on WP7, and I have run into strange behavior (at least strange to me). OPDS is Atom-based XML, and I am using the RestSharp library, which provides an XML deserializer. Most feeds are downloaded and parsed correctly, but parsing certain feeds throws exceptions.
I investigated a little why the exception occurs and found this:
The exception occurs on these pages (for example):
When I copied the source of those XML documents, pasted it into Notepad++ and applied the Tidy: reindent XML function to the pasted code, Notepad++/Tidy reported some errors. When I looked at where the errors occurred, it was typically at accented characters.
To be concrete: on the first link, there is an error at line 161, column 26, which is the word What’s, specifically the apostrophe-like character.
When I looked at what is really downloaded (through Wireshark), there are three bytes between the characters 't' and 's'. The values of those bytes in hex are {e2,80,99}. None of them looks like an apostrophe character.
I bet this is the cause of the parsing problem, but I don't really get it.
What conversion is my browser (Opera) doing?
- Opera shows the feed 'ok',
- it even shows the source 'ok',
- but it copies the original bytes,
- and Notepad++'s Tidy "crashes" on them.
Can someone clear this up for me? Maybe I am missing something basic about encoding...?
(The question is not specifically about a WP7 solution, but about character encoding in general.)
Comments (2)
e28099 is the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019), encoded in UTF-8. Nothing fancy, quite straightforward. I'm not familiar with the tools you're using, but make sure that nothing intervenes between reception of the raw byte stream and the XML parser/deserializer. Any decent parser should be able to cope with these feeds and the encoding they use; I would look more carefully at the setup of your tool chain.
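To make that concrete, here is a minimal sketch in Python (not the asker's C#/RestSharp stack, since the question is about encoding in general). It decodes the three bytes seen in Wireshark as UTF-8, shows what they turn into when decoded with a single-byte code page instead, and then hands raw bytes straight to an XML parser. The tiny feed is a made-up stand-in, not one of the real OPDS feeds, and the wrong-code-page step is only one plausible way such text can get mangled, not a confirmed diagnosis of the RestSharp exception.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The three bytes observed between 't' and 's'.
raw = bytes([0xE2, 0x80, 0x99])

# Decoded as UTF-8, the three bytes form one character.
ch = raw.decode("utf-8")
print(len(ch), hex(ord(ch)), unicodedata.name(ch))

# Decoded as Windows-1252 (assumed here purely for illustration),
# every byte becomes a separate, unrelated character.
wrong = raw.decode("cp1252")
print(len(wrong), repr(wrong))

# Hypothetical miniature Atom document containing the same bytes.
feed_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<title>What\xe2\x80\x99s new</title>"
    b"</feed>"
)

# Passing the raw bytes lets the parser apply the declared encoding itself,
# so nothing re-decodes the stream before the parser sees it.
root = ET.fromstring(feed_bytes)
title = root.find("{http://www.w3.org/2005/Atom}title").text
print(title)
```

A browser presumably does the first kind of decoding, which would explain why Opera displays the apostrophe-like character correctly. Text copied through the clipboard into a tool that assumes a different encoding can end up looking like the second case.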
Might your problem be related to these not being valid Atom feeds?
The W3C feed validator reports this one as invalid:
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fpragprog.com%2Fmagazines.opds