Python UTF-8 HTML decoding error
I'm trying to use urllib2 to download a web page and save it to a MySQL database, like this:
result_text = result.read()
result_text = result_text.decode('utf-8')
However, I get this error:
Data: 'utf8' codec can't decode byte 0x88
The HTML meta tag states that the encoding is indeed utf-8.
I've managed to get around this with the following line:
result_text = result_text.decode('utf-8','replace')
This replaces the bad characters. However, I'm not sure whether the error indicates that something is wrong with the downloaded data, or whether I'm discarding valuable characters by replacing them.
I should add that the page also contains JavaScript, although I don't believe that should be a problem.
Can anyone tell me why this is happening?
Thanks
Comments (1)
Analysis of your tiny data sample:
(1) That's certainly not UTF-8 with an occasional invalid sequence; over 50% of the Unicode characters are invalid. In other words, pressing ahead and using data.decode('utf8', 'replace') is NOT a good idea (based on this TINY sample).
(2) The characters \x01 (twice) and \x08 make me suspect that you have got binary data somehow.
(3) The (truncated) error message that you quoted in a comment mentioned 0x88, but there is no 0x88 in the sample data.
(4) Please edit your question to show what you should have done at the start: (a) the minimal code necessary to reproduce the problem, including the URL that you are accessing; (b) the full error message and traceback; (c) an assurance that you have copied/pasted (a) and (b) rather than typing from memory.
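To check points (1) and (2) on your own data before deciding whether 'replace' is acceptable, something along these lines may help. It is only a rough sketch, written for Python 3 and the standard library (the question uses Python 2's urllib2, so adapt as needed). The helper names replacement_ratio, first_decode_error and maybe_gunzip are made up for illustration. The gzip check is a guess on my part: a compressed response body is one common way to end up with binary data, but nothing in the question confirms that this is what happened here.

```python
import gzip


def replacement_ratio(data, encoding="utf-8"):
    """Return the fraction of decoded characters that came out as U+FFFD.

    A high ratio suggests the bytes are not really in `encoding`.
    Note: a genuine U+FFFD already present in the input is counted too.
    """
    text = data.decode(encoding, "replace")
    if not text:
        return 0.0
    return text.count("\ufffd") / len(text)


def first_decode_error(data, encoding="utf-8", context=8):
    """Return (offset, offending bytes, nearby bytes) for the first
    decoding failure, or None if `data` decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return exc.start, data[exc.start:exc.end], data[start:exc.end + context]
    return None


def maybe_gunzip(data):
    """Decompress `data` if it starts with the gzip magic bytes.

    Assumption: the body might be gzip-compressed; otherwise the
    bytes are returned unchanged.
    """
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data
```

Run these on the raw bytes from result.read(), before any decode call. If first_decode_error reports a failure and replacement_ratio is more than a few percent, the data is probably not UTF-8 text at all, and the offset and nearby bytes are exactly the kind of detail asked for in point (4).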