Python UTF-8 HTML decoding error

Posted on 2024-12-29 18:33:23


I'm trying to use urllib2 to download a web page and save it to a MySQL database, like this:

result_text = result.read()
result_text = result_text.decode('utf-8')

However, I get this error:

Data: 'utf8' codec can't decode byte 0x88

Now, the HTML meta tag states that the encoding is indeed utf-8.
I've managed to get around this with this line:

result_text = result_text.decode('utf-8','replace')

This replaces the bad characters. However, I'm not sure whether this indicates that something is wrong with the downloaded data, or whether I'm discarding valuable characters.
I should add that the page also contains JavaScript, although I don't believe that should be a problem.

Can anyone tell me why this is happening?
Thanks


子栖 2025-01-05 18:33:23


Analysis of your tiny data sample:

>>> s = "\x98cW\x01\xa2\xbb\xba\xcc\xec\x90\xfc\xffP\xcb%\x01\x08"
>>> u = s.decode('utf8', 'replace')
>>> u
u'\ufffdcW\x01\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdP\ufffd%\x01\x08'
>>> u.count(u'\ufffd')
9
>>> len(u)
16
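
On Python 3, where `bytes` and `str` are separate types, the same measurement can be sketched roughly as follows. The helper name `replacement_ratio` is made up for illustration; the byte string is the sample from the session above.

```python
# Python 3 sketch of the session above; `replacement_ratio` is a hypothetical helper.
REPLACEMENT = "\ufffd"  # U+FFFD, substituted by errors="replace" for undecodable input


def replacement_ratio(data: bytes) -> float:
    """Return the fraction of decoded characters that are U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


sample = b"\x98cW\x01\xa2\xbb\xba\xcc\xec\x90\xfc\xffP\xcb%\x01\x08"
ratio = replacement_ratio(sample)
print("%.0f%% of the decoded characters are replacements" % (ratio * 100))
```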

(1) That's certainly not UTF-8 with an occasional invalid sequence; 9 of the 16 decoded characters (over 50%) are U+FFFD replacement characters. In other words, pressing ahead and using data.decode('utf8', 'replace') is NOT a good idea (based on this TINY sample).

(2) The characters \x01 (twice) and \x08 make me suspect that you have got binary data somehow.

(3) The (truncated) error message that you quoted in a comment mentioned 0x88 but there is no 0x88 in the sample data.

(4) Please edit your question to show what you should have done at the start: (a) the minimal code necessary to reproduce the problem, including the URL that you are accessing; (b) the full error message and traceback; (c) an assurance that you have copied/pasted (a) and (b) rather than typing from memory.
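
The details asked for in (2) to (4) could be gathered with a short diagnostic. This is only a minimal Python 3 sketch, not the poster's code: `describe_decode_failure` is a hypothetical helper, and it assumes the downloaded payload is available as a `bytes` object.

```python
def describe_decode_failure(data: bytes, context: int = 8) -> dict:
    """Try a strict UTF-8 decode and report where and why it fails."""
    # Control bytes other than tab, newline and carriage return are rare in HTML.
    control = sum(1 for b in data if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    report = {"length": len(data), "control_bytes": control, "error": None}
    try:
        data.decode("utf-8")  # strict mode: raises on the first invalid byte
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        report["error"] = {
            "offset": exc.start,                     # index of the offending byte
            "byte": data[exc.start],                 # its integer value
            "reason": exc.reason,                    # the decoder's explanation
            "nearby": data[lo:exc.end + context],    # surrounding raw bytes
        }
    return report


sample = b"\x98cW\x01\xa2\xbb\xba\xcc\xec\x90\xfc\xffP\xcb%\x01\x08"
report = describe_decode_failure(sample)
print(report["control_bytes"], "control bytes")
print("first failure at offset", report["error"]["offset"],
      "byte", hex(report["error"]["byte"]))
```

Posting a report like this together with the full traceback would avoid the kind of mismatch noted in (3).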
