I just want to download this URL... but it gives me an error! ...unicode.. (Python)

Posted 2024-08-12 06:03:10, 852 characters, 8 views, 0 comments

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please take a look at the unicode portion. I've tried both of those options, but neither works.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

I also get this when I try the longer method of encoding:

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

贱贱哒 2024-08-19 06:03:10

Your HTML data is a byte string that comes from the internet already encoded in some encoding. Before encoding it to UTF-8, you must decode it first.

Python is implicitly trying to decode it (that's why you get a UnicodeDecodeError, not a UnicodeEncodeError).
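To illustrate where the error comes from: in Python 2, calling .encode() on a byte string first decodes it with the default ascii codec. The sketch below spells out that hidden step in Python 3 syntax (an assumption, since the question uses Python 2). The helper name implicit_encode is hypothetical, and the sample bytes are the ones shown in the MySQL warning in the question.

```python
# Sample bytes copied from the MySQL warning in the question.
raw = b'\xe7\xb9\x81\xe9\xab\x94'


def implicit_encode(data):
    """Hypothetical helper: mimic Python 2's str.encode('utf8').

    Python 2 first decodes the byte string with the default 'ascii'
    codec, then encodes the result. The decode step is what fails.
    """
    return data.decode('ascii').encode('utf8')


try:
    implicit_encode(raw)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure is a *decode* error raised by the ascii codec on byte 0xe7.
print(type(error).__name__, error.encoding, hex(raw[error.start]))
```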

You can solve the problem by explicitly decoding your byte string (using the appropriate encoding) before trying to re-encode it to UTF-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Instead of 'some_encoding', use the encoding the page was actually encoded in.

You have to know which encoding a string is using before you can decode it.
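One possible source for the encoding is the charset parameter of the Content-Type header, which is what the commented-out lines in the question were reaching for. Below is a minimal sketch in Python 3 syntax (the question uses Python 2). The helper names charset_from_content_type and to_utf8 and the utf-8 fallback are assumptions for illustration, not part of the original answer, and a server may send a wrong or missing charset.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


def to_utf8(raw, content_type):
    """Decode `raw` using the declared charset, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    return raw.decode(encoding).encode('utf-8')


# Example: a Latin-1 encoded body becomes UTF-8 bytes.
latin1_body = 'caf\u00e9'.encode('latin-1')
utf8_body = to_utf8(latin1_body, 'text/html; charset=ISO-8859-1')
print(utf8_body)
```

With urllib2 as in the question, the header value would come from urlResponse.headers['content-type'].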

屋顶上的小猫咪 2024-08-19 06:03:10

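Shouldn't you decode instead? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding".

encode means "encode htmlSource into the utf8 encoding".

Because you are extracting existing data (crawling a website), you need to decode it; when you insert it into MySQL, you may need to encode it as utf8, depending on the collation of the MySQL database, table, and fields.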
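The two directions can be checked with the bytes shown in the MySQL warning from the question. This is a short sketch in Python 3 syntax (the question uses Python 2, where str plays the role of bytes), and it assumes the scraped page really is UTF-8.

```python
# Bytes as they arrive from the website (taken from the MySQL warning).
raw = b'\xe7\xb9\x81\xe9\xab\x94'

# decode: bytes -> text, interpreting the bytes as UTF-8.
text = raw.decode('utf8')

# encode: text -> bytes, for example just before writing to a utf8 column.
stored = text.encode('utf8')

# Six bytes become two characters, and the round trip is lossless.
print(len(raw), len(text), stored == raw)
```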

荒人说梦 2024-08-19 06:03:10

You probably want to decode the UTF-8 data, not encode it:

htmlSource =  htmlSource.decode('utf8')

About the author: 画尸师
