I just want to download this URL... but it gives me an error! ...unicode.. (Python)

Posted on 2024-08-12 06:03:10

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please take a look at the unicode portion. I've tried both of those options, but neither works.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

I also get this when I try the longer method of encoding:

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

Comments (3)

贱贱哒 2024-08-19 06:03:10

Your html data is a string that comes from the internet already encoded with some encoding. Before encoding it to utf-8, you must decode it first.

Python is implicitly trying to decode it first, using the default ASCII codec (that's why you get a UnicodeDecodeError rather than a UnicodeEncodeError).
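To make that hidden step visible, here is a minimal sketch. It is written for Python 3, which no longer decodes implicitly, so the helper spells out by hand what Python 2 does when .encode() is called on a byte string. The name py2_style_encode is made up for this illustration, and the sample bytes are the ones quoted in the MySQL warning in the question.

```python
# The first six bytes quoted in the MySQL warning from the question.
raw = b"\xe7\xb9\x81\xe9\xab\x94"


def py2_style_encode(data, target="utf-8"):
    """Hypothetical helper: imitate Python 2's str.encode().

    Python 2 first decodes the byte string with the default codec
    (ASCII) and only then encodes the result to the target encoding.
    """
    return data.decode("ascii").encode(target)


error = None
try:
    py2_style_encode(raw)
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode step, not from encoding.
    error = exc

print(type(error).__name__, "at byte offset", error.start)
```

The exception is raised on 0xe7, the same byte value named in the traceback in the question.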

You can solve the problem by explicitly decoding your bytestring (using the appropriate encoding) before trying to re-encode it to utf-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Replace 'some_encoding' with the encoding the page was actually served in.

You have to know which encoding a string is using before you can decode it.
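One place to look for that encoding is the charset parameter of the Content-Type response header, which the commented-out lines in the question already try to read. Below is a rough Python 3 sketch under a few assumptions: the header value is passed in as a plain string, UTF-8 is used as the fallback when no charset is declared, and the function names charset_from_content_type and to_utf8 are invented for this example. Pages can also declare their encoding in a meta tag or declare it wrongly, which this sketch does not handle.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def to_utf8(raw, content_type):
    """Decode raw bytes using the declared charset, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    return raw.decode(encoding).encode("utf-8")


# Example: a Big5-encoded body is transcoded to UTF-8 bytes.
big5_body = "\u7e41\u9ad4".encode("big5")
utf8_body = to_utf8(big5_body, "text/html; charset=Big5")
```

Unlike split('charset=')[-1], this falls back to a default when the header carries no charset at all, instead of returning the whole header value as if it were an encoding name.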

屋顶上的小猫咪 2024-08-19 06:03:10

Shouldn't that be decode? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding"

encode means "encode htmlSource to the utf8 encoding"

Since you are extracting existing data (crawling a website), you need to decode it first. When you insert it into MySQL, you may need to encode it as utf8, depending on the collation of your MySQL database, tables, and fields.
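The two directions can be seen side by side in a small Python 3 sketch. It assumes the page really is UTF-8, as this answer does, and reuses the bytes quoted in the MySQL warning. Whether the final encode step is needed before inserting depends on how the database driver is configured, so treat that last line as illustrative only.

```python
# Bytes as they would arrive from the network (UTF-8 assumed).
raw = b"\xe7\xb9\x81\xe9\xab\x94"

# decode: bytes -> text
text = raw.decode("utf8")

# Inside the program, work with text: len() counts characters, not bytes.
char_count = len(text)
byte_count = len(raw)

# encode: text -> bytes, done only at the output boundary.
for_storage = text.encode("utf8")
```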

荒人说梦 2024-08-19 06:03:10

You probably want to decode UTF-8, not encode it:

htmlSource =  htmlSource.decode('utf8')