I just want to download this URL... but it gives me an error! ...unicode.. (Python)
theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent', random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource = htmlSource.encode('utf8')
return htmlSource
Please take a look at the unicode portion. I've tried both of those options, but neither works.
htmlSource = htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)
And I also get this when I try the longer method of encoding:
_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1
Comments (3)
Your HTML data is a byte string that comes from the internet already encoded in some encoding. Before encoding it to utf-8, you must decode it first. Python is implicitly trying to decode it with the ascii codec (that's why you get a UnicodeDecodeError and not a UnicodeEncodeError). You can solve the problem by explicitly decoding your bytestring (using the appropriate encoding) before trying to re-encode it to utf-8. Example: htmlSource.decode('some_encoding').encode('utf8')
Use the encoding the page was actually encoded in, instead of 'some_encoding'. You have to know which encoding a string is using before you can decode it.
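As a rough sketch of the explicit decode-then-encode step, here is a Python 3 version (the question uses Python 2, where urllib2 plays the role of urllib.request). The helper names charset_from_content_type and reencode_to_utf8, the utf-8 fallback, and the sample Latin-1 bytes are assumptions made for illustration; none of them come from the original post.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the charset and returns None if absent.
    return msg.get_content_charset() or default


def reencode_to_utf8(raw, source_encoding):
    """Decode bytes from their real encoding, then encode the text as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> text
    return text.encode('utf-8')         # text -> UTF-8 bytes


# Hypothetical sample: "café" as served by a Latin-1 page.
raw = b'caf\xe9'
encoding = charset_from_content_type('text/html; charset=ISO-8859-1')
utf8_bytes = reencode_to_utf8(raw, encoding)

# Decoding with the wrong codec reproduces the kind of error in the question.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

With a real Python 3 response object, urlResponse.headers.get_content_charset() should give the same charset value directly, since the headers object is built on email.message.Message.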
Shouldn't that be decode?
htmlSource = htmlSource.decode('utf8')
decode means "decode htmlSource from the utf8 encoding"
encode means "encode htmlSource to the utf8 encoding"
Since you are extracting existing data (crawling a website), you need to decode it, and when you insert it into MySQL, you may need to encode it as utf8, depending on the collations of your MySQL database/tables/fields.
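To make the two directions concrete, here is a small Python 3 sketch. The sample bytes are the ones quoted in the MySQL warning in the question, which happen to be valid UTF-8; the variable names are illustrative only.

```python
# Bytes quoted in the MySQL warning from the question (already UTF-8).
raw = b'\xe7\xb9\x81\xe9\xab\x94'

# decode: bytes in a known encoding -> text
text = raw.decode('utf8')

# encode: text -> bytes in the target encoding (e.g. before a database insert)
stored = text.encode('utf8')
```

Whether an explicit encode is needed before the insert depends on the database driver and connection settings, so treat the last step as one possible arrangement.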
You probably want to decode UTF-8, not encode it:
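A minimal sketch of that in Python 3, assuming the page really is UTF-8. The function name decode_html and the choice of errors='replace' are assumptions for illustration, not part of the original answer.

```python
def decode_html(raw, encoding='utf-8'):
    """Decode downloaded bytes to text; malformed bytes become U+FFFD."""
    return raw.decode(encoding, errors='replace')


# Sample input built from the bytes shown in the question's MySQL warning.
html = decode_html(b'<title>\xe7\xb9\x81\xe9\xab\x94</title>')

# A stray invalid byte is replaced instead of raising UnicodeDecodeError.
broken = decode_html(b'ok \xff')
```

Using errors='replace' trades strictness for robustness; dropping that argument makes bad input raise UnicodeDecodeError instead.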