Encoding error when deserializing a JSON object from Google
As an exercise, I built a small script that queries the Google Suggest JSON API. The code is quite simple:
import json
import urllib

query = 'a'
url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
response = urllib.urlopen(url)
result = json.load(response)
But json.load raises:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte
If I read() the response object instead, this is what I get:
'["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]'
So it seems that the error is raised when Python tries to decode the string. This only happens with google.co.jp and the Japanese language. I tried the same code with different countries/languages and did not get the same issue: when I deserialize the object, everything works OK.
- I checked the response headers, and they always specify utf-8 as the response encoding.
- I checked the JSON string with an online parser (http://json.parser.online.fr/) and again all seems OK.
Any ideas on how to solve this problem? What makes the JSON load() function choke?
Thanks in advance.
The response headers (print response.headers) contain the following information:
Note the charset.
If you specify this encoding in json.load, it will work:
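A rough sketch of that idea, written for Python 3, where recent versions of json.loads have no encoding argument and the bytes have to be decoded first. The Content-Type value below is a hypothetical example, not the header the server actually returned, and charset_from_content_type and load_json_bytes are illustrative helper names. The byte string is a shortened version of the one in the question.

```python
import json
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


def load_json_bytes(raw, content_type):
    """Decode raw bytes with the charset named in the header, then parse JSON."""
    encoding = charset_from_content_type(content_type)
    return json.loads(raw.decode(encoding))


# Assumption: a header value of this shape; check what your response really says.
content_type = "text/javascript; charset=Shift_JIS"
# Shortened sample of the bytes shown in the question.
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93"],["",""]]'

result = load_json_bytes(raw, content_type)
print(result[1])
```

In Python 2, as used in the question, the equivalent would be passing the charset as the encoding argument of json.load.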
Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.
At a guess, it is one of [ "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213" ]; try decoding as one of these.
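One way to try those candidates in turn, as a minimal Python 3 sketch. decode_with_candidates is a made-up helper name, and the byte string is a shortened version of the one in the question. Keep in mind that a successful decode does not prove the guess is right: these four codecs overlap heavily, so the first one in the list will usually win for common characters.

```python
import json

# Candidate encodings from the answer above, tried in this order.
CANDIDATES = ["cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]


def decode_with_candidates(raw, candidates=CANDIDATES):
    """Return (encoding_name, text) for the first candidate that decodes raw."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Shortened sample of the bytes shown in the question.
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93","\x83A\x83\x8b\x83N"],["","",""]]'

encoding, text = decode_with_candidates(raw)
suggestions = json.loads(text)[1]
print(encoding, suggestions)
```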