Encoding error when deserializing a JSON object from Google

Posted on 2024-10-06 07:19:43

As an exercise, I built a little script that queries the Google Suggest JSON API. The code is quite simple:

import json
import urllib

query = 'a'
url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
response = urllib.urlopen(url)  # Python 2 API
result = json.load(response)

This fails with:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte

If I call read() on the response object instead, this is what I get:

'["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]'

So it seems that the error is raised when Python tries to decode the string. This only happens with google.co.jp and the Japanese language. I tried the same code with different countries/languages and did not get the same issue: when I deserialize the object, everything works fine.

  • I checked the response headers, and they always specify utf-8 as the response encoding.
  • I checked the JSON string with an online parser (http://json.parser.online.fr/), and again everything seems OK.

Any ideas on how to solve this problem? What makes json.load() choke?

Thanks in advance.

Comments (2)

硪扪都還晓 2024-10-13 07:19:43

The response headers (print response.headers) contain the following information:

Content-Type: text/javascript; charset=Shift_JIS

Note the charset.

If you pass this encoding to json.load, it works (this relies on the Python 2 encoding keyword):

result = json.load(response, encoding='shift_jis')
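
The encoding keyword of json.load is a Python 2 feature and was removed in Python 3.9. On Python 3, one possible approach is to decode the raw bytes yourself using the charset from the Content-Type header and then parse the text. The following is only a sketch under assumptions: the helper names charset_from_content_type and load_json_bytes are hypothetical, the sample bytes are a shortened version of the response shown in the question, and no network request is made.

```python
import json
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def load_json_bytes(raw, content_type):
    """Decode raw response bytes with the declared charset, then parse JSON."""
    charset = charset_from_content_type(content_type)
    return json.loads(raw.decode(charset))


# Shortened sample of the body from the question (Shift_JIS encoded bytes).
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93"]]'
result = load_json_bytes(raw, "text/javascript; charset=Shift_JIS")
print(result[1])
```

Decoding before parsing matters here because some Shift_JIS trail bytes are ASCII characters such as } and ], so the bytes cannot be treated as ASCII-compatible text.
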
且行且努力 2024-10-13 07:19:43

Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.

At a guess, it is one of [ "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213" ]; try decoding as one of these.
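
A quick way to test that guess is to try each candidate codec on the sample bytes and see which ones decode without error. This is a minimal sketch, assuming Python 3; the function name decodable_as is hypothetical, and a successful decode only shows the bytes are valid in that codec, not that it is the encoding the server intended.

```python
CANDIDATES = ["cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]


def decodable_as(raw, codecs=CANDIDATES):
    """Return {codec_name: decoded_text} for every codec that accepts `raw`."""
    decoded = {}
    for name in codecs:
        try:
            decoded[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence; skip it.
            pass
    return decoded


# The byte string quoted above, taken from the response in the question.
sample = b"\x83A\x83}\x83]\x83\x93"
for name, text in decodable_as(sample).items():
    print(name, text)
```

Passing ["utf-8"] as the codec list gives an empty result for this sample, which matches the UnicodeDecodeError reported in the question.
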
