Problem parsing JSON read from a URL
I'm having a problem that I believe has a simple solution.
I'm writing a Python script which reads a JSON string from a URL and parses it. To do this I'm using urllib2 and simplejson.
The problem I'm having has got to do with encoding. The URL I'm reading from does not explicitly state in which encoding it is (as far as I can tell) and it returns some Icelandic characters. I cannot give out the URL I'm reading from here, but I've set up a sample JSON data file on my own server and I'm also having problems reading that.
Here is the file: http://haukurhaf.net/json.txt
This is my code:
#!/usr/bin/env python
# coding: utf-8
import urllib2, re, os
from BeautifulSoup import BeautifulSoup
import simplejson as json

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'

def fetchPage(url):
    # Request the URL with a browser-like User-Agent and return the raw response body.
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html

html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)
The JSON parser crashes with this error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte
Since I do not have any control over the server which holds the JSON data, I cannot control which encoding headers it sends out. I'm hoping I can solve this on my end somehow.
Any ideas?
Comments (3)
The file is encoded using Latin-1, not UTF-8, so you have to specify the encoding when decoding it.

BTW: html is a bad name for a JSON document...

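To illustrate the Latin-1 point, here is a minimal sketch. It assumes Python 3 and the standard-library json module rather than the urllib2/simplejson setup from the question, and the `raw` bytes are a made-up sample standing in for the real file, not its actual contents.

```python
import json

# Hypothetical sample (not the real file): 0xed is "í" and 0xe1 is "á" in Latin-1,
# but neither byte forms a valid UTF-8 sequence here.
raw = b'{"city": "Reykjav\xedk", "letter": "\xe1"}'

# Decoding as UTF-8 fails in the same way as in the question.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 first gives a Unicode string the JSON parser accepts.
text = raw.decode("latin-1")
data = json.loads(text)
```

The key step is decoding the bytes yourself before handing them to the parser, so the parser never has to guess the encoding.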
This resource is encoded as ISO-8859-1, or, more likely, the Windows variant code page 1252. It is not UTF-8.

You can read it with response.read().decode('cp1252') to get a Unicode string which [simple]json should also be able to parse.

However, in byte form, JSON must be encoded in a UTF. Therefore this is not valid JSON, and it will fail if you attempt to load it from a browser too.

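If you cannot be sure which encoding the server uses, one possible approach is to try UTF-8 first and fall back to cp1252. The sketch below assumes Python 3 and the standard-library json module; `parse_json_bytes` and the sample bytes are hypothetical names and data for illustration, not part of any library. Note that the fallback is a guess: cp1252 will decode almost any byte sequence, so the order of the candidates matters.

```python
import json

def parse_json_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that works, then parse."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit the bytes, try the next one
        return json.loads(text)
    raise ValueError("none of the candidate encodings could decode the data")

# Hypothetical sample: 0x80 is the euro sign in cp1252,
# but a control character in ISO-8859-1.
sample = b'{"price": "5 \x80", "letter": "\xe1"}'
result = parse_json_bytes(sample)
```

The byte 0x80 is where the two encodings differ: cp1252 maps it to the euro sign, while ISO-8859-1 maps it to an invisible control character, which is one reason cp1252 is often the better guess for data produced on Windows.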
You need to make the string unicode first (it's latin-1 right now), then parse the result with loads.

Or, if simplejson doesn't have loads, pass the unicode string to the JSONDecoder().decode() call from the question instead.
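The two parsing routes can be sketched side by side. This assumes Python 3 and the standard-library json module, which exposes the same `loads` and `JSONDecoder` names as simplejson; the `raw` bytes are a made-up sample.

```python
import json

# Hypothetical Latin-1 encoded sample; 0xe1 is "á".
raw = b'{"letter": "\xe1"}'

# Make the string unicode first.
text = raw.decode("latin-1")

# Route 1: the module-level helper.
via_loads = json.loads(text)

# Route 2: the decoder object used in the question.
via_decoder = json.JSONDecoder().decode(text)
```

Both routes return the same dictionary, so which one you use is mostly a matter of what your installed version provides.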