Problem parsing JSON read from a URL

Posted on 2025-01-08 22:03:24

I'm having a problem that I believe has a simple solution.

I'm writing a Python script which reads a JSON string from a URL and parses it. To do this I'm using urllib2 and simplejson.

The problem I'm having has to do with encoding. The URL I'm reading from does not explicitly state which encoding it uses (as far as I can tell), and it returns some Icelandic characters. I cannot give out the URL I'm reading from here, but I've set up a sample JSON data file on my own server and I'm also having problems reading that.
Here is the file: http://haukurhaf.net/json.txt

This is my code:

#!/usr/bin/env python
# coding: utf-8
import urllib2, re, os
from BeautifulSoup import BeautifulSoup
import simplejson as json

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'

def fetchPage(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html

html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)

The JSON parser crashes with this error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte

Since I do not have any control over the server which holds the JSON data, I cannot control which encoding headers it sends out. I'm hoping I can solve this on my end somehow.

Any ideas?

压抑⊿情绪 2025-01-15 22:03:24

The file is encoded using Latin-1, not UTF-8, so you have to specify the encoding:

jsonData = json.JSONDecoder('latin1').decode(html)

BTW: html is a bad name for a JSON document...
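The same idea can be sketched without simplejson. This is a minimal example that assumes Python 3 and the standard-library json module, not the urllib2/simplejson pair from the question; the payload and the key "nafn" are made up to stand in for the server response:

```python
import json

# Hypothetical payload standing in for the server response: JSON text with
# Icelandic characters, encoded as Latin-1 rather than UTF-8.
raw = '{"nafn": "Guðrún Þórsdóttir"}'.encode("latin-1")

# Decoding as UTF-8 fails, much like the error reported in the question.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes really use gives text json can parse.
text = raw.decode("latin-1")
data = json.loads(text)

print(utf8_ok)
print(data["nafn"])
```

Decoding the bytes yourself keeps the choice of encoding explicit instead of relying on the parser's default.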

七婞 2025-01-15 22:03:24

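http://haukurhaf.net/json.txt

This resource is encoded as ISO-8859-1 or, more likely, the Windows variant, code page 1252. It is not UTF-8.

You can read it with response.read().decode('cp1252') to get a Unicode string, which [simple]json should also be able to parse.

However, in byte form, JSON must be encoded in a UTF. So this is not valid JSON, and it will also fail if you try to load it from a browser.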
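To illustrate why the choice between ISO-8859-1 and cp1252 can matter, and what valid UTF-8 JSON bytes would look like, here is a small sketch. It assumes Python 3 and the standard-library json module, and the sample bytes are invented for the example:

```python
import json

# Hypothetical response bytes: 0xf3 is "ó" in both encodings, but 0x80 is
# the euro sign only in cp1252 (Latin-1 maps it to a C1 control character).
raw = b'{"name": "J\xf3n", "price": "5 \x80"}'

from_cp1252 = json.loads(raw.decode("cp1252"))
from_latin1 = json.loads(raw.decode("latin-1"))

# Both parse, but the two decodings disagree on the 0x80 byte.
print(from_cp1252["price"] == from_latin1["price"])  # False

# Re-serialize as UTF-8 to get bytes in an encoding the JSON spec allows.
utf8_bytes = json.dumps(from_cp1252, ensure_ascii=False).encode("utf-8")
round_trip = json.loads(utf8_bytes.decode("utf-8"))
```

For text that only uses characters in the 0xA0 to 0xFF range, such as the Icelandic letters here, both encodings give the same result; they differ only for bytes 0x80 to 0x9F.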

甜心小果奶 2025-01-15 22:03:24

You need to make the string unicode first (it's latin-1 right now):

uhtml = html.decode("latin-1")
jdata = json.loads(uhtml)

Or, if simplejson doesn't have loads:

json.JSONDecoder().decode(uhtml)
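When the encoding is not known in advance, one possible approach is to try the charset the server declares, then UTF-8, and only then fall back to a legacy encoding. The sketch below assumes Python 3 (urllib.request instead of urllib2); decode_body, fetch_json, the "example-client/0.1" User-Agent and the cp1252 fallback are all illustrative choices, not something the thread prescribes:

```python
import json
import urllib.request


def decode_body(raw, declared=None, fallback="cp1252"):
    """Decode bytes using the declared charset, then UTF-8, then a fallback."""
    for encoding in (declared, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # cp1252 leaves a few bytes undefined, so replace them instead of raising.
    return raw.decode(fallback, errors="replace")


def fetch_json(url, fallback="cp1252"):
    """Fetch a URL and parse its body as JSON (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-client/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if not declared.
        declared = response.headers.get_content_charset()
    return json.loads(decode_body(raw, declared, fallback))
```

Trying UTF-8 first is a heuristic: Latin-1 text with accented characters is rarely valid UTF-8 by accident, but it is not impossible, so a known encoding should be passed explicitly when available.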

About the author

遗失的美好