Encoding/decoding works in the browser but not in the terminal

Posted on 2024-11-18 07:40:30

Here's my code:

import urllib

print urllib.urlopen('http://www.indianexpress.com/news/heart-of-the-deal/811626/').read().decode('iso-8859-1')

When I view the page in Firefox, the text is displayed correctly. In the terminal, however, some characters are missing or garbled.

Here are some examples of the malformed output:

long-term  in
Indias
no-go areas

How can I fix this?



独木成林 2024-11-25 07:40:30

Try this (it ignores any characters that cannot be represented in ASCII):

import urllib
url = 'http://www.indianexpress.com/news/heart-of-the-deal/811626/'
print urllib.urlopen(url).read().decode('iso-8859-1').encode('ascii','ignore')
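To see what the `'ignore'` handler actually does, here is a minimal Python 3 sketch (the thread's code is Python 2). The sample bytes are made up for illustration and are not taken from the page; they assume the page contains Windows-style curly quotes, which is a guess based on the malformed output in the question.

```python
# Hypothetical sample bytes, not fetched from the real page.
raw = b"India\x92s \x93no-go\x94 areas"

# Decoding as ISO-8859-1 never fails, but bytes 0x92-0x94 become
# invisible C1 control characters instead of punctuation.
text = raw.decode("iso-8859-1")

# 'ignore' silently drops anything outside ASCII; 'replace' marks it with '?'.
ascii_ignore = text.encode("ascii", "ignore").decode("ascii")
ascii_replace = text.encode("ascii", "replace").decode("ascii")

print(ascii_ignore)   # Indias no-go areas
print(ascii_replace)  # India?s ?no-go? areas
```

Under these assumptions the output is ASCII-safe, but the punctuation is lost rather than recovered, so `'replace'` may be the safer choice when you want to notice that something was dropped.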


瑕疵 2024-11-25 07:40:30

You need to use the actual character set sent by the server instead of always assuming it is ISO 8859-1. Using a capable HTML parser such as Beautiful Soup would help.
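A rough sketch of reading the declared charset from a `Content-Type` header value, written for Python 3 rather than the Python 2 used in the question. The helper name `charset_from_content_type`, the `BROWSER_ALIASES` table, and the `cp1252` default are all assumptions made for this example. The alias table mirrors the way browsers following the WHATWG Encoding Standard treat the `iso-8859-1` label as windows-1252, which is probably why Firefox renders such a page correctly.

```python
from email.message import Message

# Assumption: mimic browsers, which decode these labels as windows-1252.
BROWSER_ALIASES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "us-ascii": "cp1252",
}

def charset_from_content_type(header_value, default="cp1252"):
    """Return a Python codec name for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    declared = msg.get_content_charset()  # lowercased, or None if absent
    if declared is None:
        return default
    return BROWSER_ALIASES.get(declared, declared)

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # cp1252
print(charset_from_content_type("text/html; charset=UTF-8"))       # utf-8
print(charset_from_content_type("text/html"))                      # cp1252
```

With a live response in Python 3, `urllib.request.urlopen(url).headers.get_content_charset()` returns the same declared value, since the headers object is an `email.message.Message` subclass. The result can then be passed to `bytes.decode`.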


意中人 2024-11-25 07:40:30

The web page lies; it is encoded in cp1252 (a.k.a. windows-1252), not in ISO-8859-1.

>>> import urllib
>>> guff = urllib.urlopen('http://www.indianexpress.com/news/heart-of-the-deal/811626/').read()
>>> uguff = guff.decode('latin1')
>>> baddies = set(c for c in uguff if u'\x80' <= c < u'\xa0')
>>> baddies
set([u'\x93', u'\x92', u'\x94', u'\x97'])
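The four code points found above fall in the C1 control range, which ISO-8859-1 does not use for printable text. As a small Python 3 sketch (the session above is Python 2), decoding the same four bytes with each codec shows what they were presumably meant to be:

```python
# The four bytes reported as "baddies" above.
suspects = b"\x93\x92\x94\x97"

as_latin1 = suspects.decode("latin1")   # C1 control characters
as_cp1252 = suspects.decode("cp1252")   # curly quotes and an em dash

for bad, good in zip(as_latin1, as_cp1252):
    print("U+%04X -> U+%04X" % (ord(bad), ord(good)))
# U+0093 -> U+201C  (left double quotation mark)
# U+0092 -> U+2019  (right single quotation mark)
# U+0094 -> U+201D  (right double quotation mark)
# U+0097 -> U+2014  (em dash)
```

So in the question's code, replacing `.decode('iso-8859-1')` with `.decode('cp1252')` should produce the intended characters, assuming the terminal itself can display them.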
