Python urllib.request and utf8 decoding problem
I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:
#!/usr/bin/env python3.0
import urllib.request
site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')
print("Content-type: text/html\n\n")
print(site)
This script works fine when run from the command line, but when I view it in a web browser, it shows a blank page. Here is the error I get in Apache's error_log:
Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)
Comments (3)
When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.
When you use it in CGI, you end up printing to stdout, which does not have an encoding. Python therefore tries to encode the string with ASCII. This fails, as ASCII doesn't contain all the characters you try to print, so you get the above error.
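The difference between the two situations can be reproduced without a web server. The following is a minimal sketch, assuming an in-memory stream stands in for stdout; the helper name try_print is hypothetical. It prints the same string to a stream that encodes as ASCII and to one that encodes as UTF-8.

```python
import io


def try_print(text, encoding):
    """Print `text` to an in-memory text stream that uses `encoding`.

    Returns the bytes that were written, or the UnicodeEncodeError
    that print() raised.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()


# U+2019 is the character named in the traceback from the question.
ascii_result = try_print("It\u2019s", "ascii")   # fails, as under CGI
utf8_result = try_print("It\u2019s", "utf-8")    # succeeds, as on a UTF-8 terminal
```

The ASCII case fails in the same way as the traceback in the question, while the UTF-8 case produces the encoded bytes.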
The fix for this is to encode your string into some sort of encoding (why not UTF8?) and also say so in the header.
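So: encode the page as UTF-8, write the resulting bytes to standard output, and declare that charset in the Content-Type header.
Under Python 2, passing the encoded string to print would work as well.
But under Python 3 the encoded data is a bytes object, so it won't print well: print would output its b'...' representation rather than the raw bytes.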
Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
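As a minimal sketch of the fix described above under Python 3: the helper below (the name write_cgi_response is hypothetical) encodes the page as UTF-8 and writes it to sys.stdout.buffer, the binary layer underneath sys.stdout, so no implicit ASCII encoding takes place. The charset parameter in the header and the in-memory demo buffer are assumptions made for illustration.

```python
import io
import sys


def write_cgi_response(html, out=None):
    """Write `html` (a str) as a CGI response encoded in UTF-8."""
    if out is None:
        out = sys.stdout.buffer  # binary stream, bypasses stdout's text encoding
    header = "Content-Type: text/html; charset=utf-8\n\n"
    out.write(header.encode("ascii"))  # the header itself is plain ASCII
    out.write(html.encode("utf-8"))    # the body is sent as UTF-8 bytes
    out.flush()


# Demo: capture the response in memory instead of a real CGI stdout.
demo = io.BytesIO()
write_cgi_response("<p>It\u2019s fine</p>", demo)
```

In the script from the question, a call such as write_cgi_response(site) would take the place of the two print calls.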
It could be that the site you are trying to open is not UTF-8 encoded. Try passing "iso-8859-1" to the decode method.
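One way to apply the suggestion of passing a different codec to decode, without hard-coding a single one, is sketched below as an assumption rather than a definitive recipe: try the charset the server declares (with urllib.request it is available as response.headers.get_content_charset()), then UTF-8, and only then fall back to "iso-8859-1". The helper name decode_body is hypothetical.

```python
def decode_body(raw, declared=None, fallback="iso-8859-1"):
    """Decode response bytes: declared charset first, then UTF-8, then `fallback`."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # iso-8859-1 maps every byte to a character, so this decode cannot fail,
    # although the text may be wrong if the page uses some other encoding.
    return raw.decode(fallback)


# Bytes that are valid ISO-8859-1 but not valid UTF-8.
latin1_text = decode_body(b"caf\xe9")
# UTF-8 bytes with a correctly declared charset.
utf8_text = decode_body("It\u2019s".encode("utf-8"), "utf-8")
```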
Rather than wrestling with the sys.stdout internals, it is much more straightforward to have the web server (1) set the CGI environment variable PYTHONIOENCODING (2) to UTF8.
For Apache2, you'll have to enable the loading of mod_env.so. In a Debian installation, that equates to creating a symlink in /etc/apache2/mods-enabled to /etc/apache2/mods-available/env.load, creating a configuration /etc/apache2/conf-available/env.conf, and creating a symlink in /etc/apache2/conf-enabled to that, if you wish to keep the structure the same as with all the other module loaders and configs.
The env_mod.conf file I created contains the directive that sets this variable.
Before I did this, my script was reporting that sys.stdout.encoding was "ANSI ..." and erroring out when trying to print a string containing Unicode characters; afterwards, it was "UTF8", and the script was correctly sending the desired UTF-8 to the browser.
(1) http://httpd.apache.org/docs/2.2/howto/cgi.html#env
(2) http://docs.python.org/3.3/library/sys.html#sys.stdin
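To check from inside the CGI script whether the server-side setting took effect, something like the following could be used. It assumes the Apache configuration contains a mod_env directive along the lines of SetEnv PYTHONIOENCODING UTF8 (an assumption, not the author's verbatim file), and the helper name describe_stdout_encoding is hypothetical.

```python
import io
import os
import sys


def describe_stdout_encoding(environ=None, stream=None):
    """Report PYTHONIOENCODING and the encoding of the output stream."""
    environ = os.environ if environ is None else environ
    stream = sys.stdout if stream is None else stream
    return {
        "PYTHONIOENCODING": environ.get("PYTHONIOENCODING"),
        "stdout_encoding": getattr(stream, "encoding", None),
    }


# Demo with a stand-in environment and stream instead of a real CGI process.
fake_env = {"PYTHONIOENCODING": "UTF8"}
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
report = describe_stdout_encoding(fake_env, fake_stdout)
```

Called without arguments in a real CGI script, the function reads os.environ and sys.stdout, which is the same check the answer describes doing before and after the configuration change.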