Python urllib.request and UTF-8 decoding problem

Posted 2024-10-10 06:34:27

I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

This script works fine when run from the command line, but when it gets to viewing it with a web browser, it shows a blank page. Here is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

Comments (3)

违心° 2024-10-17 06:34:27

When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.

When you use it in CGI, you end up printing to a stdout that is not a terminal, and the web server's environment usually gives Python no locale to work from. Python therefore falls back to encoding the string as ASCII. This fails, as ASCII doesn't contain all the characters you try to print (here U+2019, a right single quotation mark), so you get the above error.

The fix for this is to encode your string explicitly (why not UTF-8?) and also declare that encoding in the Content-Type header.
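The failure can be reproduced without a web server. The following is a minimal sketch, not the asker's script: `try_encode` and the sample string are made up for illustration, and the only real API used is `str.encode`.

```python
# Sketch: reproduce the UnicodeEncodeError from the traceback in isolation.
# The sample text and the helper name are illustrative assumptions.
text = "It\u2019s"  # contains U+2019 RIGHT SINGLE QUOTATION MARK


def try_encode(s, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


ascii_result = try_encode(text, "ascii")   # an error message (str)
utf8_result = try_encode(text, "utf-8")    # the encoded bytes
```

ASCII has no code point for U+2019, so the first call ends in the same kind of error Apache logged, while UTF-8 can represent every Unicode character.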

So something like this:

import sys

sys.stdout.buffer.write(b"Content-type: text/html; charset=UTF-8\n\n")  # the parameter is "charset"
sys.stdout.buffer.write(site.encode('UTF8'))

Under Python 2, this would work as well:

print("Content-type: text/html; charset=UTF-8\n\n")  # the parameter is "charset"
print(site.encode('UTF8'))

But under Python 3 the encoded data is a bytes object, so print() would output its repr (b'...') instead of the raw bytes.
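To see what that looks like, here is a small sketch that captures what `print` produces for a bytes object under Python 3. The `io.StringIO` buffer stands in for stdout purely so the result can be inspected; it is not part of the original answer.

```python
import io

# Encode a single non-ASCII character to UTF-8 bytes.
encoded = "\u2019".encode("utf-8")

# Capture what print() would write; print() calls str() on its argument,
# and str() of a bytes object is its repr.
buf = io.StringIO()
print(encoded, file=buf)
printed = buf.getvalue()
```

The captured text is the literal `b'...'` representation with escape sequences, not the three UTF-8 bytes a browser would need.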

Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
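The two options can be sketched as follows, assuming the page really is UTF-8. The function names `rewrite_page` and `send_page` are hypothetical, and an `io.BytesIO` stands in for `sys.stdout.buffer` so the sketch runs outside a CGI environment.

```python
import io


def rewrite_page(raw, old, new, encoding="utf-8"):
    """Decode the fetched bytes, modify the text as str, then re-encode."""
    text = raw.decode(encoding)
    text = text.replace(old, new)  # all edits happen on Unicode text
    return text.encode(encoding)


def send_page(body, out):
    """Write a CGI header and an already-encoded body to a binary stream."""
    out.write(b"Content-Type: text/html; charset=UTF-8\n\n")
    out.write(body)


raw = "<p>It\u2019s a page</p>".encode("utf-8")  # stand-in for site.read()

# Option 1: no modification needed, so pass the bytes through untouched.
passthrough = io.BytesIO()
send_page(raw, passthrough)

# Option 2: modify the HTML in between, keeping the edits in Unicode.
modified = io.BytesIO()
send_page(rewrite_page(raw, "a page", "a proxy"), modified)
```

In a real CGI script the second argument to `send_page` would be `sys.stdout.buffer`.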

殊姿 2024-10-17 06:34:27

It could be that the site you are trying to open is not UTF-8 encoded. Try passing "iso-8859-1" to the decode method.
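Instead of guessing, the charset can often be read from the response's Content-Type header. The sketch below parses such a header with the standard library's `email.message.Message`; the helper name `charset_from_content_type` and the UTF-8 fallback are assumptions for illustration. With a real `urllib.request` response, `response.headers.get_content_charset()` performs the same lookup.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


latin1 = charset_from_content_type("text/html; charset=ISO-8859-1")
fallback = charset_from_content_type("text/html")

# Decode a Latin-1 body using the declared charset.
decoded = b"caf\xe9".decode(latin1)
```

Pages may also declare their encoding in a `<meta>` tag, which this sketch does not inspect.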

扭转时空 2024-10-17 06:34:27

Rather than wrestling with the sys.stdout internals, it is much more straightforward to have the web server (1) set the CGI environment variable PYTHONIOENCODING (2) to UTF8.

For Apache2, you'll have to enable the loading of mod_env.so. In a Debian installation, that means creating a symlink in /etc/apache2/mods-enabled to /etc/apache2/mods-available/env.load. If you wish to keep the structure the same as for all the other module loaders and configs, also create a configuration file /etc/apache2/conf-available/env.conf and a symlink to it in /etc/apache2/conf-enabled.

The contents of the env_mod.conf file I created are:

<IfModule mod_env.c>
  SetEnv PYTHONIOENCODING UTF8
</IfModule>

Before I did this, my script reported sys.stdout.encoding as "ANSI ..." and raised an error when trying to print a string containing non-ASCII characters. Afterwards, it reported "UTF8" and correctly sent the desired UTF-8 to the browser.

(1) http://httpd.apache.org/docs/2.2/howto/cgi.html#env

(2) http://docs.python.org/3.3/library/sys.html#sys.stdin
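The effect of the variable can be checked without Apache. The sketch below starts a child Python interpreter with PYTHONIOENCODING set and captures the raw bytes it writes to stdout; the helper name `child_stdout_bytes` is hypothetical, and `subprocess` merely stands in for the web server launching the CGI script.

```python
import os
import subprocess
import sys


def child_stdout_bytes(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its raw stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    # The child writes U+2019 to its text-mode stdout, as the CGI script does.
    code = "import sys; sys.stdout.write('\\u2019')"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

Calling `child_stdout_bytes("UTF8")` should return the three UTF-8 bytes for U+2019, which is what the browser needs to receive.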
