Writing a UTF-8-friendly parser in Python

Posted on 2024-11-18 20:37:29

I wrote a simple file parser and writer, but then I came across an article about the importance of Unicode, and it occurred to me that I'm assuming the input file is ASCII-encoded. That may not always be the case, though it would be rare in my situation.

In those rare cases, I would expect UTF-8 encoded files.

Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.

Furthermore, would I have to treat ASCII and UTF-8 files separately and write different functions for each? I have not worked with anything other than ASCII files yet and have only read about handling Unicode.

Comments (4)

晚风撩人 2024-11-25 20:37:29

Python natively supports Unicode. If you copy the first file to the second without decoding it (that is, in binary mode), no data is lost because the bytes are copied verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.
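
As a rough sketch of that difference (the file names and sample text here are made up for illustration), the snippet below copies a file in binary mode and then compares it with a decode/re-encode round trip using a matching and a mismatched codec:

```python
import os
import tempfile

# Hypothetical sample data: the UTF-8 bytes for the text "π is 3.14."
original = u"\u03c0 is 3.14.\n".encode("utf-8")

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")    # illustrative file names
dst = os.path.join(workdir, "out.txt")

with open(src, "wb") as f:
    f.write(original)

# Verbatim copy: binary mode never decodes, so the encoding is irrelevant.
with open(src, "rb") as fin, open(dst, "wb") as fout:
    fout.write(fin.read())

with open(dst, "rb") as f:
    copied = f.read()

# Decode/re-encode: lossless only when the decoding codec matches the data.
same_codec = original.decode("utf-8").encode("utf-8")
wrong_codec = original.decode("latin-1").encode("utf-8")

print(copied == original)       # True
print(same_codec == original)   # True
print(wrong_codec == original)  # False
```

The mismatched case does not raise an error, which is why a wrong guess about the input encoding can silently corrupt the output.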

林空鹿饮溪 2024-11-25 20:37:29

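If you are using Python 2, you can change your str objects to unicode objects. unicode objects have the same methods as str, but they hold Unicode code points rather than raw bytes, so they are not limited to ASCII. See http://docs.python.org/library/functions.html#unicode .

If you are using Python 3, str is already a Unicode text type. Source files are treated as UTF-8 by default, but the default encoding that open() uses in text mode can depend on the platform and locale, so pass encoding='utf-8' explicitly when reading or writing files.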
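
A minimal sketch of how a text string relates to its UTF-8 bytes; the sample string is only an illustration, and the `u` prefix is there so the same lines also work on Python 2:

```python
text = u"\u03c0 is 3.14."      # text: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: one concrete encoding of that text

print(len(text))                     # 10 code points
print(len(data))                     # 11 bytes; the pi character takes two
print(data.decode("utf-8") == text)  # True
```

Storing the decoded text and encoding it only when writing keeps the parser independent of how many bytes each character needs.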

苏璃陌 2024-11-25 20:37:29

If you are using Python 2.6 or later, you can use the io module and its io.open function to open the files you want. It has an encoding argument, which should be set to 'utf-8' in your case. When you read from or write to the returned file objects, strings are automatically decoded or encoded.

Anyway, you don't need to do anything special for ASCII, because UTF-8 is a superset of ASCII.
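
A small sketch of that approach, assuming a throwaway temporary file and made-up sample lines: the text is written and read back through io.open with encoding='utf-8', and a final check shows that ASCII-only bytes decode identically under both codecs.

```python
import io
import os
import tempfile

# Hypothetical scratch file for the round trip.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

lines = [u"plain ascii line\n", u"\u03c0 is 3.14.\n"]

# Writing: text is encoded to UTF-8 on the way out.
# newline="" turns off newline translation so the round trip is exact.
with io.open(path, "w", encoding="utf-8", newline="") as f:
    f.writelines(lines)

# Reading: bytes are decoded back to text on the way in.
with io.open(path, "r", encoding="utf-8", newline="") as f:
    stored = f.readlines()

print(stored == lines)  # True

# ASCII-only input needs no separate code path.
ascii_bytes = b"plain ascii line\n"
print(ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8"))  # True
```

On Python 3 the built-in open() is the same function as io.open, so the same arguments apply there.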

谁把谁当真 2024-11-25 20:37:29

So long as you are only reading from and writing to files and do not expect input in any other encoding, you should not have to do anything special.

% cat /tmp/u
π is 3.14.

% file /tmp/u
/tmp/u: UTF-8 Unicode text

% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()

% python f.py 
['\xcf\x80', 'is', '3.14.']
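
The output above is a list of byte strings, which is fine for storing and writing back unchanged. If the parser ever needs the actual characters, one possible sketch (assuming the file holds exactly the bytes shown, with a trailing newline) is to decode before splitting:

```python
# Assumed contents of /tmp/u as raw bytes.
raw = b"\xcf\x80 is 3.14.\n"

words_as_bytes = raw.split()                  # what the transcript shows
words_as_text = raw.decode("utf-8").split()   # decoded Unicode text

print(len(words_as_bytes[0]))              # 2: pi is two bytes in UTF-8
print(len(words_as_text[0]))               # 1: but a single character
print(words_as_text[0] == u"\u03c0")       # True
```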

This changes when your source code contains UTF-8 literals or when you accept UTF-8 text on standard input.

% cat g.py
s = 'π is 3.14.'
print s.split()

% python g.py
  File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).

% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()

% python h.py
['\xcf\x80', 'is', '3.14.']
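
For the standard-input case, one way to decode explicitly is to wrap the binary stream in io.TextIOWrapper. This is only a sketch on Python 3: read_utf8_lines is a hypothetical helper, and an in-memory BytesIO stands in for the real stream so the example runs anywhere.

```python
import io


def read_utf8_lines(byte_stream):
    """Decode a binary stream as UTF-8 and return its lines without newlines."""
    reader = io.TextIOWrapper(byte_stream, encoding="utf-8")
    return [line.rstrip(u"\n") for line in reader]


# Stand-in for standard input; in Python 3 the real binary stream
# would be sys.stdin.buffer.
fake_stdin = io.BytesIO(b"\xcf\x80 is 3.14.\nplain ascii\n")
lines = read_utf8_lines(fake_stdin)

print(len(lines))                         # 2
print(lines[0].split()[0] == u"\u03c0")   # True
```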