Writing a UTF-8-friendly parser in Python
I wrote a simple file parser and writer, but then I came across an article about the importance of Unicode, and it occurred to me that I'm assuming the input file is ASCII encoded. That may not always be the case, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read, store, and write them properly.
Furthermore, would I have to treat ASCII and UTF-8 files separately and write different functions for each? I have not worked with anything other than ASCII files yet and have only read about handling Unicode.
Comments (4)
Python natively supports Unicode. If you directly read and write from the first file to the second, then no data is lost as it copies the bytes verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.
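To make the difference concrete, here is a minimal sketch assuming Python 3. The helper names copy_bytes and copy_text are made up for illustration. The first never interprets the data, so the encoding is irrelevant. The second decodes and re-encodes, so the encoding passed in has to match the file.

```python
import shutil


def copy_bytes(src_path, dst_path):
    """Copy a file byte for byte; no encoding is ever interpreted."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def copy_text(src_path, dst_path, encoding="utf-8"):
    """Decode on read and re-encode on write; the encoding must match the file."""
    # newline="" turns off newline translation so line endings survive unchanged.
    with open(src_path, "r", encoding=encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(text)
```

If the file is not actually valid in the given encoding, copy_text raises UnicodeDecodeError, while copy_bytes copies it regardless.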
If you are using Python 2, you can simply change all your str objects to unicode objects. unicode objects have nearly all the same methods as str, but they hold Unicode code points rather than encoded bytes. See http://docs.python.org/library/functions.html#unicode . If you are using Python 3, str is already a Unicode type. Note that the default encoding used by open() can depend on the platform locale, so pass encoding='utf-8' explicitly when reading and writing files.
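As a small illustration of text versus encoded bytes, here is a sketch assuming Python 3, where str is the Unicode type and bytes holds encoded data. The variable names are only for the example.

```python
text = "caf\u00e9"            # str: four Unicode code points ("café")
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text

n_chars = len(text)           # counts code points
n_bytes = len(data)           # counts bytes; "é" needs two bytes in UTF-8
round_trip = data.decode("utf-8")

# Pure-ASCII text yields identical bytes under ASCII and UTF-8.
same_for_ascii = "parser".encode("ascii") == "parser".encode("utf-8")
```

Storing the decoded str in memory and encoding only at the file boundary keeps the rest of the parser independent of the file's encoding.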
If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument, which should be set to 'utf-8' in your case. When you read or write the returned file objects, strings are automatically encoded and decoded. Either way, you don't need to do anything special for ASCII, because UTF-8 is a superset of ASCII.
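A minimal sketch of that approach follows. The helpers read_lines and write_lines are hypothetical names, not part of the io module, and the code is written for Python 3, where io.open is the same function as the built-in open.

```python
import io


def read_lines(path):
    """Read a UTF-8 (or plain ASCII) file and return its lines as text."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(path, lines):
    """Write text lines back out, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

The same two functions handle ASCII input unchanged, since every ASCII file is also a valid UTF-8 file.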
So long as you are only reading and writing files and are not expecting any other type of encoded input, you should not have to do anything special.
This changes when you declare non-ASCII (UTF-8) literals in the source code or accept UTF-8 on standard input.
To handle this properly, declare the encoding for the Python program at the beginning of the file per PEP 263 (referenced by the SyntaxError exception above).
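For reference, a PEP 263 declaration is a comment on the first or second line of the source file. The sketch below assumes Python 3, where UTF-8 is already the default source encoding and the declaration is optional. GREETING, utf8_reader, and read_stdin_lines are hypothetical names for illustration.

```python
# -*- coding: utf-8 -*-
import io
import sys

GREETING = "héllo"  # non-ASCII literal; the source file must be saved as UTF-8


def utf8_reader(binary_stream):
    """Wrap a binary stream so that reads return text decoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


def read_stdin_lines():
    """Read standard input as UTF-8 regardless of the locale's default."""
    return [line.rstrip("\n") for line in utf8_reader(sys.stdin.buffer)]
```

On Python 2 the declaration is what allows non-ASCII bytes in the source at all; without it the interpreter raises a SyntaxError that points to PEP 263.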