I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

    # The string, which has an a-acute in it.
    >>> ss = u'Capit\xe1n'
    >>> ss8 = ss.encode('utf8')
    >>> repr(ss), repr(ss8)
    ("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
    >>> print ss, ss8
    >>> print >> open('f1','w'), ss8
    >>> file('f1').read()
    'Capit\xc3\xa1n\n'

So I type in Capit\xc3\xa1n into my favorite editor, in file f2. Then:

    >>> open('f1').read()
    'Capit\xc3\xa1n\n'
    >>> open('f2').read()
    'Capit\\xc3\\xa1n\n'
    >>> open('f1').read().decode('utf8')
    u'Capit\xe1n\n'
    >>> open('f2').read().decode('utf8')
    u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON-dump the string and use that instead, since that has an ASCII-safe representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode when coming in from a file? If so, how do I get it?

    >>> print simplejson.dumps(ss)
    '"Capit\u00e1n"'
    >>> print >> file('f3','w'), simplejson.dumps(ss)
    >>> simplejson.load(open('f3'))
    u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding. Supposing the file is encoded in UTF-8, we can open it with io.open (first sketch below), and f.read() then returns a decoded Unicode object.

In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (the built-in open in 2.x does not).

We can also use open from the codecs standard library module (second sketch below). Note, however, that this can cause problems when mixing read() and readline().
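Minimal sketches of both approaches (2.x, assuming the f1 file written in the question):

    >>> import io
    >>> f = io.open('f1', mode='r', encoding='utf-8')
    >>> f.read()
    u'Capit\xe1n\n'

    >>> import codecs
    >>> f = codecs.open('f1', 'r', 'utf-8')
    >>> f.read()
    u'Capit\xe1n\n'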
In the notation u'Capit\xe1n\n' (which should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1. Those are 8 bytes, and the code reads them all; we can see this by displaying the result (first sketch below). Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec (second sketch). The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from str to bytes, and decode from bytes to str. unicode_escape needs to start with bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work (third sketch):
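Sketches of each step, assuming the f2 file from the question:

    >>> # 1. What the editor actually saved (the backslashes are literal):
    >>> open('f2').read()
    'Capit\\xc3\\xa1n\n'

    >>> # 2. 2.x: undo the escapes, then decode the UTF-8 bytes:
    >>> open('f2').read().decode('string_escape')
    'Capit\xc3\xa1n\n'
    >>> open('f2').read().decode('string_escape').decode('utf-8')
    u'Capit\xe1n\n'

    >>> # 3. 3.x: unicode_escape wants bytes in and gives character escapes out,
    >>> # so re-encode with latin-1 to get bytes back before the UTF-8 decode:
    >>> open('f2').read().encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
    'Capitán\n'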
Now all you need in Python 3 is:

    open(filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]

Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (Note that while UTF-8 is Python 3's default source and string encoding, open() without an explicit encoding argument still uses the platform's locale encoding, so it is worth passing encoding='utf-8' explicitly.)
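For example, a minimal write/read round trip (the file name is illustrative):

    with open('f1', 'w', encoding='utf-8') as f:
        f.write('Capit\xe1n\n')

    with open('f1', 'r', encoding='utf-8') as f:
        print(f.read())  # Capitán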
This works for reading a file with UTF-8 encoding in Python 3.2:
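A sketch of one way that works there, using the codecs module (the file name is illustrative):

    import codecs

    f = codecs.open('file_name.txt', 'r', 'UTF-8')
    for line in f:
        print(line)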
So, I've found a solution for what I'm looking for. There are some unusual codecs that are useful here: this particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled. This allows for the sort of round trip that I was imagining:
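A sketch of the round trip (2.x, with ss and f2 from the question; the write line is illustrative):

    >>> # Write an ASCII-only escaped representation...
    >>> print >> open('f2', 'w'), repr(ss.encode('utf-8'))[1:-1]
    >>> # ...and read it back in as Unicode:
    >>> open('f2').read().decode('string-escape').decode('utf-8')
    u'Capit\xe1n\n'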
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
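A minimal sketch (the file name is illustrative; the same code runs on 2.x and 3.x):

    import io

    with io.open('f1', 'w', encoding='utf-8') as f:
        f.write(u'Capit\xe1n\n')

    with io.open('f1', 'r', encoding='utf-8') as f:
        print(f.read())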
To read in a Unicode string and then send it to HTML, I did this (useful for Python-powered HTTP servers):
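A sketch of the idea (the file name and handler are illustrative, not from the original):

    import codecs

    # Read the file as a Unicode string...
    text = codecs.open('f1', 'r', 'utf-8').read()
    # ...and encode it back to UTF-8 bytes for the HTTP response body.
    body = text.encode('utf-8')
    # e.g. in a BaseHTTPServer handler: self.wfile.write(body)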
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line: it's now a real backslash + xc3, etc. in your file.

If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
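For instance, print the string encoded as Latin-1 (a 2.x sketch; this assumes your terminal displays Latin-1):

    >>> print u'Capit\xe1n'.encode('latin-1')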
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
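    <?xml version="1.0" encoding="UTF-8"?>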
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path, mode, encoding), which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
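A 2.x sketch, using the escaped text from f2 as a literal:

    >>> x = eval("'Capit\\xc3\\xa1n\\n'")
    >>> x
    'Capit\xc3\xa1n\n'
    >>> x[5]
    '\xc3'
    >>> len(x[5])
    1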
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
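    >>> x.decode('utf-8')  # continuing the sketch above
    u'Capit\xe1n\n'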
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex: ... codecs.open('f2', 'rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special", which is what the sequence "\x" does. It says: the next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: three times the amount of using codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.

How you actually enter UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. On OS X, to enter a with an acute accent you can just hit Option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
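A minimal sketch of the idea, built on codecs.open (defaulting everything to UTF-8 is an assumption here):

    import codecs
    from functools import partial

    # Shadow the built-in open with a version that decodes transparently.
    open = partial(codecs.open, encoding='utf-8')

    # Existing code keeps working, but read() now returns unicode:
    # text = open('f1').read()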
I was trying to parse iCal using Python 2.7.9, but I was getting encoding errors, and it was fixed with just an explicit UTF-8 decode. (Now it can print liké á böss.)
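A hypothetical sketch of the pattern (the original snippets were not preserved, so the file name and fix shown here are illustrative, not the original code):

    # Read the raw bytes, then decode explicitly instead of relying on
    # Python 2's implicit ASCII codec:
    data = open('calendar.ics').read().decode('utf-8')
    # Encode back to UTF-8 bytes before printing to a non-UTF-8 stdout:
    print data.encode('utf-8')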
I found the simplest approach to be changing the default encoding of the whole script to 'UTF-8': any open, print or other statement will then just use utf8.

Works at least for Python 2.7.9.

Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
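Presumably the hack the linked post describes; a 2.x-only sketch (and one that is widely discouraged outside quick scripts):

    import sys

    # site.py removes setdefaultencoding at startup; reload(sys) restores it.
    reload(sys)
    sys.setdefaultencoding('utf8')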