I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

    # The string, which has an a-acute in it.
    >>> ss = u'Capit\xe1n'
    >>> ss8 = ss.encode('utf8')
    >>> repr(ss), repr(ss8)
    ("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
    >>> print ss, ss8
    >>> print >> open('f1','w'), ss8
    >>> file('f1').read()
    'Capit\xc3\xa1n\n'

So I type in Capit\xc3\xa1n into my favorite editor, in file f2. Then:

    >>> open('f1').read()
    'Capit\xc3\xa1n\n'
    >>> open('f2').read()
    'Capit\\xc3\\xa1n\n'
    >>> open('f1').read().decode('utf8')
    u'Capit\xe1n\n'
    >>> open('f2').read().decode('utf8')
    u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON-dump the string and use that instead, since that has an ASCII-safe representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode when coming in from a file? If so, how do I get it?

    >>> print simplejson.dumps(ss)
    '"Capit\u00e1n"'
    >>> print >> file('f3','w'), simplejson.dumps(ss)
    >>> simplejson.load(open('f3'))
    u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding. Supposing the file is encoded in UTF-8, we can open it with io.open (first sketch below), and f.read() then returns a decoded Unicode object.

In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (the built-in open in 2.x does not).

We can also use open from the codecs standard library module (second sketch below). Note, however, that this can cause problems when mixing read() and readline().
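Minimal sketches of both approaches (2.x, assuming the f1 file written in the question):

    >>> import io
    >>> f = io.open('f1', mode='r', encoding='utf-8')
    >>> f.read()
    u'Capit\xe1n\n'

    >>> import codecs
    >>> f = codecs.open('f1', 'r', 'utf-8')
    >>> f.read()
    u'Capit\xe1n\n'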
In the notation u'Capit\xe1n\n' (which should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1. Those are 8 bytes, and the code reads them all; we can see this by displaying the result (first sketch below). Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec (second sketch). The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from str to bytes, and decode from bytes to str. unicode_escape needs to start with bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work (third sketch):
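Sketches of each step, assuming the f2 file from the question:

    >>> # 1. What the editor actually saved (the backslashes are literal):
    >>> open('f2').read()
    'Capit\\xc3\\xa1n\n'

    >>> # 2. 2.x: undo the escapes, then decode the UTF-8 bytes:
    >>> open('f2').read().decode('string_escape')
    'Capit\xc3\xa1n\n'
    >>> open('f2').read().decode('string_escape').decode('utf-8')
    u'Capit\xe1n\n'

    >>> # 3. 3.x: unicode_escape wants bytes in and gives character escapes out,
    >>> # so re-encode with latin-1 to get bytes back before the UTF-8 decode:
    >>> open('f2').read().encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
    'Capitán\n'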
Now all you need in Python 3 is:

    open(filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]

Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (Note that while UTF-8 is Python 3's default source and string encoding, open() without an explicit encoding argument still uses the platform's locale encoding, so it is worth passing encoding='utf-8' explicitly.)
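For example, a minimal write/read round trip (the file name is illustrative):

    with open('f1', 'w', encoding='utf-8') as f:
        f.write('Capit\xe1n\n')

    with open('f1', 'r', encoding='utf-8') as f:
        print(f.read())  # Capitán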
This works for reading a file with UTF-8 encoding in Python 3.2:
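A sketch of one way that works there, using the codecs module (the file name is illustrative):

    import codecs

    f = codecs.open('file_name.txt', 'r', 'UTF-8')
    for line in f:
        print(line)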
So, I've found a solution for what I'm looking for. There are some unusual codecs that are useful here: this particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled. This allows for the sort of round trip that I was imagining:
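A sketch of the round trip (2.x, with ss and f2 from the question; the write line is illustrative):

    >>> # Write an ASCII-only escaped representation...
    >>> print >> open('f2', 'w'), repr(ss.encode('utf-8'))[1:-1]
    >>> # ...and read it back in as Unicode:
    >>> open('f2').read().decode('string-escape').decode('utf-8')
    u'Capit\xe1n\n'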
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
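A minimal sketch (the file name is illustrative; the same code runs on 2.x and 3.x):

    import io

    with io.open('f1', 'w', encoding='utf-8') as f:
        f.write(u'Capit\xe1n\n')

    with io.open('f1', 'r', encoding='utf-8') as f:
        print(f.read())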
To read in a Unicode string and then send it to HTML, I did this (useful for Python-powered HTTP servers):
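A sketch of the idea (the file name and handler are illustrative, not from the original):

    import codecs

    # Read the file as a Unicode string...
    text = codecs.open('f1', 'r', 'utf-8').read()
    # ...and encode it back to UTF-8 bytes for the HTTP response body.
    body = text.encode('utf-8')
    # e.g. in a BaseHTTPServer handler: self.wfile.write(body)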
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line: it's now a real backslash + xc3, etc. in your file.

If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
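For instance, print the string encoded as Latin-1 (a 2.x sketch; this assumes your terminal displays Latin-1):

    >>> print u'Capit\xe1n'.encode('latin-1')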
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
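    <?xml version="1.0" encoding="UTF-8"?>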
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path, mode, encoding), which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
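A 2.x sketch, using the escaped text from f2 as a literal:

    >>> x = eval("'Capit\\xc3\\xa1n\\n'")
    >>> x
    'Capit\xc3\xa1n\n'
    >>> x[5]
    '\xc3'
    >>> len(x[5])
    1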
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
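    >>> x.decode('utf-8')  # continuing the sketch above
    u'Capit\xe1n\n'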
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex: ... codecs.open('f2', 'rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special", which is what the sequence "\x" does. It says: the next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: three times the amount of using codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.

How you actually enter UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. On OS X, to enter a with an acute accent you can just hit Option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
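A minimal sketch of the idea, built on codecs.open (defaulting everything to UTF-8 is an assumption here):

    import codecs
    from functools import partial

    # Shadow the built-in open with a version that decodes transparently.
    open = partial(codecs.open, encoding='utf-8')

    # Existing code keeps working, but read() now returns unicode:
    # text = open('f1').read()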
I was trying to parse iCal using Python 2.7.9, but I was getting encoding errors, and it was fixed with just an explicit UTF-8 decode. (Now it can print liké á böss.)
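A hypothetical sketch of the pattern (the original snippets were not preserved, so the file name and fix shown here are illustrative, not the original code):

    # Read the raw bytes, then decode explicitly instead of relying on
    # Python 2's implicit ASCII codec:
    data = open('calendar.ics').read().decode('utf-8')
    # Encode back to UTF-8 bytes before printing to a non-UTF-8 stdout:
    print data.encode('utf-8')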
I found the simplest approach to be changing the default encoding of the whole script to 'UTF-8': any open, print or other statement will then just use utf8.

Works at least for Python 2.7.9.

Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
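Presumably the hack the linked post describes; a 2.x-only sketch (and one that is widely discouraged outside quick scripts):

    import sys

    # site.py removes setdefaultencoding at startup; reload(sys) restores it.
    reload(sys)
    sys.setdefaultencoding('utf8')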