Why does printing to a UTF-8 file fail?
I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.
This is related to a problem I had the other week: python check if utf-8 string is uppercase
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
          u'R\xc9SUM\xc9', # RESUME with accents
          u'R\xe9sum\xe9', # Resume with accents
          u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
It fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.
So what's happening?
- Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
- If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
- But isn't the output of etree.tostring() simply being written to stdout and then redirected to a file that was created as a UTF-8 file?
- Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.
Comments (3)
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
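The implicit step can be made explicit. This is a minimal sketch, not the actual code path inside the codecs module: it takes the UTF-8 bytes of the first word from the question and decodes them as ASCII, which is roughly what Python 2 does behind the scenes before it can re-encode to UTF-8. The variable names are illustrative, and the byte position in the message differs from the traceback because only the single word is decoded here.

```python
# -*- coding: utf-8 -*-
# Bytes like those etree.tostring(..., encoding='utf-8') produces for the Cyrillic word.
data = u'\u041c\u041e\u0421\u041a\u0412\u0410'.encode('utf-8')

# A UTF-8 stream writer wants unicode text. Given a byte string, Python 2
# first decodes it with the default ASCII codec. Doing that by hand:
try:
    data.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(message)
```

The first byte of the encoded word is 0xd0, the same byte the traceback complains about.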
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
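As a rough illustration of "encode on output", the sketch below keeps the serialized XML as Unicode text and lets the file object do the only encoding step. It assumes the standard-library xml.etree.ElementTree in place of lxml (so there is no pretty_print), and it writes to a temporary path rather than the question's test.xml; both choices are assumptions made to keep the example self-contained.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
title = ET.SubElement(root, 'title')
title.text = u'\u041c\u041e\u0421\u041a\u0412\u0410'

# tostring() with an explicit encoding returns a byte string;
# decode it once so that only Unicode text circulates internally.
xml_bytes = ET.tostring(root, encoding='utf-8')
xml_text = xml_bytes.decode('utf-8')

# The codecs file expects Unicode and encodes it to UTF-8 on write.
path = os.path.join(tempfile.mkdtemp(), 'test.xml')
out_file = codecs.open(path, 'w', 'utf-8')
try:
    out_file.write(xml_text)
finally:
    out_file.close()

# Read the raw bytes back to confirm what landed on disk.
with open(path, 'rb') as handle:
    written = handle.read()
```

With lxml, the decode step would presumably be applied to the result of etree.tostring() in the same way.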
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
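A sketch of that suggestion, assuming the standard-library xml.etree.ElementTree and a source file saved as UTF-8. This is not the answerer's original listing: the temporary output path and the simplified isupper() check are illustrative choices.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
sect = ET.SubElement(root, 'sect')

# Readable literals are possible because of the coding line above.
words = (u'МОСКВА', u'RÉSUMÉ', u'Résumé', u'RéSUMé')
for word in words:
    # isupper() works directly on Unicode strings.
    tag = 'title' if word.isupper() else 'item'
    ET.SubElement(sect, tag).text = word

# ElementTree.write() handles the encoding and the XML declaration itself.
path = os.path.join(tempfile.mkdtemp(), 'test.xml')
ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)

# Read the raw bytes back to inspect the result.
with open(path, 'rb') as handle:
    written = handle.read()
```

Letting write() own the encoding step means the script never has to pair a byte string with a file object that expects Unicode.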
In addition to MRAB's answer, some lines of code:
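One possible set of such lines, offered as a hedged sketch rather than the answerer's own code. It uses the standard-library xml.etree.ElementTree as a stand-in for lxml and shows two things the question asks about: serializing without an encoding yields pure ASCII with character references, and already-encoded bytes can go straight into a file opened in binary mode, which is why the plain open() variant works.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
ET.SubElement(root, 'title').text = u'\u041c\u041e\u0421\u041a\u0412\u0410'

# Default serialization: ASCII only, non-ASCII characters become &#NNNN; references.
ascii_bytes = ET.tostring(root)

# Explicit UTF-8: a byte string that already holds the final file content.
utf8_bytes = ET.tostring(root, encoding='utf-8')

# A binary-mode file takes bytes as-is, so no implicit decode happens.
path = os.path.join(tempfile.mkdtemp(), 'test.xml')
with open(path, 'wb') as out_file:
    out_file.write(utf8_bytes)

# Read the raw bytes back to confirm nothing was re-encoded.
with open(path, 'rb') as handle:
    written = handle.read()
```

The question's observation that the file "becomes an ASCII file" without encoding='utf-8' suggests lxml's default behaves the same way, though that is inferred from the question rather than verified here.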