Why does printing to a UTF-8 file fail?

Posted on 2024-11-17 13:42:23

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.

This is related to a problem I had the other week: "python check if utf-8 string is uppercase".

Basically, the following will not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the new file with outFile = open('test.xml', 'w') instead of codecs.open('test.xml', 'w', 'utf-8'), it works perfectly.

So what's happening?

- Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
- If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
- But isn't the output of etree.tostring() simply being written to stdout and then redirected to a file that was created as a UTF-8 file?
- Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.

长不大的小祸害 2024-11-24 13:42:23

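You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring is encoding to UTF-8 (in the form of byte strings, str), and that is what you're writing to the file.

Because the file expects Unicode, it decodes the byte strings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the byte strings aren't ASCII.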
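A minimal sketch of the failing step, assuming Python 3 and the standard-library xml.etree.ElementTree in place of lxml (both are substitutions, not the setup from the question). Python 3 no longer decodes byte strings implicitly, so the ASCII decode that the Python 2 codecs writer performed behind the scenes is written out explicitly here:

```python
# Assumption: Python 3 with the stdlib ElementTree standing in for lxml.
import xml.etree.ElementTree as ET

root = ET.Element('root')
title = ET.SubElement(root, 'title')
title.text = '\u041c\u041e\u0421\u041a\u0412\u0410'  # the Cyrillic word from the question

# With an explicit byte encoding, tostring() returns UTF-8 encoded bytes.
utf8_bytes = ET.tostring(root, encoding='utf-8')

# The Python 2 codecs writer expected Unicode, so it first decoded the
# byte string with the default ASCII codec. Doing that by hand fails:
try:
    utf8_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(type(utf8_bytes).__name__)
print(error)
```

The byte the decoder stops at is 0xd0, the lead byte of the UTF-8 encoding of U+041C, which is the same byte the traceback in the question names.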

EDIT: The best advice for avoiding this kind of problem is to use Unicode internally, decoding on input and encoding on output.
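That advice can be sketched roughly as follows, assuming Python 3. The input bytes, the variable names, and the temporary output path are all hypothetical and chosen only for illustration:

```python
# Assumption: Python 3; every name and path below is illustrative only.
import os
import tempfile

# Input boundary: bytes arrive from outside and are decoded exactly once.
raw_input_bytes = b'R\xc3\x89SUM\xc3\x89'   # UTF-8 bytes for an accented, uppercase word
text = raw_input_bytes.decode('utf-8')

# Inside the program, work only with text (Unicode) strings.
is_title = text.isupper()

# Output boundary: the file object encodes the text exactly once.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to confirm what ended up on disk.
with open(out_path, 'rb') as f:
    written = f.read()

print(is_title, written == raw_input_bytes)
```

Because the text is encoded in only one place, nothing gets encoded twice and nothing is decoded with an unintended default codec.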

风吹雨成花 2024-11-24 13:42:23

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (though it doesn't support pretty_print). Wrap the root Element in an ElementTree and use its write method.

Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

凤舞天涯 2024-11-24 13:42:23

In addition to MRAB's answer, here are some lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
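For comparison, a rough Python 3 equivalent of the snippet above, assuming the standard-library xml.etree.ElementTree instead of lxml and a hypothetical temporary output path. In the stdlib, encoding='unicode' asks tostring() for a text string, while omitting the argument falls back to the us-ascii default, which is why dropping encoding='utf-8' in the question produced an ASCII-only file:

```python
# Assumption: Python 3 with the stdlib ElementTree; the output path is illustrative.
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
sect = ET.SubElement(root, 'sect')
title = ET.SubElement(sect, 'title')
title.text = '\u041c\u041e\u0421\u041a\u0412\u0410'

# Text (str) output: safe to hand to a file opened with an encoding.
xml_text = ET.tostring(root, encoding='unicode')

# Default output: ASCII-only bytes, with non-ASCII characters written
# as numeric character references such as &#1052;.
ascii_bytes = ET.tostring(root)

out_path = os.path.join(tempfile.mkdtemp(), 'test.xml')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(xml_text)   # the file object performs the single UTF-8 encode

with open(out_path, 'rb') as f:
    written = f.read()

print(type(xml_text).__name__, type(ascii_bytes).__name__)
```

Either hand text to an encoding-aware file, as here, or hand already-encoded bytes to a plain binary file. Mixing the two is what triggered the error in the question.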
