Why does printing to a UTF-8 file fail?

Posted on 2024-11-17 13:42:23

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix works.

This is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following will not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.

So what's happening?

  • Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?

  • If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?

  • But isn't the output of etree.tostring() simply being written to stdout and then redirected to a file that was created as a UTF-8 file?

    • Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.

Comments (3)

长不大的小祸害 2024-11-24 13:42:23

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.

Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestrings aren't ASCII.
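The failure described above can be reproduced in a few lines. This is only a sketch, written for Python 3, where the decode step has to be spelled out because Python 3 no longer does it implicitly. The helper name py2_style_write is hypothetical and merely mimics what the answer says the Python 2 writer does with a byte string.

```python
# Python 3 sketch: spell out the decode step that Python 2 performed implicitly.

# tostring(..., encoding='utf-8') hands back bytes, not text.
xml_bytes = "<title>МОСКВА</title>".encode("utf-8")


def py2_style_write(data):
    """Hypothetical stand-in for a Python 2 UTF-8 codecs writer given a byte string.

    The writer wants text, so the bytes are first decoded with the default
    ASCII codec and only then encoded to UTF-8.
    """
    return data.decode("ascii").encode("utf-8")


# Pure-ASCII bytes survive the round trip unchanged.
ascii_result = py2_style_write(b"<root/>")

# Bytes that are already UTF-8 encoded fail at the ASCII decode step.
try:
    py2_style_write(xml_bytes)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(ascii_result)
print(error)
```

In this sketch the offending byte is 0xd0, the first byte of the UTF-8 encoding of 'М' (U+041C), which is the same byte value the traceback in the question reports.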

EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
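As a rough illustration of that advice, the sketch below (Python 3, standard library only) decodes bytes once on the way in, works with text in the middle, and lets the file object encode once on the way out. The file name out.txt and the sample word are assumptions made for the example.

```python
import os
import tempfile

# Decode on input: bytes arriving from outside become text exactly once.
raw_input_bytes = "Résumé".encode("utf-8")
text = raw_input_bytes.decode("utf-8")

# Work with text internally; str methods understand non-ASCII letters.
shouted = text.upper()

# Encode on output: the file object opened with an encoding does the work.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(shouted)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    written = f.read()

print(shouted)
```

The point of the pattern is that encoded bytes never reach a writer that expects text, so no implicit ASCII decode can be triggered.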

风吹雨成花 2024-11-24 13:42:23

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (although it doesn't support pretty_print). Wrap the root Element in an ElementTree and use its write method.

Also, if you use a # coding line to declare the encoding of the source file, you can write readable Unicode strings instead of escape codes:

#!/usr/bin/python
# coding: utf8

from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root, u'sect')

words = [u'МОСКВА', u'RÉSUMÉ', u'Résumé', u'RéSUMé']

for word in words:
    print word
    # unicode.isupper() handles non-ASCII letters correctly
    if word.isupper():
        title = etree.SubElement(sect, u'title')
        title.text = word
    else:
        item = etree.SubElement(sect, u'item')
        item.text = word

# Let ElementTree encode the output and write the file itself
tree = etree.ElementTree(root)
tree.write('text.xml', xml_declaration=True, encoding='utf-8')

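For readers on Python 3, the same ElementTree.write approach might look like the sketch below. It uses the standard library xml.etree rather than lxml, and it writes into a temporary directory, which is an assumption made so the example has no side effects.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

root = ET.Element("root")
sect = ET.SubElement(root, "sect")

# In Python 3 every str is Unicode, so isupper() works on these directly.
for word in ["МОСКВА", "RÉSUMÉ", "Résumé", "RéSUMé"]:
    tag = "title" if word.isupper() else "item"
    ET.SubElement(sect, tag).text = word

# ElementTree.write opens the file and encodes the document itself,
# so no separately opened file object is involved.
path = os.path.join(tempfile.mkdtemp(), "text.xml")
ET.ElementTree(root).write(path, xml_declaration=True, encoding="utf-8")

# Read the raw bytes back to inspect the result.
with open(path, "rb") as f:
    data = f.read()

print(data.decode("utf-8"))
```

Because the library owns both the encoding and the file handle here, the mismatch between encoded bytes and a text-expecting writer cannot occur.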
凤舞天涯 2024-11-24 13:42:23

In addition to MRAB's answer, here are a few lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
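The underlying rule is that text goes to a text-mode file and bytes go to a binary-mode file. A minimal Python 3 sketch of both pairings follows. It assumes the standard library xml.etree instead of lxml, where the text-returning form is spelled encoding="unicode", and the file names are made up for the example.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "title").text = "МОСКВА"

as_bytes = ET.tostring(root, encoding="utf-8")    # bytes, already encoded
as_text = ET.tostring(root, encoding="unicode")   # str, not yet encoded

out_dir = tempfile.mkdtemp()

# Text goes to a file opened in text mode; the file object does the encoding.
text_path = os.path.join(out_dir, "text_mode.xml")
with open(text_path, "w", encoding="utf-8") as f:
    f.write(as_text)

# Bytes go to a file opened in binary mode; nothing is re-encoded.
bytes_path = os.path.join(out_dir, "binary_mode.xml")
with open(bytes_path, "wb") as f:
    f.write(as_bytes)

# Read both files back as raw bytes for comparison.
with open(text_path, "rb") as f:
    text_mode_bytes = f.read()
with open(bytes_path, "rb") as f:
    binary_mode_bytes = f.read()

print(type(as_bytes).__name__, type(as_text).__name__)
```

Either pairing is consistent. The question's failing script mixed them by handing already-encoded bytes to a writer that expected text.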