How to write out an ElementTree with the UTF-8 characters stripped

Posted on 2024-12-12 10:57:51

I have a giant (50MB) XML ElementTree that I've generated and somewhere in the raw data were some UTF-8 letters that didn't get stripped out. ElementTree.write and .tostring seem to choke on unicode even though there's an "encoding='UTF-8'" option in tostring. The docs are rather limited and I'm not even sure that tostring is UTF-8 friendly (looking at the source).

So my question - how do I strip this whole elementtree of any non-ascii characters so I can write this monster to disk (which took 8 hours to generate)? I have pickled it for now. I also used a function called latin1_to_ascii on most of the data:

def latin1_to_ascii(unicrap):
    """
    This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. Anything not converted is deleted.
    The unicode hammer approach:
    http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
            0xc6:'Ae', 0xc7:'C',
            0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
            0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
            0xd0:'Th', 0xd1:'N',
            0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
            0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
            0xdd:'Y', 0xde:'th', 0xdf:'ss',
            0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
            0xe6:'ae', 0xe7:'c',
            0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
            0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
            0xf0:'th', 0xf1:'n',
            0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
            0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
            0xfd:'y', 0xfe:'th', 0xff:'y',
            0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
            0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
            0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
            0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
            0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
            0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
            0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>', 
            0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
            0xd7:'*', 0xf7:'/',0x92:'a'
            }
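    r = ''
    for i in unicrap:
        if ord(i) in xlate:  # `in` replaces dict.has_key(), which Python 3 removed
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass  # drop any non-ASCII character that has no replacement
        else:
            r += str(i)
    return r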

That "nuclear option" function only works on strings, and now that the data is in an Element I can't seem to strip the stuff I missed.
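
One way to reuse a string-only cleaner on data that already lives in an Element is to walk the tree and apply the cleaner to each element's text, tail, and attribute values. The following is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree. The names clean_tree and strip_non_ascii are hypothetical, and strip_non_ascii is only a stand-in that drops non-ASCII characters; latin1_to_ascii could be passed in its place.

```python
import xml.etree.ElementTree as ET


def strip_non_ascii(text):
    """Stand-in cleaner (hypothetical): drop every character above 0x7F."""
    return ''.join(ch for ch in text if ord(ch) < 0x80)


def clean_tree(root, cleaner=strip_non_ascii):
    """Apply `cleaner` in place to all text, tails, and attribute values."""
    for elem in root.iter():  # visits root itself plus every descendant
        if elem.text is not None:
            elem.text = cleaner(elem.text)
        if elem.tail is not None:
            elem.tail = cleaner(elem.tail)
        for key, value in list(elem.attrib.items()):
            elem.attrib[key] = cleaner(value)
    return root


# Small illustrative tree with non-ASCII data in text, tail, and an attribute.
root = ET.Element('root')
child = ET.SubElement(root, 'item', {'name': 'caf\u00e9'})
child.text = 'na\u00efve'
child.tail = '\u2014end'
clean_tree(root)
print(ET.tostring(root, encoding='unicode'))
```

Because the walk mutates the elements in place, the same approach could be applied to a tree loaded back from a pickle before writing it to disk. Tag names and attribute names are not touched in this sketch.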

情何以堪。 2024-12-19 10:57:51

You need to explain "somewhere in the raw data were some UTF-8 letters that didn't get stripped out" -- like what is a "UTF-8 letter", and why you want to strip them out.

It would also help if you explained what "ElementTree.write and .tostring seem to choke on unicode" means. Please edit your question to show the full error message and traceback.

Why do you want to use that function to bash your unicode into ASCII? Is it merely to overcome the problems that you are having?

It is probable that you are feeding str objects encoded in UTF-8 to ElementTree. Don't do that. Feed it unicode objects, and it just works:

>>> import xml.etree.ElementTree as et
>>> e = et.Element('root')
>>> e.text = u''.join(unichr(i) for i in xrange(0x400, 0x408))
>>> e.text
u'\u0400\u0401\u0402\u0403\u0404\u0405\u0406\u0407'

If you must have ASCII output (you're communicating over a 7-bit-wide channel?):

>>> et.tostring(e)
'<root>&#1024;&#1025;&#1026;&#1027;&#1028;&#1029;&#1030;&#1031;</root>'

UTF-8 works:

>>> et.tostring(e, 'UTF-8')
"<?xml version='1.0' encoding='UTF-8'?>\n<root>\xd0\x80\xd0\x81\xd0\x82\xd0\x83\xd0\x84\xd0\x85\xd0\x86\xd0\x87</root>"

You should use the ElementTree.write method to write your file, in preference to using 'tostring'; it saves double-handling.
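
As a rough illustration of that advice, here is a sketch that keeps text strings in the tree and lets ElementTree.write handle the encoding. It assumes Python 3 (the session above is Python 2, so chr and range stand in for unichr and xrange), and the temporary directory and file names are purely illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
# Store text (unicode) in the tree, never pre-encoded bytes.
root.text = ''.join(chr(i) for i in range(0x400, 0x408))

tree = ET.ElementTree(root)
out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, 'utf8.xml')
ascii_path = os.path.join(out_dir, 'ascii.xml')

# UTF-8 output: the characters are written as multi-byte sequences.
tree.write(utf8_path, encoding='UTF-8', xml_declaration=True)

# ASCII output: non-ASCII characters become numeric character references.
tree.write(ascii_path, encoding='us-ascii')

with open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
with open(ascii_path, 'rb') as f:
    ascii_bytes = f.read()
```

Either file parses back to the same text, so the choice of output encoding does not require stripping anything from the tree.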


佞臣 2024-12-19 10:57:51

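I would run the process again, decoding the input strings to unicode while the tree is being created. Eight hours may be a long time, but you can do something else in the meantime rather than wait for someone else's instructions on patching the tree in memory.

Be sure to test on a small subset of your data to confirm that your code works before running the whole thing.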
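
A sketch of what decoding during tree creation might look like, assuming Python 3, UTF-8 input, and records that arrive as (tag, value) pairs. The helper names to_text and build_tree and the sample data are hypothetical, since the original generation code is not shown.

```python
import xml.etree.ElementTree as ET


def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_tree(records):
    """Build a small tree from (tag, raw_value) pairs, decoding each value."""
    root = ET.Element('root')
    for tag, raw in records:
        ET.SubElement(root, tag).text = to_text(raw)
    return root


# Try a small sample first, including one deliberately non-ASCII value.
sample = [('name', b'Zo\xc3\xab'), ('city', 'Oslo')]
root = build_tree(sample)
xml_bytes = ET.tostring(root, encoding='UTF-8')

# Round-trip check: the serialized sample parses back to the same text.
assert ET.fromstring(xml_bytes).find('name').text == 'Zo\u00eb'
```

Running a check like this on a few records takes seconds and shows whether serialization will succeed before committing to the full eight-hour run.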


白馒头 2024-12-19 10:57:51

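It seems more likely to me that the problem lies with the encoding of the output, such as the file you are writing to. Could you provide more code showing how you are writing it? I don't see how ElementTree.write() and ElementTree.tostring() would choke on it.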


哆兒滾 2024-12-19 10:57:51

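OK, even if you all think I'm crazy for doing it this way, it worked:

I opened the pickle file in Notepad++, found all the "\x??" characters by hand with a regular expression, and deleted them. Then, at the command line, I loaded the pickle into Python and used ElementTree to save it as an XML file: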

>>> f = open('pulsewire/pulse_cleaned.pickle', 'rb')
>>> import pickle
>>> data = pickle.load(f)
>>> import xml.etree.ElementTree as ET
>>> bob = ET.ElementTree(data)  # need to wrap the Element in a Tree first
>>> bob.write("pulsewire/testtree.xml")