How to write out an ElementTree with UTF-8 characters stripped
I have a giant (50MB) XML ElementTree that I've generated and somewhere in the raw data were some UTF-8 letters that didn't get stripped out. ElementTree.write and .tostring seem to choke on unicode even though there's an "encoding='UTF-8'" option in tostring. The docs are rather limited and I'm not even sure that tostring is UTF-8 friendly (looking at the source).
So my question - how do I strip this whole elementtree of any non-ascii characters so I can write this monster to disk (which took 8 hours to generate)? I have pickled it for now. I also used a function called latin1_to_ascii on most of the data:
def latin1_to_ascii(unicrap):
    """
    This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. Anything not converted is deleted.
    #the unicode hammer approach: http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
    """
    # Maps a Latin-1 code point to its ASCII replacement.
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/',0x92:'a'
    }
    r = ''
    for i in unicrap:
        code = ord(i)
        if code in xlate:  # dict.has_key() no longer exists in Python 3
            r += xlate[code]
        elif code >= 0x80:
            pass  # no ASCII equivalent in the table: drop the character
        else:
            r += str(i)
    return r
That "nuclear option" function only works on strings, and now that I have the data in an Element I can't seem to strip the stuff I missed.
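A string-only function can still be applied to an existing tree by visiting every element and cleaning its text, tail and attribute values in place. The following is only a minimal sketch, assuming Python 3 (where str is the Unicode type); to_ascii and strip_non_ascii are hypothetical helper names, not part of ElementTree. Tag names and attribute names are left untouched.

```python
import xml.etree.ElementTree as ET

def to_ascii(text):
    """Drop every non-ASCII character; None (no text) passes through unchanged."""
    if text is None:
        return None
    return text.encode("ascii", "ignore").decode("ascii")

def strip_non_ascii(root, clean=to_ascii):
    """Apply `clean` to the text, tail and attribute values of every element, in place."""
    for elem in root.iter():  # iter() yields root and all of its descendants
        elem.text = clean(elem.text)
        elem.tail = clean(elem.tail)
        for key, value in list(elem.attrib.items()):
            elem.attrib[key] = clean(value)
    return root

# Small stand-in tree containing non-ASCII characters in all three places.
root = ET.fromstring('<doc title="r\u00e9sum\u00e9"><p>caf\u00e9</p> na\u00efve</doc>')
strip_non_ascii(root)
cleaned = ET.tostring(root)  # default encoding is us-ascii
```

A transliterating function such as latin1_to_ascii could be passed as the clean argument instead, provided it is first wrapped to return None unchanged, since elements without text or tail hold None rather than an empty string.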
Comments (4)
You need to explain "somewhere in the raw data were some UTF-8 letters that didn't get stripped out" -- like what is a "UTF-8 letter", and why you want to strip them out.
It would also help if you explained what "ElementTree.write and .tostring seem to choke on unicode" means. Please edit your question to show the full error message and traceback.
Why do you want to use that function to bash your unicode into ASCII? Is it merely to overcome the problems that you are having?
It is probable that you are feeding str objects encoded in UTF-8 to ElementTree. Don't do that. Feed it unicode objects, and it just works:
If you must have ASCII output (you're communicating over a 7-bit-wide channel?):
UTF-8 works:
You should use the ElementTree.write method to write your file, in preference to using 'tostring'; it saves double-handling.
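The points in this answer can be illustrated with a short sketch. It is not the answer's original code; it assumes Python 3, where str is the Unicode type and tostring returns bytes, and it uses an in-memory buffer in place of a real output file.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "caf\u00e9"  # a Unicode text string, not UTF-8 encoded bytes

# ASCII output: non-ASCII characters are written as numeric character references.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

# UTF-8 output: the characters are encoded directly.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# ElementTree.write serializes straight to a file object opened in binary mode.
buffer = io.BytesIO()  # stands in for open("some_file.xml", "wb")
ET.ElementTree(root).write(buffer, encoding="utf-8")
```

Either encoding yields well-formed XML that a parser reads back to the same text, so stripping characters is only needed if the consumer cannot handle character references or UTF-8.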
I'd run the process again, decoding the input strings to unicode during the tree creation. Eight hours may be a long time, but you can do other things instead of waiting for pointers on in-memory patching from others.
Make sure to test on a small subset of the data to confirm your code works before continuing on.
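As a rough sketch of that suggestion, the byte strings can be decoded at the point where they enter the tree, and the whole pipeline tried on a few records first. This assumes Python 3 and that the raw input really is UTF-8; build_tree, raw_records and the element names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

def build_tree(raw_records, encoding="utf-8"):
    """Build a tree from byte strings, decoding each one before it enters the tree."""
    root = ET.Element("records")
    for raw in raw_records:
        item = ET.SubElement(root, "record")
        item.text = raw.decode(encoding)  # raises UnicodeDecodeError on bad input
    return root

# Hypothetical raw data: UTF-8 encoded byte strings.
raw_records = [b"plain", "caf\u00e9".encode("utf-8"), "na\u00efve".encode("utf-8")]

# Try the pipeline on a small subset before committing to the full run.
sample = list(itertools.islice(raw_records, 2))
sample_root = build_tree(sample)
sample_xml = ET.tostring(sample_root, encoding="utf-8")
```

Because decode raises on malformed input, a bad record shows up immediately during the trial run rather than at serialization time hours later.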
It sounds to me like the problem is more likely to be the encoding of the output file-like object that you're working with. Could you provide more code showing how you're trying to write it out? I don't see how ElementTree.write() and ElementTree.tostring() could be choking on it.
Okay even if you guys think I'm crazy for doing it this way, it works:
I opened the pickle file in Notepad++ and manually found all the "\x??" characters with regex, and removed them. Then I imported the pickle into python to save as an XML file using ElementTree at the command line:
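The commands used are not preserved here. A minimal sketch of loading a pickled tree and saving it as XML might look like the following, assuming Python 3; the in-memory buffers and the sample element stand in for the real pickle and output files, whose names are not given.

```python
import io
import pickle
import xml.etree.ElementTree as ET

# Stand-in for the pickle file on disk (hypothetical content).
original = ET.Element("root")
ET.SubElement(original, "item").text = "plain ascii"
pickled = io.BytesIO(pickle.dumps(original))

# Load the pickled element back and write it out as XML.
root = pickle.load(pickled)
output = io.BytesIO()  # stands in for a file opened with mode "wb"
ET.ElementTree(root).write(output, encoding="us-ascii")
```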