How to parse Unicode strings with minidom?
I'm trying to parse a bunch of XML files with the xml.dom.minidom library, to extract some data and put it in a text file. Most of the XML files parse fine, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
Try to decode it:
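A minimal sketch of the idea, assuming the XML is held as a Unicode string (the sample content and variable names here are illustrative, not from the original post). Note that the conversion needed is text to bytes, so the call is encode():

```python
from xml.dom import minidom

# Hypothetical sample: XML held as text, containing a non-ASCII character (U+2019).
xml_text = u"<note>it\u2019s fine</note>"

# Encode the text to UTF-8 bytes so the parser receives a byte stream.
xml_bytes = xml_text.encode("utf-8")

dom = minidom.parseString(xml_bytes)

# The parsed text node comes back as Unicode text again.
note_text = dom.documentElement.firstChild.data
```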
In case your string is 'str':
This worked for me.
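When it isn't known in advance whether the input is text or bytes, the same approach can be wrapped in a small helper. This is only a sketch: parse_xml is a hypothetical name, and it assumes UTF-8 is the right encoding for text input.

```python
from xml.dom import minidom


def parse_xml(source):
    """Parse XML given as either text or bytes (hypothetical helper)."""
    # Assumption: text input should be serialised as UTF-8 before parsing.
    if not isinstance(source, bytes):
        source = source.encode("utf-8")
    return minidom.parseString(source)


# Usage: both calls yield the same document content.
from_text = parse_xml(u"<a>\u2019</a>")
from_bytes = parse_xml(u"<a>\u2019</a>".encode("utf-8"))
```

If the text carries an XML declaration naming a different encoding, encode to that encoding instead, since the parser trusts the declaration when it is handed bytes.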
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse(), which will read a file directly.
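Both options can be sketched as follows. The sample file is created only so the example is self-contained; its content and the variable names are assumptions, not part of the original answer.

```python
import os
import tempfile
from xml.dom import minidom

# Create a small UTF-8 sample file (hypothetical content for illustration).
sample = u'<?xml version="1.0" encoding="utf-8"?><note>it\u2019s fine</note>'
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as fh:
    fh.write(sample.encode("utf-8"))

# Option 1: read the file as bytes and hand the bytes to parseString().
with open(path, "rb") as fh:
    dom_from_bytes = minidom.parseString(fh.read())

# Option 2: let minidom open and read the file itself.
dom_from_file = minidom.parse(path)

text_from_bytes = dom_from_bytes.documentElement.firstChild.data
text_from_file = dom_from_file.documentElement.firstChild.data

os.remove(path)
```

In both cases the parser sees raw bytes and works out the encoding from the XML declaration, so no manual decoding step is involved.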
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
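A minimal sketch of that kind of wrapping, not necessarily the exact code from this answer. It assumes the target is a binary file-like object; codecs.getwriter("utf-8") returns the UTF-8 StreamWriter class, and write_dom_utf8 is a hypothetical helper name.

```python
import codecs
import io
from xml.dom import minidom


def write_dom_utf8(dom, binary_file):
    """Write a DOM to a binary file-like object as UTF-8 (hypothetical helper)."""
    # Wrap the handle so that text written by writexml() is encoded as UTF-8.
    writer = codecs.getwriter("utf-8")(binary_file)
    # The encoding argument only sets the encoding named in the XML declaration.
    dom.writexml(writer, encoding="utf-8")


# Usage with an in-memory buffer standing in for a real file opened in "wb" mode.
dom = minidom.parseString(u"<note>it\u2019s fine</note>".encode("utf-8"))
buffer = io.BytesIO()
write_dom_utf8(dom, buffer)
xml_bytes = buffer.getvalue()
```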
I've run into this error a few times, and my hacky way of dealing with it is just to do this:
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.