How do I parse Unicode strings with minidom?

Posted 2024-10-23 20:58:28 · 317 words · 1 view · 0 comments

I'm trying to parse a bunch of XML files with the xml.dom.minidom library, to extract some data and put it in a text file. Most of the XML files parse fine, but for some of them I get the following error when calling minidom.parseString():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?

Comments (5)

乙白 2024-10-30 20:58:28

Try to decode it:

> print u'abcdé'.encode('utf-8')
> abcdé

> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé

澉约 2024-10-30 20:58:28

In case your string is 'str':

xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))

This worked for me.
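
A minimal sketch of the same idea, written so it also runs on Python 3: encode the text to UTF-8 bytes before handing it to parseString(). The helper name parse_unicode_xml and the sample document are hypothetical, and the sketch assumes that any encoding named in the XML declaration is UTF-8 or that there is no declaration at all.

```python
from xml.dom import minidom


def parse_unicode_xml(text):
    # Encode the text to UTF-8 bytes so the parser receives a byte string.
    # Assumption: the XML declaration, if present, does not name a
    # different encoding than UTF-8.
    return minidom.parseString(text.encode("utf-8"))


# Hypothetical sample containing U+2019 (right single quotation mark).
doc = parse_unicode_xml(u"<note>It\u2019s fine</note>")
note_text = doc.documentElement.firstChild.data
```

If the declaration names another encoding, the bytes should be encoded with that encoding instead.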

吹泡泡o 2024-10-30 20:58:28

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
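
Both options can be sketched roughly as follows. The helper names, the temporary file, and the sample document are all hypothetical; the only minidom calls used are parseString() and parse().

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample document containing U+2019.
SAMPLE = u'<?xml version="1.0" encoding="utf-8"?><note>It\u2019s fine</note>'


def parse_from_bytes(path):
    # Open in binary mode so the parser gets raw bytes and decodes them
    # itself, using the encoding named in the XML declaration.
    with open(path, "rb") as fh:
        return minidom.parseString(fh.read())


def parse_from_path(path):
    # Let minidom open and read the file directly.
    return minidom.parse(path)


# Write the sample to a temporary file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as fh:
    fh.write(SAMPLE.encode("utf-8"))

try:
    text_a = parse_from_bytes(path).documentElement.firstChild.data
    text_b = parse_from_path(path).documentElement.firstChild.data
finally:
    os.remove(path)
```

In both cases no decoding happens before the parser sees the data, so the non-ASCII characters never pass through the ASCII codec.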

一杯敬自由 2024-10-30 20:58:28

I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.

My code which was throwing the UnicodeEncodeError looked like:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")

Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:

import codecs
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as fh:
    # codecs.lookup("utf-8")[3] is the UTF-8 StreamWriter class
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")

This wraps the file handle in an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
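
A self-contained sketch of the same wrapping technique, using codecs.getwriter("utf-8"), which returns the same StreamWriter class as codecs.lookup("utf-8")[3]. The helper name write_dom_utf8 and the sample DOM are hypothetical and only there to make the example runnable.

```python
import codecs
import os
import tempfile
from xml.dom import minidom


def write_dom_utf8(dom, path):
    # Open the file in binary mode and wrap it in a UTF-8 StreamWriter,
    # so the text that writexml() emits is encoded as UTF-8 on the way out.
    with open(path, "wb") as raw:
        writer = codecs.getwriter("utf-8")(raw)
        # encoding= only sets the value written into the XML declaration.
        dom.writexml(writer, encoding="utf-8")


# Hypothetical DOM containing U+2019.
dom = minidom.parseString(u"<note>It\u2019s fine</note>".encode("utf-8"))

fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
try:
    write_dom_utf8(dom, path)
    with open(path, "rb") as fh:
        written = fh.read()
finally:
    os.remove(path)
```

The file then contains the UTF-8 bytes for U+2019 and a declaration that names the same encoding.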

把昨日还给我 2024-10-30 20:58:28

I have run into this error a few times, and my hacky way of dealing with it is just to do this:

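def getCleanString(word):
   # The result must not be named "str": that would shadow the built-in
   # and make every str(character) call below fail.
   clean = ""
   for character in word:
      try:
         clean = clean + str(character)
      except UnicodeEncodeError:
         pass # non-ASCII character (Python 2): skip it
   return clean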

Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
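
For comparison, a shorter sketch of the same stripping approach that behaves the same on Python 2 and Python 3. The function name strip_non_ascii is hypothetical. Like the loop above, it silently discards every non-ASCII character, including U+2019 from the question, so the extracted text loses data.

```python
def strip_non_ascii(word):
    # Encode to ASCII and ignore anything that does not fit, then decode
    # back to text. Every non-ASCII character is dropped without warning.
    return word.encode("ascii", "ignore").decode("ascii")


cleaned = strip_non_ascii(u"It\u2019s caf\u00e9 time")
```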
