How do I parse Unicode strings with minidom?

Posted 2024-10-23 20:58:28

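I am trying to parse a batch of XML files with the xml.dom.minidom library, to extract some data and put it in a text file. Most of the XML files work fine, but for some of them I get the following error when calling minidom.parseString():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

The same thing happens with some other non-ASCII characters. My question is: what are my options? Am I supposed to somehow strip or replace all of these non-English characters before I can parse the XML files?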

乙白 2024-10-30 20:58:28

Try to decode it:

>>> print u'abcdé'.encode('utf-8')
abcdÃ©

>>> print u'abcdé'.encode('utf-8').decode('utf-8')
abcdé

澉约 2024-10-30 20:58:28

If your string is in a variable named str:

xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))

This worked for me.
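A minimal sketch of the same idea for Python 3, where `str` is already Unicode text: encode the input to UTF-8 bytes before handing it to `minidom.parseString()`. The helper name `parse_xml_text` is hypothetical, and the sketch assumes the document either declares UTF-8 or has no XML declaration at all.

```python
from xml.dom import minidom


def parse_xml_text(xml_text):
    """Parse XML given as text (str) or as already-encoded bytes.

    Hypothetical helper: assumes the XML declares UTF-8 or has no declaration.
    """
    if isinstance(xml_text, str):
        # Encode explicitly so the parser receives a byte stream.
        xml_text = xml_text.encode("utf-8")
    return minidom.parseString(xml_text)


doc = parse_xml_text("<note>It\u2019s fine</note>")
text = doc.documentElement.firstChild.data
```

Here `text` holds the original string with the right single quotation mark (U+2019) intact, so nothing has to be stripped before parsing.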

吹泡泡o 2024-10-30 20:58:28

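Minidom doesn't directly support parsing Unicode strings; historically this has been poorly supported and poorly standardised. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass those to parseString(), or just use parse(), which reads the file directly.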
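Both options can be sketched as follows, written for Python 3. The sample XML content and the temporary file are illustrative assumptions, standing in for one of the real input files.

```python
import os
import tempfile
from xml.dom import minidom

# Write a small UTF-8 encoded XML file to work with (illustrative data).
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?><q>\u2019quoted\u2019</q>'
).encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as fh:
    fh.write(xml_bytes)
    path = fh.name

try:
    # Option 1: read raw bytes (binary mode) and pass them to parseString().
    with open(path, "rb") as f:
        doc_from_bytes = minidom.parseString(f.read())

    # Option 2: let parse() open and read the file itself.
    doc_from_file = minidom.parse(path)
finally:
    os.remove(path)

text_from_bytes = doc_from_bytes.documentElement.firstChild.data
text_from_file = doc_from_file.documentElement.firstChild.data
```

In both cases the parser decodes the bytes itself, using the encoding named in the XML declaration, so the calling code never has to decode or re-encode the data.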

一杯敬自由 2024-10-30 20:58:28

I know the OP asked about parsing strings, but I hit the same exception when writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.

My code that was throwing the UnicodeEncodeError looked like this:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")

Note that the "encoding" parameter only affects the XML header and has no effect on how the data is handled. To fix it, I changed the code to:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")

This wraps the file handle in an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
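A self-contained sketch of the same wrapping technique, written for Python 3 and using `codecs.getwriter("utf-8")`, which returns the same StreamWriter class as `codecs.lookup("utf-8")[3]`. The `io.BytesIO` buffer is an assumption that stands in for a file opened in binary mode, and the sample document is illustrative.

```python
import codecs
import io
from xml.dom import minidom

dom = minidom.parseString("<msg>It\u2019s done</msg>".encode("utf-8"))

# io.BytesIO stands in for a file handle opened in binary mode.
raw = io.BytesIO()

# Wrap the binary stream so that text written to it is encoded as UTF-8.
writer = codecs.getwriter("utf-8")(raw)

# encoding= only sets the value shown in the XML declaration.
dom.writexml(writer, encoding="utf-8")

written = raw.getvalue()
```

After the call, `written` contains the XML declaration followed by the UTF-8 encoded document, and it can be parsed again with `minidom.parseString(written)`.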

把昨日还给我 2024-10-30 20:58:28

I have run into this error a few times, and my hacky way of dealing with it is just to do this:

def getCleanString(word):
    clean = ""
    for character in word:
        try:
            # str() raises UnicodeEncodeError for non-ASCII characters (Python 2)
            clean += str(character)
        except UnicodeEncodeError:
            pass  # skip characters that cannot be represented as ASCII
    return clean

Of course, this is probably a dumb way of doing it, but it gets the job done for me and doesn't cost me anything in speed.
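The character-by-character loop above can also be expressed with the error handlers built into `str.encode()`. This is a sketch for Python 3; the function names `strip_non_ascii` and `escape_non_ascii` are hypothetical. The first drops non-ASCII characters, as the loop does, while the second keeps them as XML numeric character references so no information is lost.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


def escape_non_ascii(text):
    """Replace non-ASCII characters with XML numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "It\u2019s caf\u00e9 time"
stripped = strip_non_ascii(sample)   # "Its caf time"
escaped = escape_non_ascii(sample)   # "It&#8217;s caf&#233; time"
```

Stripping silently changes the data, so the escaping variant is usually the safer choice when the text ends up inside an XML document.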
