Python.expat cannot parse an XML file containing bad symbols. How can I work around it?

Posted on 2024-08-26 07:30:07

I'm trying to parse an XML file (OSM data) with expat, and there are lines with some Unicode characters that expat can't parse:

<tag k="name"
v="абвгдежзиклмнопр�?туфхцчшщьыъ�?ю�?�?БВГДЕЖЗИКЛМ�?ОПРСТУФХЦЧШЩЬЫЪЭЮЯ" />

<tag k="name" v="Cin\x8e? Rex" />

(XML file encoding in the opening line is "UTF-8")

The file is quite old, and there must have been errors. In modern files I don't see UTF-8 errors, and they are parsed fine. But what if my program meets a broken symbol, what workaround can I make? Is it possible to join bz2 codec (I parse a compressed file) and utf-8 codec to ignore the broken characters, or change them to "?"?

若水般的淡然安静女子 2024-09-02 07:30:07

Not sure whether the '�' characters were introduced by copy-pasting the string here,
but if they are in the original data, then it looks like a problem in the generator,
which introduced U+FFFD characters. That character is described as:

"used to replace an incoming character whose value is unknown or
unrepresentable in Unicode"

cited from:
http://www.fileformat.info/info/unicode/char/fffd/index.htm

Workaround? Just an idea to extend:

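import xml.parsers.expat
from xml.parsers.expat import errors

# f: input file opened in binary mode
# xp: parser created with xml.parsers.expat.ParserCreate()
# NOTE: this is an outline only. An expat parser generally cannot continue
# after it has reported an error, so a real retry likely needs a fresh parser.
good = True
buf = None
while True:
    if good:
        buf = f.read(buf_size)
    else:
        # try again with the cleaned buffer
        pass
    try:
        xp.Parse(buf, len(buf) == 0)
        if len(buf) == 0:
            break
        good = True
    except xml.parsers.expat.ExpatError as e:
        # raw invalid bytes are usually reported as XML_ERROR_INVALID_TOKEN
        if errors.messages[e.code] in (errors.XML_ERROR_INVALID_TOKEN,
                                       errors.XML_ERROR_BAD_CHAR_REF):
            # look at xp.ErrorByteIndex (or nearby)
            # for 0xEF 0xBF 0xBD (UTF-8 replacement char) and remove it
            good = False
        else:
            # other errors: handle them here or re-raise
            raise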

Alternatively, clean the input buffer before parsing, taking care of corner cases
(a partial multi-byte sequence at the end of the buffer).
I can't recall whether Python's expat lets you assign a custom error handler;
that would make this easier.
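
The "clean the input first" route could be sketched roughly as below. This is one possible approach under stated assumptions, not a definitive fix: it assumes the file is meant to be UTF-8 and is bz2-compressed, as in the question, and `parse_lenient` with its parameters is a hypothetical helper name. An incremental UTF-8 decoder with errors="replace" turns invalid byte sequences into U+FFFD (or, optionally, "?") and also copes with a multi-byte sequence split across two buffers.

```python
import bz2
import codecs
import io
import xml.parsers.expat


def parse_lenient(stream, parser, chunk_size=64 * 1024, replacement="\ufffd"):
    """Feed a binary stream to an expat parser, replacing invalid UTF-8 bytes.

    Hypothetical helper: assumes the document is supposed to be UTF-8.
    """
    # The incremental decoder holds back a multi-byte sequence that is split
    # across two chunks, so chunk boundaries do not create false errors.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of file
        text = decoder.decode(chunk, final)
        if replacement != "\ufffd":
            # Note: this also rewrites any genuine U+FFFD already in the data.
            text = text.replace("\ufffd", replacement)
        # Re-encode so that expat receives clean UTF-8 bytes.
        parser.Parse(text.encode("utf-8"), final)
        if final:
            break


if __name__ == "__main__":
    # In-memory stand-in for a compressed OSM file with one broken byte.
    raw = b'<osm><tag k="name" v="Cin\x8e Rex"/></osm>'
    demo_parser = xml.parsers.expat.ParserCreate()
    demo_parser.StartElementHandler = lambda name, attrs: print(name, attrs)
    with bz2.open(io.BytesIO(bz2.compress(raw)), "rb") as f:
        parse_lenient(f, demo_parser, replacement="?")
```

Decoding and re-encoding costs some speed, but it keeps the parser itself untouched, and every damaged byte sequence ends up as a visible marker in the attribute value instead of aborting the parse.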

If I remove the '�' characters from your sample, it is processed OK.
\xd1 does not fail.
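
To check which error expat actually reports for a given sample, a small diagnostic along these lines may help. `describe_failure` is a hypothetical helper; it relies only on `ExpatError.code`, the `xml.parsers.expat.errors.messages` mapping, and the parser's `ErrorByteIndex` attribute.

```python
import xml.parsers.expat
from xml.parsers.expat import errors


def describe_failure(data):
    """Return (message, byte_index) if expat rejects `data`, else None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        # ErrorByteIndex stays readable on the parser after the exception.
        return errors.messages[exc.code], parser.ErrorByteIndex
    return None


if __name__ == "__main__":
    # A raw 0x8e byte is not valid UTF-8 on its own.
    print(describe_failure(b'<tag k="name" v="Cin\x8e Rex"/>'))
    # A correctly encoded U+FFFD (0xEF 0xBF 0xBD) for comparison.
    print(describe_failure('<tag k="name" v="Cin\ufffd Rex"/>'.encode("utf-8")))
```

A correctly encoded U+FFFD is a legal XML character, so if the parse fails, the cause is more likely raw invalid bytes in the file than the replacement character itself.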

OSM data?
