Python UTF-8 XML parsing (SUDS): removing the 'invalid token'

Posted on 2024-12-23 18:40:01

Here's a common error when dealing with UTF-8: 'invalid token'.

In my example, it comes from a SOAP service provider that had no respect for Unicode characters. It simply truncates values to 100 bytes, ignoring that the 100th byte may fall in the middle of a multi-byte character. For example:

<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>

The last two bytes are what remains of a 3-byte UTF-8 character after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser, and:

xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)

I don't care about this character anymore. It should be removed from the document so that the SAX parser can do its job.

The XML reply is valid in every other respect except for these values.
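
The failure can be reproduced without the SOAP service. The following is a minimal sketch, assuming Python 3 and the standard-library `xml.sax` module rather than the parser SUDS drives internally; the sample value and the `parse_error` helper are hypothetical and exist only for illustration.

```python
import xml.sax

# Hypothetical value ending in a 3-byte character (U+FF09, encoded as EF BC 89).
value = "图、视频）".encode("utf-8")

# Byte-oriented truncation that cuts off the last byte of the final character,
# leaving the dangling bytes \xef\xbc seen in the reply above.
truncated = value[:-1]

def parse_error(data):
    """Return the SAX error message for data, or None if it parses cleanly."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()
    return None

broken_doc = b"<name>" + truncated + b"</name>"
intact_doc = b"<name>" + value + b"</name>"

print(parse_error(broken_doc))  # an error message from the parser
print(parse_error(intact_doc))  # None
```

Only the two dangling bytes make the difference between the two documents, which is why stripping them is enough to get the parser going again.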

Question: How do you remove this character without parsing the entire document and re-inventing UTF-8 encoding to check every byte?

Using: Python+SUDS

可可 2024-12-30 18:40:01

It turns out that SUDS handles the XML as type 'str' (not 'unicode'), so these are encoded byte values.

1) The FILTER:

badXML = "your bad utf-8 xml here"  # type <str> (Python 2 byte string)

# Decode to a Python unicode string; 'ignore' drops bytes that are not valid UTF-8.
# Passed positionally because str.decode() accepts keyword arguments only from Python 2.7.
decoded = badXML.decode('utf-8', 'ignore')  # type <unicode>

# Encode back to a UTF-8 byte string.
goodXML = decoded.encode('utf-8')  # type <str>
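
The filter above is Python 2 code. For readers on Python 3, where the reply would be a `bytes` object, the same round trip can be sketched as follows. The function name `strip_invalid_utf8` and the shortened sample reply are assumptions made for illustration, not part of SUDS.

```python
def strip_invalid_utf8(raw: bytes) -> bytes:
    """Drop byte sequences that are not valid UTF-8 and return clean UTF-8 bytes."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")

# Hypothetical reply fragment: valid text followed by a truncated 3-byte character.
prefix = b'<name xsi:type="xsd:string">' + "视频".encode("utf-8")
bad_xml = prefix + b"\xef\xbc" + b"</name>"

good_xml = strip_invalid_utf8(bad_xml)
print(good_xml == prefix + b"</name>")  # True
```

Using `errors="ignore"` silently discards the damaged character. If a visible marker is preferable, `errors="replace"` substitutes U+FFFD instead, which is also well-formed XML content.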

2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin

from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # context.reply holds the raw reply as a byte string; drop invalid UTF-8 bytes.
        decoded = context.reply.decode('utf-8', 'ignore')
        reencoded = decoded.encode('utf-8')
        context.reply = reencoded

and

from suds.client import Client
client = Client(WSDL_url, plugins=[UnicodeFilter()])

Hope this helps someone.

Note: Thanks to John Machin!

See: Why is python decode replacing more than the invalid bytes from an encoded string?

Python issue8271, regarding errors='ignore', can get in your way here. In Python versions without the fix, 'ignore' consumes as many following bytes as the start byte announces, even if those bytes are not continuation bytes. The fix is described as follows:

during the decoding of an invalid UTF-8 byte sequence, only the
start byte and the continuation byte(s) are now considered invalid,
instead of the number of bytes specified by the start byte

Issue was fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future releases of 2.7)
Python 3.1.3 rc1 (and all future releases of 3.x)

Python 2.5 and earlier still have this issue.

In the example above, "\xef\xbc</name".decode('utf-8', 'ignore') should return "</name", but in 'bugged' versions of Python it returns "/name".

The first four bits (0xe) announce a 3-byte UTF-8 character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed.

0x3c is not a valid continuation byte, which is what makes the 3-byte sequence invalid in the first place.

Fixed versions of Python remove only the start byte and any valid continuation bytes that follow it, leaving 0x3c unconsumed.
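
The byte-level reasoning above can be checked by hand. This is a small sketch, assuming Python 3 (which contains the fix); the helpers `expected_length` and `is_continuation` are hypothetical names written for this illustration and are not part of the standard library.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 start byte (0 if it cannot start one)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx
    return 0      # continuation byte or invalid start byte

def is_continuation(byte: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF)."""
    return (byte & 0xC0) == 0x80

data = b"\xef\xbc</name"

print(expected_length(data[0]))   # 3: 0xef announces a 3-byte character
print(is_continuation(data[1]))   # True: 0xbc is a valid continuation byte
print(is_continuation(data[2]))   # False: 0x3c ('<') is not

# A fixed decoder drops only 0xef and 0xbc and keeps the '<'.
print(data.decode("utf-8", errors="ignore"))  # </name
```

A decoder with the bug would instead trust the announced length of 3 and swallow the '<' as well, which corrupts the closing tag.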

坏尐絯℡ 2024-12-30 18:40:01

@FlipMcF's is the correct answer. I'm just posting my own filter for his solution, because the original one didn't work for me (I had some emoji characters in my XML which were correctly encoded in UTF-8, but they still crashed XML parsers):

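from StringIO import StringIO  # Python 2 module

from lxml import etree
from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # recover=True is important here: lxml keeps parsing past broken content
        parser = etree.XMLParser(recover=True)
        doc = etree.parse(StringIO(context.reply), parser)
        # Serialize the recovered tree and hand it back to SUDS
        context.reply = etree.tostring(doc)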
