Python UTF-8 XML parsing (SUDS): removing 'invalid token'
Here's a common error when dealing with UTF-8: 'invalid token'.
In my example, it comes from dealing with a SOAP service provider that had no respect for Unicode characters. It simply truncated values to 100 bytes, neglecting that the 100th byte may fall in the middle of a multi-byte character. For example:
<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>
The last two bytes are what remains of a 3-byte Unicode character, after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser and:
xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)
I don't care about this character anymore. It should be removed from the document so that the SAX parser can run.
The XML reply is valid in every other respect except for these values.
Question: How do you remove this character without parsing the entire document and re-inventing UTF-8 encoding to check every byte?
Using: Python+SUDS
Turns out, SUDS sees XML as type 'string' (not unicode), so these are encoded values.
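1) The FILTER: decode the reply as UTF-8 with errors='ignore', then re-encode it, so the truncated bytes are dropped.
2) SUDS: apply the filter in a MessagePlugin (see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin) and register the plugin with the client.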
Hope this helps someone.
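The answer's own code is not reproduced in this text, so here is a minimal sketch of the two steps, written for Python 3. The names drop_invalid_utf8, UnicodeFilter and WSDL_URL are hypothetical. It assumes the suds MessagePlugin hook received(), which gets a context whose reply attribute holds the raw reply before it is SAX parsed; the fallback base class exists only so the sketch runs where suds is not installed.

```python
try:
    from suds.plugin import MessagePlugin
except ImportError:
    # suds is not installed; use a stand-in base class so the sketch still runs.
    MessagePlugin = object


def drop_invalid_utf8(raw):
    """Return the byte string with sequences that are not valid UTF-8 removed."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


class UnicodeFilter(MessagePlugin):
    """Hypothetical plugin: cleans the raw reply before the parser sees it."""

    def received(self, context):
        # Assumption: context.reply is the encoded (byte string) reply.
        context.reply = drop_invalid_utf8(context.reply)


# Registering the plugin (WSDL_URL is a placeholder, not a real endpoint):
# from suds.client import Client
# client = Client(WSDL_URL, plugins=[UnicodeFilter()])
```

Filtering the raw bytes before parsing avoids walking the document by hand; the codec's error handler does the byte-level checking.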
Note: Thanks to John Machin!
See: Why is python decode replacing more than the invalid bytes from an encoded string?
Python issue8271, regarding errors='ignore', can get in your way here. Without this bug fixed in Python, 'ignore' will consume the next few bytes to satisfy the length.
The issue was fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future releases of 2.7)
Python 3.1.3 rc1 (and all future releases of 3.x)
Python 2.5 and below will contain this issue.
In the example above, "\xef\xbc</name".decode('utf-8', errors='ignore') should return "</name", but in 'bugged' versions of Python it returns "/name".
The first four bits (0xe) describe a 3-byte UTF character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed. 0x3c is not a valid continuation byte, which is what makes the 3-byte UTF character invalid in the first place.
Fixed versions of Python only remove the first byte and any valid continuation bytes, leaving 0x3c unconsumed.
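A quick way to see which behaviour an interpreter has is to run the example from this answer directly. This is a small sketch for Python 3, where the input has to be a bytes literal; the helper name ignore_handler_is_fixed is made up for illustration.

```python
def ignore_handler_is_fixed():
    """True if errors='ignore' leaves the '<' after the truncated character alone."""
    data = b"\xef\xbc</name"
    return data.decode("utf-8", errors="ignore") == "</name"


print(ignore_handler_is_fixed())
```

On an interpreter with the issue, the same check would come back False because the '<' is swallowed along with the two leftover bytes.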
@FlipMcF's answer is the correct one. I'm just posting my filter for his solution, because the original one didn't work for me (I had some emoji characters in my XML, which were correctly encoded in UTF-8, but they still crashed XML parsers):
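The filter from this answer did not survive in the text, so the following is only a guess at the kind of filter meant, not the poster's code. It is a sketch for Python 3 that drops invalid UTF-8, removes characters outside the XML 1.0 Char range, and optionally removes characters outside the Basic Multilingual Plane (which covers most emoji) for parsers that choke on them. The name clean_xml_bytes and the strip_non_bmp flag are hypothetical.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Characters outside the Basic Multilingual Plane, e.g. most emoji.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def clean_xml_bytes(raw, strip_non_bmp=False):
    """Return UTF-8 bytes that are safe to hand to an XML 1.0 parser."""
    text = raw.decode("utf-8", errors="ignore")  # drop truncated sequences
    text = _XML_ILLEGAL.sub("", text)            # drop e.g. NUL, U+FFFE
    if strip_non_bmp:
        text = _NON_BMP.sub("", text)            # assumption: parser rejects these
    return text.encode("utf-8")
```

Such a function could be called from the plugin's received() hook in place of the plain decode and re-encode step.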