Python minidom and UTF-8 encoded XML with character hash references
I am experiencing some difficulty in my home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".
gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw format (i.e. bytes C3 A6 for the special character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).
I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as being UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the question (I think).
Anyway, I guess gSOAP is probably obeying transport rules, or what?
When I parse the request from gSOAP in Python with xml.dom.minidom.parseString() I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8 character codes. It unescapes the character hash references, but does not decode the string afterwards. In the end I have a unicode string object whose characters are the UTF-8 encoded bytes.
So if the string "æble" is contained in the XML, it comes like this in the request:
"&#195;&#166;ble"
After parsing the XML the unicode string in the DOM Text Node's data member looks like this:
u'\xc3\xa6ble'
I would expect it to look like this:
u'\xe6ble'
What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?
Thanks in advance.
Best regards Jakob Simon-Gaarde
&#195;&#166;ble is actually Ã¦ble. To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.
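The difference between the two requests can be reproduced with a short sketch. It is written for Python 3, where every str is what the thread calls a unicode object; the helper name text_of and the one-element test documents are made up for illustration.

```python
from xml.dom import minidom


def text_of(xml_bytes):
    """Parse a one-element document and return the text of its root element."""
    doc = minidom.parseString(xml_bytes)
    return doc.documentElement.firstChild.data


# What the request contains: one reference per UTF-8 byte of "æ".
double_escaped = text_of(
    b'<?xml version="1.0" encoding="UTF-8"?><s>&#195;&#166;ble</s>')

# What the request should contain: one reference for the code point U+00E6.
correct = text_of(
    b'<?xml version="1.0" encoding="UTF-8"?><s>&#230;ble</s>')

print(ascii(double_escaped))  # two characters, U+00C3 and U+00A6, before "ble"
print(ascii(correct))         # one character, U+00E6, before "ble"
```

The parser is behaving correctly in both cases: a numeric character reference always names a Unicode code point, never a byte of some encoding.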
Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html
However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...
Your example character is LATIN SMALL LIGATURE AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.
However it appears that it is encoding to UTF-8 first and then doing the escaping.
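The arithmetic above can be checked directly. This is a small Python 3 sketch; the variable names are illustrative, and the second expression only imitates the "encode first, then escape" order that the sender appears to use, it is not taken from gSOAP.

```python
ch = '\xe6'  # the character "æ", code point U+00E6

utf8_bytes = ch.encode('utf-8')  # the two bytes C3 A6

# Escaping the code point gives a single reference.
correct_ref = '&#%d;' % ord(ch)

# Escaping each UTF-8 byte separately gives two references, which a parser
# must read as two different characters.
double_escaped_ref = ''.join('&#%d;' % byte for byte in utf8_bytes)

print(list(utf8_bytes), correct_ref, double_escaped_ref)
```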
What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.
On the receiving end, please show us the repr() of the raw XML that you receive.
Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""
It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.
(1) unescaping "&lt;" to "<" would blow up
(2) what would you unescape "&#256;" to? "\xc4\x80"?
(3) how could it unescape at all if the encoding was UTF-16xx?
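Points (1) and (3) can be illustrated with a rough Python 3 sketch. The sample document is invented, and html.unescape from the standard library stands in for a generic "unescape everything first" step; it is used here only to show the failure.

```python
import html
from xml.dom import minidom
from xml.parsers.expat import ExpatError

raw = b'<s>1 &lt; 2 and &#230;ble</s>'

# Correct order: the parser decodes the bytes, then resolves the references.
parsed = minidom.parseString(raw).documentElement.firstChild.data

# Wrong order: unescape first, then parse. The bare "<" that appears in the
# text is no longer well-formed XML, so the parser rejects the document.
unescaped = html.unescape(raw.decode('utf-8')).encode('utf-8')
try:
    minidom.parseString(unescaped)
    blew_up = False
except ExpatError:
    blew_up = True

# In UTF-16 every character takes two bytes, so the ASCII byte sequence of
# the reference is not even present in the undecoded data.
utf16 = '<s>&#230;ble</s>'.encode('utf-16')
found_in_utf16 = b'&#230;' in utf16

print(repr(parsed), blew_up, found_in_utf16)
```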
Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes. After doing this I parse the XML and I get the expected result.
Still I would like to know what you think, and if it is a good solution. Also, I wrote the function because I couldn't find one in the standard Python modules that does the job. Does such a function exist?
Best regards
Jakob Simon-Gaarde
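The poster's function is not shown in the thread, so the following is only a guess at one possible shape for such a workaround, written for Python 3. The name unescape_byte_refs is hypothetical. It turns references in the range 128 to 255 back into raw bytes and leaves everything else alone, so escaped markup characters such as &#60; stay escaped. It assumes that every such reference in the payload is really a UTF-8 byte; a correctly escaped &#230; from a conforming sender would be turned into an invalid UTF-8 byte and break the parse.

```python
import re
from xml.dom import minidom

# Decimal numeric character references such as &#195;
_BYTE_REF = re.compile(rb'&#(\d+);')


def unescape_byte_refs(raw):
    """Replace &#128; .. &#255; with the single raw byte of that value."""
    def repl(match):
        value = int(match.group(1))
        if 128 <= value <= 255:
            return bytes([value])
        # ASCII references (for example &#60; for "<") and anything above
        # 255 are left for the XML parser to resolve.
        return match.group(0)
    return _BYTE_REF.sub(repl, raw)


raw = b'<s>&#195;&#166;ble &#60; 2</s>'
fixed = unescape_byte_refs(raw)
text = minidom.parseString(fixed).documentElement.firstChild.data
print(fixed, ascii(text))
```

Python 3's html.unescape does resolve numeric references, but it maps each one to a code point rather than a byte, so on its own it would reproduce the same two-character result as the parser.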
Note that what we have is the unicode object u'\xc3\xa6' and what we really want is the string object '\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec.
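A minimal sketch of that repair after parsing, written for Python 3 (where the "string object" is a bytes object) and using illustrative variable names:

```python
# Text as it comes out of the parser: the UTF-8 bytes of "æ" as two characters.
mojibake = '\xc3\xa6ble'

# raw_unicode_escape maps each code point below 256 to the byte with the
# same value, which recovers the original UTF-8 byte sequence.
raw_bytes = mojibake.encode('raw_unicode_escape')

# Decoding those bytes as UTF-8 gives the intended text.
repaired = raw_bytes.decode('utf-8')

print(raw_bytes, ascii(repaired))
```

This only works while every character in the parsed text is below U+0100. For anything higher, raw_unicode_escape emits a \uXXXX escape sequence instead of a byte, so the result would no longer be the original data.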
Unless someone can tell me that gSOAP is not producing validly encoded SOAP XML (see http://pastebin.com/raw.php?i=9NS7vCMB or the codeblock below), I see no other solution than to unescape character hash references before parsing the XML.
Of course as John Machin has pointed out, I cannot unescape XML control characters like "<" and ">".
/ Jakob