Python minidom and UTF-8 encoded XML with character hash references

Posted on 2024-10-11 05:24:46

I am experiencing some difficulty in my home project, where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".

gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw format (i.e. bytes C3A6 for the special character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).

I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).

Anyway, I guess gSOAP is probably obeying transport rules, or what?

When I parse the request from gSOAP in Python with xml.dom.minidom.parseString(), I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8 character codes. It unescapes the character hash references, but does not decode the string afterwards. In the end I have a unicode string object whose code points are really UTF-8 bytes.

So if the string "æble" is contained in the XML, it comes like this in the request:

"&#195;&#166;ble"

After parsing the XML the unicode string in the DOM Text Node's data member looks like this:

u'\xc3\xa6ble'

I would expect it to look like this:

u'\xe6ble'
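
The behaviour described above can be reproduced with a few lines. This is only a sketch: it is written for Python 3 (the thread itself uses Python 2, where the result would be shown as u'\xc3\xa6ble'), and the one-element document is a made-up stand-in for the real gSOAP request.

```python
from xml.dom import minidom

# Hypothetical minimal document standing in for the gSOAP request body.
raw = b'<?xml version="1.0" encoding="UTF-8"?><name>&#195;&#166;ble</name>'

doc = minidom.parseString(raw)
data = doc.documentElement.firstChild.data

# Each numeric reference is resolved to the code point with that number,
# so the text holds U+00C3 and U+00A6 instead of the single U+00E6.
print(ascii(data))
print(len(data))
```

The parser treats 195 and 166 as code point numbers, not as bytes, which is why no UTF-8 decoding happens afterwards.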

What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?

Thanks in advance.

Best regards Jakob Simon-Gaarde

凉月流沐 2024-10-18 05:24:46

&#195;&#166;ble is actually Ã¦ble.

To get the expected unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.

念三年u 2024-10-18 05:24:46

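Here is how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html

However, the main problem is what you and/or this "gSOAP" (URL, please) are doing ...

Your example character is LATIN SMALL LETTER AE (U+00E6). As you say, encoded in UTF-8 this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.

However, it appears that it is encoding to UTF-8 first and then doing the escaping.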
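
The arithmetic can be checked directly. The sketch below (Python 3) builds both forms of the reference; the "encode first, then escape each byte" path is only a guess at what the sender does, not something confirmed from gSOAP itself.

```python
ch = "\u00e6"  # LATIN SMALL LETTER AE

# Correct escaping: one reference carrying the Unicode code point.
correct_ref = "&#%d;" % ord(ch)

# Suspected behaviour (an assumption): encode to UTF-8 first,
# then escape every resulting byte as if it were a code point.
utf8_bytes = ch.encode("utf-8")
double_ref = "".join("&#%d;" % b for b in utf8_bytes)

print(correct_ref)
print(utf8_bytes)
print(double_ref)
```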

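What you need to do is show us, in detail, the code you are using, along with each str and unicode object involved. Please also provide the documentation for the gSOAP API that you are using.

On the receiving end, show us the repr() of the raw XML that you receive.

Edit in response to a comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""

It (and any other XML parser) {doesn't, can't in general, and mustn't} unescape numeric character references or predefined character entities before decoding.

(1) Unescaping "&#60;" to "<" would blow up.

(2) What would you unescape "&#256;" to? "\xc4\x80"?

(3) If the encoding were UTF-16xx, how could it unescape at all?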
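
Point (1) is easy to demonstrate. In this Python 3 sketch the document and the naive pre-pass are both invented for illustration: the parser handles "&#60;" fine on its own, while unescaping it beforehand leaves a bare "<" that is no longer well-formed XML.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

raw = b"<a>1 &#60; 2</a>"  # hypothetical payload

# Parsed as-is, the reference is resolved after the markup is tokenized.
parsed = minidom.parseString(raw).documentElement.firstChild.data

# Hypothetical naive pre-pass: turn every &#NNN; into the byte NNN
# (only valid here because 60 is below 256).
naive = re.sub(rb"&#(\d+);", lambda m: bytes([int(m.group(1))]), raw)

try:
    minidom.parseString(naive)
    broke = False
except ExpatError:
    broke = True  # the bare "<" is read as the start of a tag

print(parsed, naive, broke)
```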

农村范ル 2024-10-18 05:24:46

Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes:

import re

def unescape_hash_char(req):
  # Python 2: req is a byte string (str); each &#NNN; becomes the single byte NNN.
  pat = re.compile(r'&#(\d+);')
  # split() with a capturing group alternates plain text and captured numbers.
  parts = pat.split(req)
  ret = ''
  for i, p in enumerate(parts):
    if i % 2:
      ret += chr(int(p))  # odd index: captured number (must be below 256)
    else:
      ret += p            # even index: untouched text
  return ret

After doing this I parse the XML and I get the expected result.

Still, I would like to know what you think, and whether it is a good solution. Also, I wrote the function because I couldn't find one to do the job in the standard Python modules. Does such a function exist?
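
One way to make a pre-pass like this less risky is to leave ASCII references such as "&#60;" for the parser and rewrite only values in the range 128 to 255. The sketch below is for Python 3, works on bytes, and uses a hypothetical function name; it still assumes the sender never emits a genuine code point reference such as "&#230;", which would be turned into an invalid lone byte.

```python
import re
from xml.dom import minidom

_REF = re.compile(rb"&#(\d+);")

def unescape_high_byte_refs(raw: bytes) -> bytes:
    """Rewrite &#NNN; as the raw byte NNN, but only for 128 <= NNN <= 255."""
    def repl(match):
        n = int(match.group(1))
        if 0x80 <= n <= 0xFF:
            return bytes([n])
        return match.group(0)  # leave ASCII and larger references alone
    return _REF.sub(repl, raw)

fixed = unescape_high_byte_refs(b"<name>&#195;&#166;ble &#60; pear</name>")
text = minidom.parseString(fixed).documentElement.firstChild.data
print(fixed)
print(text)
```

Because "&#60;" is left in place, the markup stays well-formed and the parser resolves it as usual.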

Best regards
Jakob Simon-Gaarde

夜声 2024-10-18 05:24:46

Note that

In [5]: u'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'

So what we have is the unicode object u'\xc3\xa6', and what we really want is the string object '\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec:

In [1]: text=u'\xc3\xa6'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6'

In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6'

In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æ
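
The same repair can be sketched for Python 3. Note that this swaps in the latin-1 codec instead of raw-unicode-escape: latin-1 maps code points 0 to 255 straight to bytes and fails loudly on anything larger, which makes it easy to leave already-correct text alone. The helper name is made up for illustration.

```python
def fix_double_encoded(text: str) -> str:
    """Reinterpret code points 0-255 as bytes and decode them as UTF-8.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_double_encoded("\xc3\xa6ble"))
```

Applied to the data of each DOM text node after parsing, this avoids touching the raw XML at all.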

欢烬 2024-10-18 05:24:46

Unless someone can tell me that gSOAP is not producing validly encoded SOAP XML (see http://pastebin.com/raw.php?i=9NS7vCMB or the code block below), I see no other solution than to unescape character hash references before parsing the XML.

Of course, as John Machin has pointed out, I cannot unescape XML control characters like "<" and ">".

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>

/ Jakob
