Python's ElementTree replaces ampersands that are part of numeric character references

Posted on 2025-01-05 19:45:56

I'm using Python's elementtree module for writing some XML (I'm using Python 2.7 and 3.2). The text fields of some of my elements contain numeric character references.

However, once I use ElementTree's tostring, all ampersands in the character references are replaced by &amp;. Apparently ElementTree or the underlying parser does not recognise that the ampersands here are part of a numeric character reference.

After some searching I found this: elementtree and entities

However, I'm not keen on this either, as in my current code I foresee that this may end up causing problems of its own. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?

The following simple test code illustrates the problem (tested using Python 2.7 and 3.2):

import sys
import xml.etree.ElementTree as ET

def main():
    # Text string that contains numeric character reference
    someText = "Str&#246;m"

    # Create element object
    testElement = ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text = someText

    # Convert element to xml-formatted text string 
    testElementAsString = ET.tostring(testElement,'ascii', 'xml')

    print(testElementAsString)

    # Result: ampersand replaced with '&amp;': <rubbish>Str&amp;#246;m</rubbish>

main()

If anyone has any ideas or suggestions that would be great!

Comments (2)

海未深 2025-01-12 19:45:56

You need to decode the character references in your input. Here's a function that will decode both numeric character references and html named references; it accepts a byte string as input and returns unicode. The code below works for Python 2.7 or 3.x.

import re
try:
    # Python 2.x
    from htmlentitydefs import name2codepoint
except ImportError:
    # Must be Python 3.x
    from html.entities import name2codepoint
    unichr = chr

# Work on a copy so the module-level table is not modified
name2codepoint = name2codepoint.copy()
# 'apos' is predefined in XML but is not an HTML 4 entity, so add it
name2codepoint['apos'] = ord("'")

# Group 1: decimal reference, group 2: hexadecimal reference, group 3: named entity
EntityPattern = re.compile(r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown named entity: leave the text unchanged
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

someText = decodeEntities(b"Str&#246;m")
print(someText)

Of course, if you can avoid getting the character reference in the string to begin with that will make your life somewhat easier.
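
As a rough sketch of that last point, assuming Python 3.4 or later: the standard library's html.unescape decodes numeric and named references in a str, and ElementTree then writes the numeric reference again by itself when serializing to ASCII. The element name and input string come from the question; the helper name to_ascii_xml is made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

def to_ascii_xml(tag, raw_text):
    """Hypothetical helper: decode references in raw_text, then serialize."""
    element = ET.Element(tag)
    # html.unescape turns "&#246;" into the real character
    element.text = html.unescape(raw_text)
    # Serializing to ASCII makes ElementTree write non-ASCII characters
    # as numeric character references
    return ET.tostring(element, encoding='ascii', method='xml').decode('ascii')

result = to_ascii_xml('rubbish', 'Str&#246;m')
print(result)
```

Because the element text holds the real character rather than the reference, there is no ampersand left for ElementTree to escape.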

盛装女皇 2025-01-12 19:45:56

Short update to the above: I just had another critical look at my code, and realised there's an even simpler solution (largely based on @Duncan's answer) that at least works for me.

In my original code I was using the entity references in order to get an ASCII representation of some ISO-8859-15 (Latin-9) encoded text (which I was reading from a binary file). So the someText variable above actually started its life as a bytes object, which was subsequently decoded from ISO-8859-15 to text, and finally transformed to ASCII.

Thanks to @Duncan and @Inerdial I now know that ElementTree can do the conversion to ASCII by itself: when serializing with an ASCII encoding, it writes non-ASCII characters as numeric character references. After some experimenting I managed to come up with a solution that is stupidly simple to the extent of being almost trivial. However, I imagine that it just might be useful to some, so I decided to share it here anyway:

import sys
import xml.etree.ElementTree as ET

def main():
    # Bytes object
    someBytes=b'Str\xf6m'

    # Decode the ISO-8859-15 (Latin-9) bytes to text
    someText=someBytes.decode('iso-8859-15','strict')

    # Create element object
    testElement=ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text=someText

    # Convert element to xml-formatted text string 
    testElementAsString=ET.tostring(testElement,'ascii', 'xml').decode('ascii')

    print(testElementAsString)

main()

Note that I added the final .decode("ascii") in order to make this work with Python 3 (where, unlike Python 2.7, ET.tostring returns a bytes object, so testElementAsString would otherwise be bytes rather than text).
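
For anyone who wants to see the bytes versus text difference directly, here is a small sketch for Python 3 only. It reuses the bytes from the code above; the variable names are just for illustration. It also parses the ASCII output back in, to check that the character reference round-trips to the original text.

```python
import xml.etree.ElementTree as ET

element = ET.Element('rubbish')
element.text = b'Str\xf6m'.decode('iso-8859-15')

# encoding='ascii' yields bytes; non-ASCII characters become character references
as_bytes = ET.tostring(element, encoding='ascii')
# encoding='unicode' (Python 3 only) yields str and keeps the character itself
as_text = ET.tostring(element, encoding='unicode')

# Parsing the ASCII output restores the original text
round_trip = ET.fromstring(as_bytes).text

print(type(as_bytes).__name__, type(as_text).__name__)
print(round_trip == element.text)
```

If the result only has to be text inside the same Python 3 program, encoding='unicode' avoids the extra .decode("ascii") step, but it does not produce the ASCII-only output that the character references give.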

Thanks again to @Duncan, @Inerdial and @Tomalak for pointing me in the right direction, and @Rik Poggi for correcting the formatting in my original post!
