Python's ElementTree replaces ampersands that are part of numeric character references

Posted 2025-01-05 19:45:56

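I'm using Python's ElementTree module (xml.etree.ElementTree) to write some XML, on both Python 2.7 and 3.2. The text fields of some of my elements contain numeric character references.

However, once I call ElementTree's tostring, every ampersand in those character references is replaced by &amp;. Apparently ElementTree does not recognise that the ampersands here are part of a numeric character reference and treats them as literal text.

After some searching I found this: "elementtree and entities"

However, I'm not keen on that approach either, as I foresee it causing problems of its own in my current code. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?

The following simple test code illustrates the problem (tested with Python 2.7 and 3.2):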

import sys
import xml.etree.ElementTree as ET

def main():
    # Text string that contains numeric character reference
    someText = "Str&#246;m"

    # Create element object
    testElement = ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text = someText

    # Convert element to xml-formatted text string 
    testElementAsString = ET.tostring(testElement,'ascii', 'xml')

    print(testElementAsString)

    # Result: ampersand replaced with '&amp;': <rubbish>Str&amp;#246;m</rubbish>

main()

If anyone has any ideas or suggestions, that would be great!

海未深 2025-01-12 19:45:56

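You need to decode the character references in your input before handing the text to ElementTree. Here's a function that decodes both numeric character references and HTML named references; it accepts a byte string as input and returns unicode. The code below works on Python 2.7 and 3.x.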

import re
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # Must be Python 3.x
    from html.entities import name2codepoint
    unichr = chr

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

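# Raw string so that \d is not treated as a string escape; names may contain digits (e.g. frac12)
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z\d]*));')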

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

someText = decodeEntities(b"Str&#246;m")
print(someText)

Of course, if you can avoid getting the character references into the string to begin with, that will make your life somewhat easier.
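
If only Python 3.4 or later needs to be supported, the standard library's html.unescape can stand in for the hand-written regex. The sketch below is an alternative under that assumption, not part of the original answer; the helper name element_with_decoded_text is made up for illustration. Note that html.unescape follows HTML5 rules, so it is somewhat more lenient than the pattern above.

```python
import html
import xml.etree.ElementTree as ET


def element_with_decoded_text(tag, raw_text):
    """Hypothetical helper: build an element after decoding character references."""
    element = ET.Element(tag)
    # html.unescape turns "&#246;" (and named references such as "&ouml;")
    # into the actual character, so ElementTree never sees a literal ampersand.
    element.text = html.unescape(raw_text)
    return element


element = element_with_decoded_text("rubbish", "Str&#246;m")

# With an ASCII target encoding, ElementTree writes the non-ASCII character
# back out as a numeric character reference.
serialized = ET.tostring(element, encoding="us-ascii").decode("ascii")
print(serialized)  # <rubbish>Str&#246;m</rubbish>
```

The encoding is spelled "us-ascii" here because ElementTree omits the XML declaration for that name, which keeps the output to the element alone.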

盛装女皇 2025-01-12 19:45:56

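Short update to the above: I just had another critical look at my code, and realised there's an even simpler solution (largely based on @Duncan's answer) that at least works for me.

In my original code I was using the character references to get an ASCII representation of some ISO 8859-15 (Latin-9) encoded text, which I was reading from a binary file. So the someText variable above actually started its life as a bytes object, which was subsequently decoded to text and finally transformed to ASCII.

Thanks to @Duncan and @Inerdial I now know that ElementTree can do the conversion to ASCII by itself. After some experimenting I came up with a solution that is simple to the point of being almost trivial. However, I imagine it might be useful to some, so I decided to share it here anyway: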

import sys
import xml.etree.ElementTree as ET

def main():
    # Bytes object
    someBytes=b'Str\xf6m'

    # Decode the ISO 8859-15 (Latin-9) bytes to text
    someText=someBytes.decode('iso-8859-15','strict')

    # Create element object
    testElement=ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text=someText

    # Convert element to xml-formatted text string 
    testElementAsString=ET.tostring(testElement,'ascii', 'xml').decode('ascii')

    print(testElementAsString)

main()

Note that I added the final .decode('ascii') to make this work with Python 3, where tostring returns a bytes object (in Python 2.7 it returns a str).
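
As a quick sanity check (a sketch for Python 3, not part of the original post), the character reference that ElementTree writes can be parsed back to confirm that it stands for the same character. The variable names below are illustrative only.

```python
import xml.etree.ElementTree as ET

element = ET.Element("rubbish")
element.text = b"Str\xf6m".decode("iso-8859-15")

# Serialise to ASCII: the o-umlaut is written as the reference &#246;
ascii_xml = ET.tostring(element, encoding="us-ascii")

# Parse it back: the reference is resolved to the original character
round_tripped = ET.fromstring(ascii_xml)
print(round_tripped.text == element.text)  # True
```

In other words, the reference exists only in the serialised bytes; the in-memory text always holds the real character.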

Thanks again to @Duncan, @Inerdial and @Tomalak for pointing me in the right direction, and to @Rik Poggi for correcting the formatting in my original post!
