Replacing ampersands that are part of numeric character references with Python's ElementTree
I'm using Python's ElementTree module to write some XML (with Python 2.7 and 3.2). The text fields of some of my elements contain numeric character references.
However, once I call ElementTree's tostring, every ampersand in the character references is replaced by &amp;. Apparently neither ElementTree nor the underlying parser recognises that the ampersands here are part of a numeric character reference.
After some searching I found this: elementtree and entities
However, I'm not keen on this either, as in my current code I foresee that this may end up causing problems of its own. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?
The following simple test code illustrates the problem (tested using Python 2.7 and 3.2):
import xml.etree.ElementTree as ET

def main():
    # Text string that contains a numeric character reference
    someText = "Str&#246;m"
    # Create element object
    testElement = ET.Element('rubbish')
    # Assign someText to the element's text attribute
    testElement.text = someText
    # Serialise the element to an XML-formatted string
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml')
    print(testElementAsString)
    # Result: ampersand replaced with '&amp;': <rubbish>Str&amp;#246;m</rubbish>

main()
If anyone has any ideas or suggestions that would be great!
Comments (2)
You need to decode the character references in your input. Here's a function that will decode both numeric character references and html named references; it accepts a byte string as input and returns unicode. The code below works for Python 2.7 or 3.x.
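A minimal sketch of such a decoder, written from the description above rather than taken from the answer itself. The names decode_references and _REFERENCE are hypothetical, and the choice to leave unknown named references untouched is an assumption:

```python
import re
import xml.etree.ElementTree as ET

try:
    from html.entities import name2codepoint  # Python 3
    unichr = chr
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2.7

# Matches &#246; (decimal), &#xF6; (hexadecimal) and &ouml; (named) references
_REFERENCE = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));')

def decode_references(data, encoding='utf-8'):
    """Decode a byte string and replace character references with the characters they denote."""
    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return unichr(int(decimal))
        if hexadecimal:
            return unichr(int(hexadecimal, 16))
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)  # unknown name: leave the text as it was
    return _REFERENCE.sub(replace, data.decode(encoding))

# Usage: decode first, then let ElementTree do the escaping on output
testElement = ET.Element('rubbish')
testElement.text = decode_references(b"Str&#246;m")
print(ET.tostring(testElement, 'ascii', 'xml'))
```

Because the element text now holds the real character instead of the literal reference, serialising with an ASCII encoding writes it back out as &#246; with a single, unescaped ampersand.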
Of course, if you can avoid getting the character reference in the string to begin with that will make your life somewhat easier.
Short update to the above: I just had another critical look at my code, and realised there's an even simpler solution (largely based on @Duncan's answer) that at least works for me.
In my original code I was using the entity references in order to get an ASCII representation of some Latin-15 encoded text (which I was reading from a binary file). So the someText variable above actually started its life as a bytes object, which was subsequently decoded to Latin-15 text, and finally transformed to ASCII. Thanks to @Duncan and @Inerdial I now know that ElementTree can do the Latin-15 to ASCII conversion by itself. After some experimenting I managed to come up with a solution that is simple to the point of being almost trivial. However, I imagine that it just might be useful to some, so I decided to share it here anyway:
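A sketch of what that solution could look like, reconstructed from the description rather than copied from the post. The sample bytes are hypothetical, and reading "Latin-15" as Python's iso-8859-15 codec is an assumption:

```python
import xml.etree.ElementTree as ET

def main():
    # Bytes as they might be read from a binary file (hypothetical sample value)
    rawBytes = b"Str\xf6m"
    # Decode to text; 'iso-8859-15' is an assumption about what "Latin-15" means
    someText = rawBytes.decode("iso-8859-15")
    # Create element object and assign the decoded text
    testElement = ET.Element('rubbish')
    testElement.text = someText
    # Serialising as ASCII turns non-ASCII characters into numeric character references
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml').decode("ascii")
    return testElementAsString

print(main())
```

With this approach no character reference is ever placed in the element text by hand, so there is no ampersand for ElementTree to escape.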
Note that I added the final .decode("ascii") in order to make this work with Python 3 (which, unlike Python 2.7, returns testElementAsString as a bytes object). Thanks again to @Duncan, @Inerdial and @Tomalak for pointing me in the right direction, and @Rik Poggi for correcting the formatting in my original post!