Preserving escape characters in Python XML parsing

Posted on 2024-09-29 06:13:01

I'm trying to write a Python script that takes in one or two XML files and outputs one or two new files based on the contents of the input files. I was trying to write this script using the minidom module. However, the input files contain a number of instances of the escape character &#xa; inside node attributes. Unfortunately, in the output files, these characters have been converted to different characters, which seem to be newline characters.

For example, a line in the input file such as:

<Entry text="For English For Hearing Impaired&#xa;Press 3 on Keypad"

Would be output as

<Entry text="For English For Hearing Impaired
Press 3 on Keypad"

I read that minidom is causing this, as it doesn't allow escape characters in xml attributes (I think). Is this true? And, if so, what's the best tool/method to use to parse an xml file into a python document, manipulate nodes and exchange them with other documents, and output documents back to new files?

If it helps, I was also parsing and saving these files using 'utf-8' encoding. I don't know if this is part of the problem or not. Thanks for any help anyone can give.

-Alex Kaiser

何处潇湘 2024-10-06 06:13:01

I haven't used Python's standard xml modules since discovering lxml. It can do everything you're looking for. For example...

input.xml:

<?xml version="1.0" encoding='utf-8'?>
<root>
  <Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#xa;Press 3 on Keypad" />
</root>

and:

>>> from lxml import etree
>>> with open('input.xml', 'rb') as f:
...     root = etree.parse(f)
...
>>> buttons = root.xpath('//Button3')
>>> buttons
[<Element Button3 at 101071f18>]
>>> buttons[0]
<Element Button3 at 101071f18>
>>> buttons[0].attrib
{'yposition': '250', 'language1': 'For English For Hearing Impaired\nPress 3 on Keypad', 'fontsize': '16'}
>>> buttons[0].attrib['foo'] = 'bar'
>>> s = etree.tostring(root, xml_declaration=True, encoding='utf-8', pretty_print=True)
>>> print(s.decode('utf-8'))  # tostring() returns bytes when an encoding is given
<?xml version='1.0' encoding='utf-8'?>
<root>
  <Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#10;Press 3 on Keypad" foo="bar"/>
</root>
>>> with open('output.xml', 'wb') as f:  # binary mode, because s is bytes
...     f.write(s)
>>>

泪冰清 2024-10-06 06:13:01

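Unfortunately, the standard xml module has no option to turn this conversion off. So, for me, the best option was to escape the value back using the helper from ElementTree that the xml package itself uses for this purpose (the escape function in xml.sax.saxutils doesn't escape \n). Note that _escape_attrib is a private function: in Python 2 it takes an encoding as a second argument, while in Python 3 it takes only the text:

from xml.etree import ElementTree
text = ElementTree._escape_attrib(text)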

Text in source xml:

Here is a test message&#xa;With newline &amp; ampersand

Text after "decoding":

Here is a test message
With newline & ampersand

Text after "escaping back":

Here is a test message&#10;With newline &amp; ampersand
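
The same decode and re-escape cycle can be sketched with only the public ElementTree API, which avoids relying on the private helper. This is a minimal sketch, not the answerer's code: the Entry element and its text attribute are hypothetical sample data, and it assumes Python 3, where ElementTree's serializer writes a newline inside an attribute value as a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: an attribute holding a newline reference and an ampersand.
source = '<Entry text="Here is a test message&#xa;With newline &amp; ampersand" />'

# Parsing resolves the references: &#xa; becomes "\n" and &amp; becomes "&".
elem = ET.fromstring(source)
decoded = elem.get("text")

# Serializing escapes the attribute value again, so the newline is written
# as a character reference rather than as a raw line break.
serialized = ET.tostring(elem, encoding="unicode")

# Parsing the serialized form again should give back the same decoded value.
round_tripped = ET.fromstring(serialized).get("text")

print(repr(decoded))
print(serialized)
```

The reference may come back in a different spelling than it went in (decimal instead of hexadecimal), but both spellings denote the same character, so the attribute value survives the round trip.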

空城缀染半城烟沙 2024-10-06 06:13:01

&#xa; is the XML numeric character reference for character 0x0a, or a newline. The parser is correctly parsing the XML and giving you the characters indicated. If you want to forbid or otherwise deal with newlines in attributes, you are free to do whatever you like with them after the parser gives them to you.
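
To illustrate the point, here is a small sketch of what minidom hands back after parsing. The Entry element is sample data modeled on the question, and the variable names are illustrative. It also shows why the reference matters when a file is read again: under the XML attribute-value normalization rules, a raw line break typed directly inside an attribute is turned into a space by the parser, whereas a newline written as a character reference is kept.

```python
from xml.dom import minidom

# Newline written as a character reference, as in the question's input file.
with_reference = '<Entry text="For English For Hearing Impaired&#xa;Press 3 on Keypad"/>'

# The same attribute with a raw line break typed inside the quotes.
with_raw_newline = '<Entry text="For English For Hearing Impaired\nPress 3 on Keypad"/>'

doc = minidom.parseString(with_reference)
value_from_reference = doc.documentElement.getAttribute("text")

doc = minidom.parseString(with_raw_newline)
value_from_raw_newline = doc.documentElement.getAttribute("text")

# The reference is decoded to a real newline character.
print(repr(value_from_reference))

# The raw line break is normalized to a single space by the parser.
print(repr(value_from_raw_newline))
```

So the parsed value is correct either way; what needs care is how the serializer writes the newline back out, since a raw line break in the output file will not come back as a newline on the next parse.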
