Preserving escape characters in Python XML parsing
I'm trying to write a Python script that takes in one or two XML files and outputs one or two new files based on the contents of the input files. I was trying to write this script using the minidom module. However, the input files contain a number of instances of the escape character &#xA; inside node attributes. Unfortunately, in the output files, these characters have been converted to different characters, which seem to be newline characters.
For example, a line in the input file such as:
<Entry text="For English For Hearing Impaired&#xA;Press 3 on Keypad"
Would be output as
<Entry text="For English For Hearing Impaired
Press 3 on Keypad"
I read that minidom is causing this, as it doesn't allow escape characters in xml attributes (I think). Is this true? And, if so, what's the best tool/method to use to parse an xml file into a python document, manipulate nodes and exchange them with other documents, and output documents back to new files?
If it helps, I was also parsing and saving these files using 'utf-8' encoding. I don't know if this is part of the problem or not. Thanks for any help anyone can give.
-Alex Kaiser
Comments (3)
I haven't used Python's standard xml modules since discovering lxml. It can do everything you're looking for. For example...
input.xml:
and:
Unfortunately, the standard xml module doesn't have an option to turn off escaping. So, for me, the best option was to escape it back using the method from ElementTree that xml itself uses for this purpose (the method from xml.sax.saxutils doesn't escape \n):
Text in source xml:
Text after "decoding":
Text after "escaping back":
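The "decode, then escape back" idea can be sketched roughly as follows. This is not the answer's original code (which did not survive on this page); it is a minimal illustration that sticks to public standard-library calls instead of ElementTree's private helper. The sample element is the one from the question, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample element from the question, with the newline written as a character reference.
source = '<Entry text="For English For Hearing Impaired&#xA;Press 3 on Keypad" />'

# "Decoding": the parser resolves the reference into a real newline character.
entry = ET.fromstring(source)
decoded = entry.get("text")

# saxutils.escape() leaves "\n" untouched unless given an explicit mapping.
plain_escape = escape(decoded)
mapped_escape = escape(decoded, {"\n": "&#10;"})

# "Escaping back": ElementTree's serializer writes the newline as a
# character reference again when it appears inside an attribute value.
serialized = ET.tostring(entry, encoding="unicode")
print(serialized)
```

Note that the reference comes back in decimal form (&#10;) rather than the hexadecimal &#xA; of the input; both denote the same character, so any conforming parser reads the output the same way.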
&#xA; is the XML character reference for character 0x0a, or a newline. The parser is correctly parsing the XML and giving you the character indicated. If you want to forbid or otherwise deal with newlines in attributes, you are free to do whatever you like with them after the parser gives them to you.
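To see this behaviour concretely, here is a small sketch using minidom, the module from the question. The sample element comes from the question; replacing the newline with a space at the end is just one hypothetical way of dealing with it after parsing.

```python
from xml.dom import minidom

# Sample element from the question, with the newline written as a character reference.
source = '<Entry text="For English For Hearing Impaired&#xA;Press 3 on Keypad" />'

doc = minidom.parseString(source)
text = doc.documentElement.getAttribute("text")

# The reference has already been resolved: the value holds a real newline.
lines = text.split("\n")
print(lines)

# After parsing, the newline can be handled however you like,
# for example by replacing it with a space (an arbitrary choice here).
flattened = text.replace("\n", " ")
print(flattened)
```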