How can I stop the Python XML parser from creating whitespace child nodes?
I have a simple XML document I'm trying to read in with Python DOM (see below):
XML File:
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
	<Header>
		<Reserved>2</Reserved>
		<CPU>1</CPU>
		<Flag>1</Flag>
		<VQI>12</VQI>
		<Group_ID>16</Group_ID>
		<DI>2</DI>
		<DE>1</DE>
		<ACOSS>5</ACOSS>
		<RGH>8</RGH>
	</Header>
</HeaderLookup>
Python Code:
from xml.dom import minidom
xml_file = open("test.xml")
xmlroot = minidom.parse(xml_file).documentElement
xml_file.close()
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)
Result:
<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">
The result should be 9 Child Nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but for some reason it is returning a list of 19 nodes with 10 of them being whitespace (why is this even being considered a node in the first place?!). Can anyone tell me if there's a way to get the XML parser to not include whitespace nodes?
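The counts described above can be reproduced with a small self-contained sketch. It assumes Python 3 and inlines the XML instead of reading test.xml; the variable names (all_nodes, elements, element_names) are illustrative only. Filtering on nodeType is one possible way to keep just the element children while staying with minidom, not necessarily the only or best one.

```python
from xml.dom import minidom

# Same structure as test.xml, inlined so the sketch runs on its own.
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
"""

root = minidom.parseString(XML).documentElement
header = root.getElementsByTagName("Header")[0]

# childNodes holds every DOM node, including whitespace-only Text nodes.
all_nodes = header.childNodes

# Keep only Element nodes; nodeType distinguishes Element from Text.
elements = [n for n in all_nodes if n.nodeType == minidom.Node.ELEMENT_NODE]
element_names = [n.tagName for n in elements]

print(len(all_nodes), len(elements))
print(element_names)
```

The whitespace Text nodes come from the line breaks and indentation between the tags, which the DOM preserves as ordinary character data.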
Comments (1)
Whitespace is significant in XML, but check out ElementTree, which has a different API for processing XML than the DOM.
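As a rough illustration of the ElementTree API, the sketch below iterates over the children of Header. It assumes Python 3 and the standard-library xml.etree.ElementTree module, and it inlines the question's XML instead of reading test.xml; the name children is illustrative only.

```python
import xml.etree.ElementTree as ET

# Same structure as the question's test.xml, inlined for a standalone run.
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
"""

root = ET.fromstring(XML)      # root is the <HeaderLookup> element
header = root.find("Header")   # first direct child named Header

# Iterating an Element yields only its child elements, never text nodes.
children = [(child.tag, child.text) for child in header]

for tag, text in children:
    print(tag, text)
```

Iterating header here yields the nine elements only, because ElementTree stores surrounding text on the elements themselves and does not model it as separate nodes.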
Example
Output
Example (extending previous code)
The whitespace is still present, but it is in the .tail attributes. tail is the text node that follows an element (between the end of one element and the start of the next), while text is the text node between the start and end tags of an element.
Output
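The difference between text and tail can be seen in a short sketch. It assumes Python 3 and uses a cut-down, two-child version of the Header element (a hypothetical fragment, not the full file from the question).

```python
import xml.etree.ElementTree as ET

# A reduced fragment with tab indentation, written with explicit escapes
# so the whitespace is visible.
XML = "<Header>\n\t<CPU>1</CPU>\n\t<Flag>1</Flag>\n</Header>"

header = ET.fromstring(XML)
cpu, flag = list(header)

# text: character data between an element's start tag and its first child
# (or its end tag, if it has no children).
print(repr(header.text))  # '\n\t'
print(repr(cpu.text))     # '1'

# tail: character data after an element's end tag, up to the next tag.
print(repr(cpu.tail))     # '\n\t'
print(repr(flag.tail))    # '\n'
```

If the indentation is unwanted, code that reads text or tail can call strip() on the value, keeping in mind that either attribute may be None when no character data is present.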