How to get Python XML to stop wasting child nodes on whitespace

Posted 2024-11-14 16:46:26

I have a simple XML document I'm trying to read in with Python DOM (see below):

XML File:

<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>

Python Code:

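from xml.dom import minidom

# Parse the file and keep a reference to the root element
with open("test.xml") as xml_file:
    xmlroot = minidom.parse(xml_file).documentElement

# Print every child node of the first <Header> element
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)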

Result:

<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">

The result should be 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but for some reason the parser returns a list of 19 nodes, 10 of which are whitespace-only text nodes (why is whitespace even considered a node in the first place?!). Can anyone tell me if there's a way to get the XML parser to not include whitespace nodes?

Answer by 暖伴, 2024-11-21 16:46:26

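Whitespace is significant in XML, so a DOM parser keeps the indentation between elements as text nodes. Check out ElementTree, though: it has a different API for processing XML than the DOM, and iterating over an element yields only its child elements.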

Example

from xml.etree import ElementTree as et

data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''

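# Parse the XML string; fromstring() returns the root element
tree = et.fromstring(data)
# Iterating over an element yields only its child elements
for n in tree.find('Header'):
    print(n.tag, '=', n.text)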

Output

Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
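
For comparison, if the code has to stay on minidom, the nine elements can be picked out of childNodes by checking each node's nodeType. This is only a minimal sketch: the helper name element_children is made up for illustration, and the XML is inlined as a string instead of being read from test.xml.

```python
from xml.dom import minidom

XML = """<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
"""


def element_children(node):
    """Hypothetical helper: return only the element children of a DOM node."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


root = minidom.parseString(XML).documentElement
header = root.getElementsByTagName("Header")[0]

for item in element_children(header):
    # firstChild is the text node that holds the element's value
    print(item.tagName, "=", item.firstChild.data)
```

The whitespace text nodes are still in the tree; the filter just skips them while iterating.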

Example (extending the ElementTree code above)

The whitespace is still present, but it is stored in the .text and .tail attributes instead of separate nodes. tail is the text that follows an element (between the end tag of that element and the start of the next tag), while text is the text between an element's start tag and its first child or end tag.

def dump(e):
    # Recursively print each element together with its text and tail
    print('<%s>' % e.tag)
    print('text =', repr(e.text))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail =', repr(e.tail))

dump(tree)

Output

<HeaderLookup>
text = '\n    '
<Header>
text = '\n        '
<Reserved>
text = '2'
</Reserved>
tail = '\n        '
<CPU>
text = '1'
</CPU>
tail = '\n        '
<Flag>
text = '1'
</Flag>
tail = '\n        '
<VQI>
text = '12'
</VQI>
tail = '\n        '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n        '
<DI>
text = '2'
</DI>
tail = '\n        '
<DE>
text = '1'
</DE>
tail = '\n        '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n        '
<RGH>
text = '8'
</RGH>
tail = '\n    '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
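
If the indentation is not wanted at all, one option is to blank out whitespace-only text and tail values after parsing. The following is a rough sketch, not something ElementTree does for you: strip_whitespace is a made-up helper name, the sample document is shortened, and the approach is only safe when the whitespace carries no meaning (for example, no mixed content).

```python
from xml.etree import ElementTree as et


def strip_whitespace(elem):
    """Hypothetical helper: set whitespace-only text/tail to None, in place."""
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    if elem.tail is not None and not elem.tail.strip():
        elem.tail = None
    for child in elem:
        strip_whitespace(child)


doc = et.fromstring(
    "<HeaderLookup>\n"
    "    <Header>\n"
    "        <CPU>1</CPU>\n"
    "        <Flag>1</Flag>\n"
    "    </Header>\n"
    "</HeaderLookup>"
)
strip_whitespace(doc)

# Serialize to a str to show that the indentation is gone
print(et.tostring(doc, encoding="unicode"))
```

Element values such as '1' are left untouched because they are not whitespace-only.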
