XML parsing with Python and minidom
I'm using Python (minidom) to parse an XML file, and I want the program to print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationship):
My Document
Overview
    Basic Features
    About This Software
        Platforms Supported
Instead, the program iterates multiple times over the nodes and produces the following, printing duplicate nodes. (Looking at the node list at each iteration, it's obvious why it does this but I can't seem to find a way to get the node list I'm looking for.)
My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported
Here is the XML source file:
<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>
Here is the Python program:
import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("test.xml")
Topic = dom.getElementsByTagName('Topic')
i = 0
for node in Topic:
    # Returns every Title below this Topic, not only its own
    alist = node.getElementsByTagName('Title')
    for a in alist:
        Title = a.firstChild.data
        print(Title)
I could fix the problem by not nesting 'Topic' elements, by changing the lower level topic names to something like 'SubTopic1' and 'SubTopic2'. But, I want to take advantage of built-in XML hierarchical structuring without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level 'Topic' I'm currently looking at.
I've tried a number of different XPath functions without much success.
getElementsByTagName is recursive: you get all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the call returns the lower-level Titles several times.
If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:
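The answer's original filter code did not survive on this page. As a minimal sketch of the idea, assuming a hypothetical helper named child_elements and an inline copy of the question's XML (SAMPLE_XML), a direct-child filter could look like this:

```python
import xml.dom.minidom
from xml.dom import Node

# Inline copy of the question's document (assumption: same structure as test.xml)
SAMPLE_XML = """<DOCMAP>
  <Topic Target="ALL"><Title>My Document</Title></Topic>
  <Topic Target="ALL"><Title>Overview</Title>
    <Topic Target="ALL"><Title>Basic Features</Title></Topic>
    <Topic Target="ALL"><Title>About This Software</Title>
      <Topic Target="ALL"><Title>Platforms Supported</Title></Topic>
    </Topic>
  </Topic>
</DOCMAP>"""


def child_elements(node, tag_name):
    """Return only the direct child elements of node named tag_name.

    Hypothetical helper, not part of minidom. Whitespace text nodes
    between elements are skipped by the nodeType check.
    """
    return [
        child for child in node.childNodes
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == tag_name
    ]


dom = xml.dom.minidom.parseString(SAMPLE_XML)
top_level_topics = child_elements(dom.documentElement, "Topic")
top_level_titles = [
    child_elements(topic, "Title")[0].firstChild.data
    for topic in top_level_topics
]
print(top_level_titles)  # ['My Document', 'Overview']
```

Unlike getElementsByTagName, this looks only one level down, so each Topic reports its own Title and nothing from nested Topics.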
The following works:
I think that can help
Output:
You could use the following generator to run through the list and get titles with indentation levels:
If you test it with your file:
you will get a list with the following tuples:
Of course, this is only a basic idea that needs fine-tuning. If you just want leading spaces, you can produce them directly in the generator, but yielding the level gives you more flexibility. You could also detect the first level automatically (initializing the level to -1 here is just a crude shortcut).
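The generator itself is missing from this page, so the following is only a sketch of the approach described above, not the answerer's code. It assumes a hypothetical generator named titles_with_levels that derives each level by counting enclosing Topic elements, and it borrows the -1 starting value mentioned in the answer:

```python
import xml.dom.minidom
from xml.dom import Node

# Inline copy of the question's document (assumption: same structure as test.xml)
SAMPLE_XML = """<DOCMAP>
  <Topic Target="ALL"><Title>My Document</Title></Topic>
  <Topic Target="ALL"><Title>Overview</Title>
    <Topic Target="ALL"><Title>Basic Features</Title></Topic>
    <Topic Target="ALL"><Title>About This Software</Title>
      <Topic Target="ALL"><Title>Platforms Supported</Title></Topic>
    </Topic>
  </Topic>
</DOCMAP>"""


def titles_with_levels(dom):
    """Yield (level, title) for every Title element, in document order."""
    for title in dom.getElementsByTagName("Title"):
        level = -1  # the Topic that owns this Title brings the level to 0
        ancestor = title.parentNode
        # Walk up until we leave the element tree (the Document node)
        while ancestor is not None and ancestor.nodeType == Node.ELEMENT_NODE:
            if ancestor.tagName == "Topic":
                level += 1
            ancestor = ancestor.parentNode
        yield level, title.firstChild.data


dom = xml.dom.minidom.parseString(SAMPLE_XML)
for level, title in titles_with_levels(dom):
    print("    " * level + title)
```

Because the level is yielded separately from the title, the caller decides how to render it (spaces, numbering, HTML nesting, and so on).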
Recursive function:
Your xml:
Your desired output:
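The code and output for this answer were lost from the page. A minimal sketch of a recursive approach, assuming a hypothetical function named walk_topics and an inline copy of the question's XML, might look like this:

```python
import xml.dom.minidom
from xml.dom import Node

# Inline copy of the question's document (assumption: same structure as test.xml)
SAMPLE_XML = """<DOCMAP>
  <Topic Target="ALL"><Title>My Document</Title></Topic>
  <Topic Target="ALL"><Title>Overview</Title>
    <Topic Target="ALL"><Title>Basic Features</Title></Topic>
    <Topic Target="ALL"><Title>About This Software</Title>
      <Topic Target="ALL"><Title>Platforms Supported</Title></Topic>
    </Topic>
  </Topic>
</DOCMAP>"""


def walk_topics(node, depth=0, indent="    "):
    """Return indented title lines for the Topic elements directly under node."""
    lines = []
    for child in node.childNodes:
        if child.nodeType != Node.ELEMENT_NODE or child.tagName != "Topic":
            continue
        # Emit this Topic's own Title (direct children only)
        for sub in child.childNodes:
            if sub.nodeType == Node.ELEMENT_NODE and sub.tagName == "Title":
                lines.append(indent * depth + sub.firstChild.data)
        # Then descend into nested Topics, one level deeper
        lines.extend(walk_topics(child, depth + 1, indent))
    return lines


dom = xml.dom.minidom.parseString(SAMPLE_XML)
outline = walk_topics(dom.documentElement)
print("\n".join(outline))
```

The recursion depth mirrors the nesting of the Topic elements, so each title is visited exactly once and no separate element names per level are needed.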