Getting nested elements with lxml and etree
I'm using lxml's etree to parse an XML file and write some of its elements to a new file. For the most part I've been able to get single elements by specifying their location in the tree, but I've run into a problem when trying to capture a node together with all of its children and grandchildren.
The input node looks like this:
<abstract>
<p>
Some explanation text
</p>
<p>
Some more explanation text
<italic>
Title for link
</italic>some brief description, visit
<uri xlink:href="https://example.com">
https://example.com
</uri>
</p>
</abstract>
And I get the following:
<abstract>
<p>
Some explanation text.
</p>
<p>
Some more explanation text
</p>
<italic>
Title for link
</italic>
<uri>
https://example.com
</uri>
</abstract>
What I'm after is an exact replica of the input (including, if possible, the namespace). In my Python conversion script, I have the following:
# abstract
if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    abs_list = tree.xpath(".//abstract/descendant::*")
    for para in abs_list:
        ab_p = etree.SubElement(abstract, para.tag)
        ab_p.text = para.text
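For reference, here is a minimal sketch of what appears to cause the flattening. It relies on two lxml behaviours: `descendant::*` returns every descendant as a flat list in document order, and text that follows a child's closing tag is stored on that child's `.tail`, not on the parent's `.text`. The shortened sample document is an assumption made for illustration.

```python
from lxml import etree

# Shortened, hypothetical version of the input from the question
root = etree.fromstring(
    "<abstract>"
    "<p>Some more explanation text "
    "<italic>Title for link</italic>some brief description, visit"
    "</p>"
    "</abstract>"
)

# The XPath result is a flat list: the nesting of <italic> inside <p> is lost
tags = [el.tag for el in root.xpath("./descendant::*")]

p = root.find("p")
italic = root.find(".//italic")

print(tags)                # flat list of tag names
print(repr(p.text))        # only the text before the first child element
print(repr(italic.tail))   # the text after </italic> lives on the child's tail
```

Copying only `.tag` and `.text` therefore drops attributes, tails and the parent/child relationships.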
UPDATE 2022-06-16
I'm now able to get the structure:
<abstract>
<p>
Some explanation text
</p>
<p>
Some more explanation text
<italic>
Title for link
</italic>some brief description, visit
<uri>
https://example.com
</uri>
</p>
</abstract>
But I haven't been able to get the namespace, and my code is limited in that it assumes that:
- all direct children of the <abstract> tag are <p> tags
- every other tag belongs to the previous <p> tag
The code I now have is:
if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    abs_elem = tree.xpath(".//abstract/descendant::*")
    for para in abs_elem:
        if para.tag == "p":
            p = etree.SubElement(abstract, 'p')
            p.text = para.text
        else:
            ab_p = etree.SubElement(p, para.tag)
            ab_p.text = para.text
Whilst this should be sufficient for most use cases, it's not very refined: if a document fails to meet one (or both) of the two assumptions above, the output will be wrong.
Any suggestions on how I can rewrite this to remove the reliance on those two assumptions?
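One possible direction, sketched here under stated assumptions and not as a definitive answer, is to skip rebuilding the subtree element by element and copy it whole with `copy.deepcopy`, then append the copy to the output root. The sample document, the `article` wrapper and the xlink namespace declaration on it are assumptions (the question does not show where `xlink` is declared), and `mods` is a stand-in for the output element used in the script.

```python
import copy
from lxml import etree

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical source document; the xlink declaration on the root is assumed
source = etree.fromstring(
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    "<abstract>"
    "<p>Some explanation text</p>"
    "<p>Some more explanation text "
    "<italic>Title for link</italic>some brief description, visit "
    '<uri xlink:href="https://example.com">https://example.com</uri>'
    "</p>"
    "</abstract>"
    "</article>"
)

# Stand-in for the output root that the conversion script builds
mods = etree.Element("mods")

abstract = source.find(".//abstract")
if abstract is not None:
    # deepcopy clones the element together with its attributes, text,
    # descendants and their tails, so no assumption about <p> is needed
    mods.append(copy.deepcopy(abstract))

print(etree.tostring(mods, pretty_print=True, encoding="unicode"))
```

Appending `abstract` directly would also work, but lxml would then move the element out of the source tree; copying first leaves the input untouched. Namespaced attributes remain reachable on the copy through the `{namespace}name` form, for example `uri.get("{http://www.w3.org/1999/xlink}href")`.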