Getting nested elements with lxml and etree

Posted 2025-02-08 08:57:44

I'm using lxml's etree to parse one XML file and write some of its elements to a new file. For the most part I've been able to copy single elements by addressing specific points in the tree, but I've hit a problem when trying to capture a node together with all of its children and grandchildren.

The input node looks like this:

            <abstract>
                <p>
                    Some explanation text
                </p>
                <p>
                    Some more explanation text 
                    <italic>
                        Title for link 
                    </italic>some brief description, visit 
                    <uri xlink:href="https://example.com">
                        https://example.com
                    </uri> 
                </p>
            </abstract>

And I get the following:

          <abstract>
            <p>
             Some explanation text.
             </p>
            <p>
             Some more explanation text 
             </p>
            <italic>
              Title for link
            </italic>
            <uri>
               https://example.com
            </uri>
          </abstract>

What I'm after is an exact replica of the input (including, if possible, the namespace). In my Python conversion script, I have the following:

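# abstract
if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    # descendant::* returns every descendant as one flat list in document order
    abs_list = tree.xpath(".//abstract/descendant::*")
    for para in abs_list:
        # each descendant becomes a direct child of <abstract>; only .text is copied
        ab_p = etree.SubElement(abstract, para.tag)
        ab_p.text = para.text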
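
To see why this attempt flattens the tree, it can help to inspect what a walk over the descendants actually yields. The following is a minimal sketch rather than the original script: the `SAMPLE` string, the `XLINK` constant and the xlink namespace URI are assumptions (the question does not show where the `xlink` prefix is declared), and the fallback import is only there so that the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Assumed namespace URI for the xlink prefix (not shown in the question).
XLINK = "http://www.w3.org/1999/xlink"

# Condensed, hypothetical version of the input shown above.
SAMPLE = (
    '<abstract xmlns:xlink="%s">'
    "<p>Some explanation text</p>"
    "<p>Some more explanation text "
    "<italic>Title for link</italic>some brief description, visit "
    '<uri xlink:href="https://example.com">https://example.com</uri>'
    "</p></abstract>" % XLINK
)

abstract = etree.fromstring(SAMPLE)

# iter() yields the element itself first, then every descendant in
# document order, so the nesting is no longer visible in the result.
flat_tags = [el.tag for el in abstract.iter()][1:]

italic = abstract.find(".//italic")
uri = abstract.find(".//uri")

print(flat_tags)          # tags only, parent/child relations are gone
print(repr(italic.text))  # text inside <italic>
print(repr(italic.tail))  # text after </italic>; .text never includes it
print(dict(uri.attrib))   # namespaced attribute stored as {namespace-uri}href
```

Copying only `para.tag` and `para.text` therefore drops three things at once: the parent of each element, its `tail` text, and its attributes (including the namespaced `href`).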

UPDATE 2022-06-16

I'm now able to get the structure:

            <abstract>
                <p>
                    Some explanation text
                </p>
                <p>
                    Some more explanation text 
                    <italic>
                        Title for link 
                    </italic>some brief description, visit 
                    <uri>
                        https://example.com
                    </uri> 
                </p>
            </abstract>

But I still haven't been able to get the namespace, and my code is limited in that it assumes that:

all direct children of the <abstract> tag are <p> tags
every other tag belongs to the previous <p> tag

The code I now have is:

if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    abs_elem = tree.xpath(".//abstract/descendant::*")
    for para in abs_elem:
        if para.tag == "p":
            # start a new <p> directly under <abstract>
            p = etree.SubElement(abstract, 'p')
            p.text = para.text
        else:
            # attach any other tag to the most recent <p>
            ab_p = etree.SubElement(p, para.tag)
            ab_p.text = para.text

Whilst this should be sufficient for most use cases, it's not very refined, and it will fail on any document that breaks one or both of the two assumptions above.

Any suggestions on how I can rewrite this so that it no longer relies on those two assumptions?
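
One possible direction, offered as a sketch under stated assumptions rather than a definitive answer: instead of rebuilding the subtree element by element, deep-copy the whole `<abstract>` node and append the copy to the target. A deep copy carries nesting, `text`, `tail` and attributes along, so it does not depend on what the children of `<abstract>` are. The `SOURCE` document, the xlink namespace URI and the helper name `copy_subtree` are hypothetical; `mods` stands in for the target element from the question, and the fallback import is only there so that the sketch runs without lxml.

```python
import copy

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Assumed namespace URI for the xlink prefix (not shown in the question).
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical source document wrapping the <abstract> from the question.
SOURCE = (
    '<article xmlns:xlink="%s"><front><abstract>'
    "<p>Some explanation text</p>"
    "<p>Some more explanation text "
    "<italic>Title for link</italic>some brief description, visit "
    '<uri xlink:href="https://example.com">https://example.com</uri>'
    "</p></abstract></front></article>" % XLINK
)


def copy_subtree(source_root, target_parent, path):
    """Append a deep copy of the first element matching `path`.

    Returns the copy, or None when nothing matches.
    """
    found = source_root.find(path)
    if found is None:
        return None
    # deepcopy leaves the source tree untouched; appending the original
    # lxml element instead would move it out of the source document.
    clone = copy.deepcopy(found)
    target_parent.append(clone)
    return clone


tree = etree.fromstring(SOURCE)
mods = etree.Element("mods")
abstract = copy_subtree(tree, mods, ".//abstract")

print(etree.tostring(mods).decode())
```

With this approach the namespaced attribute is kept under its namespace URI; which prefix appears in the serialized output depends on the library and on the namespace declarations in scope, so that part is worth checking against the real files.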
