Getting nested elements with lxml and etree

Posted 2025-02-08 08:57:44, 2768 characters, 1 view, 0 comments

I'm using lxml's etree to parse one XML file and write elements of it to a new file. For the most part I've been able to get single elements by specifying points in the tree, but I've come across a problem when trying to capture a node together with all its children and grandchildren.

The input node looks like this:

            <abstract>
                <p>
                    Some explanation text
                </p>
                <p>
                    Some more explanation text 
                    <italic>
                        Title for link 
                    </italic>some brief description, visit 
                    <uri xlink:href="https://example.com">
                        https://example.com
                    </uri> 
                </p>
            </abstract>

And I get the following:

          <abstract>
            <p>
             Some explanation text.
             </p>
            <p>
             Some more explanation text 
             </p>
            <italic>
              Title for link
            </italic>
            <uri>
               https://example.com
            </uri>
          </abstract>

What I'm after is an exact replica of the input (including, if possible, the namespace). In my Python conversion script, I have the following:

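# abstract
# `tree` is the parsed input and `mods` is the output root; both are defined elsewhere in the script (not shown)
if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    # descendant::* returns every element below <abstract> as one flat list
    abs_list = tree.xpath(".//abstract/descendant::*")
    for para in abs_list:
        ab_p = etree.SubElement(abstract, para.tag)
        ab_p.text = para.text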
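
The flat output seems to follow from the loop above: every descendant is attached directly to the new abstract element, and only `.text` is copied. In the etree model, text that follows an element's closing tag is stored on that element's `.tail`, and attributes live in `.attrib`. A minimal sketch of where each piece ends up, assuming the `xlink` prefix is bound to the standard XLink namespace URI (the question does not show the declaration):

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module shares the part of the API used here
    import xml.etree.ElementTree as etree

# Assumption: xlink is bound to the standard XLink namespace URI.
XLINK = "http://www.w3.org/1999/xlink"

sample = (
    '<abstract xmlns:xlink="%s">'
    '<p>Some more explanation text <italic>Title for link</italic>'
    'some brief description, visit '
    '<uri xlink:href="https://example.com">https://example.com</uri></p>'
    '</abstract>' % XLINK
)
root = etree.fromstring(sample)
italic = root.find(".//italic")
uri = root.find(".//uri")

print(repr(italic.text))  # the text inside <italic>
print(repr(italic.tail))  # the text after </italic>, up to <uri>
print(dict(uri.attrib))   # attributes, keyed as {namespace-uri}localname
```

Copying `.text` alone therefore drops both the trailing text and the `xlink:href` attribute.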

UPDATE 2022-06-16

I'm now able to get the structure:

            <abstract>
                <p>
                    Some explanation text
                </p>
                <p>
                    Some more explanation text 
                    <italic>
                        Title for link 
                    </italic>some brief description, visit 
                    <uri>
                        https://example.com
                    </uri> 
                </p>
            </abstract>

But I haven't been able to get the namespace and my code is limited in that it assumes that:

  1. all direct children of the <abstract> tag are <p> tags
  2. every other tag belongs to the previous <p> tag

The code I now have is:

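# `tree` and `mods` are defined elsewhere in the script (not shown)
if tree.find(".//abstract") is not None:
    abstract = etree.SubElement(mods, 'abstract')
    abs_elem = tree.xpath(".//abstract/descendant::*")
    for para in abs_elem:
        if para.tag == "p":
            # assumption 1: every direct child of <abstract> is a <p>
            p = etree.SubElement(abstract, 'p')
            p.text = para.text
        else:
            # assumption 2: any other tag belongs to the most recent <p>
            ab_p = etree.SubElement(p, para.tag)
            ab_p.text = para.text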

Whilst this should be sufficient for most use cases, it's not very refined: if a document breaks one (or both) of the two assumptions above, it will fail.

Any suggestions on how I can rewrite this to nullify the reliance on those 2 assumptions?
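
One possible direction, offered as a sketch rather than a definitive answer: copy the whole subtree instead of rebuilding it from a flat list. `copy.deepcopy` duplicates an element together with its descendants, text, tails and attributes, so neither assumption is needed. If tags need to be renamed or filtered along the way, a small recursive helper does the same job node by node. The sample document, the `mods` root, the `clone` helper name and the XLink namespace URI below are illustrative assumptions, and the helper assumes the subtree contains no comments or processing instructions.

```python
import copy

try:
    from lxml import etree
except ImportError:  # the stdlib module shares the part of the API used here
    import xml.etree.ElementTree as etree

# Assumption: xlink is bound to the standard XLink namespace URI.
XLINK = "http://www.w3.org/1999/xlink"

source = etree.fromstring(
    '<article xmlns:xlink="%s"><abstract>'
    '<p>Some explanation text</p>'
    '<p>Some more explanation text <italic>Title for link</italic>'
    'some brief description, visit '
    '<uri xlink:href="https://example.com">https://example.com</uri></p>'
    '</abstract></article>' % XLINK
)
mods = etree.Element("mods")  # stand-in for the output root in the question


def clone(element, parent):
    """Hypothetical helper: rebuild `element` under `parent`, recursively."""
    new = etree.SubElement(parent, element.tag, dict(element.attrib))
    new.text = element.text
    new.tail = element.tail
    for child in element:
        clone(child, new)  # each child is attached to its own copied parent
    return new


abstract = source.find(".//abstract")
if abstract is not None:
    # Option 1: copy the whole subtree in one call.
    mods.append(copy.deepcopy(abstract))
    # Option 2: rebuild it by hand, a natural place to rename or filter tags.
    clone(abstract, mods)

print(etree.tostring(mods).decode())
```

Both copies keep `xlink:href` under its namespace-qualified name, so the namespace itself survives. Which prefix appears in the serialized output depends on the library and on which namespace declarations are in scope in the output tree.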
