Is there a more Pythonic way to access the children of a parent element using lxml?

Posted 2024-09-07 12:25:19

I am poking at XBRL documents, trying to get my head around how to extract and use the data effectively. One thing I have been struggling with is making sure I use the context information correctly. Below is a snippet from one of the documents I am working with (from Mattel's latest 10-K).

I want to collect the context key-value pairs efficiently, because they are needed to align the 'real' data. Here is an example of a context element:

<context id="eol_PE6050----0910-K0010_STD_0_20091231_0">
  <entity>
    <identifier scheme="http://www.sec.gov/CIK">0000063276</identifier>
  </entity>
  <period>
    <instant>2009-12-31</instant>
  </period>
</context>

When I started this, I thought that if there was a parent-child relationship I should be able to get the attributes, keys, values and text of all the children directly by applying a method (?) to the parent. But the children retain their independence, even though they can be found from the parent. What I mean is that if the children have attributes, keys, values and/or text, those cannot be accessed directly from the parent; instead you have to identify the children and read the data or metadata you need from them.
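As a minimal sketch of that point, the example below parses the context element shown above from a string and reads the grandchildren through path expressions on the parent. It assumes the elements carry no namespace prefix, as in the snippet; a real XBRL file will normally need namespace-qualified paths. Only calls that lxml shares with the standard library's ElementTree are used, so the import falls back to the latter if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used here
    import xml.etree.ElementTree as etree

CONTEXT_XML = """
<context id="eol_PE6050----0910-K0010_STD_0_20091231_0">
  <entity>
    <identifier scheme="http://www.sec.gov/CIK">0000063276</identifier>
  </entity>
  <period>
    <instant>2009-12-31</instant>
  </period>
</context>
"""

context = etree.fromstring(CONTEXT_XML.strip())

# The parent only carries its own attributes ...
context_id = context.get("id")

# ... but a path expression reaches children and grandchildren in one call.
identifier = context.find("entity/identifier")
cik = identifier.text
scheme = identifier.get("scheme")
instant = context.findtext("period/instant")

print(context_id, cik, scheme, instant)
```

`find` returns `None` when the path does not match, and `findtext` accepts a `default` argument, so a missing child can be handled with a plain `if` rather than exception handling.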

I am not fully certain why this block of code is a good starting point:

 from lxml import etree
 test_tree = etree.parse(r'c:\temp\test_xml\mat-20091231.xml')
 # collect every element in the document into a list
 tree_list = [p for p in test_tree.getiterator()]

So my tree_list is a list of all the elements found in my XML file. Because there were only 664 items in tree_list, I made the very bad assumption that all of the elements within a parent were subsumed in the parent, so I kept trying to access the entity, period and instant by referencing just the parent elements (not their children):

contextlist = []
for each in tree_list:
    # substring match, so namespace-qualified tags are matched too
    if 'context' in each.tag:
        contextlist.append(each)
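A stricter variant of this filter can be sketched as follows. The substring test also accepts any tag that merely contains "context", and it fails on comment nodes, whose tag in lxml is not a string. The helper `local_name`, the element `contextNote`, the ids and the namespace URI are hypothetical names made up for this illustration.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used here
    import xml.etree.ElementTree as etree

# Hypothetical document: the namespace URI and element ids are illustrative only.
DOC = """
<xbrl xmlns="http://example.com/hypothetical-ns">
  <!-- a comment node; in lxml its .tag is not a string -->
  <context id="c1"/>
  <contextNote>not a context</contextNote>
  <context id="c2"/>
</xbrl>
"""


def local_name(element):
    """Return the tag without its '{namespace}' part, or None for non-elements."""
    tag = element.tag
    if not isinstance(tag, str):  # comments and processing instructions
        return None
    return tag.rsplit("}", 1)[-1]


root = etree.fromstring(DOC.strip())

# Exact match on the local name, so contextNote is not picked up.
contextlist = [el for el in root.iter() if local_name(el) == "context"]
ids = [c.get("id") for c in contextlist]
print(ids)
```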

That is, I kept applying different methods to the items in contextlist and got really frustrated. Finally, while writing out this question to get help figuring out which method would give me the entity and period, I just decided to try:

children=[c for c in contextlist[0].iterchildren()]

So my list children has all of the children of the first item in contextlist. One of the children is the entity element; the other is the period element.

Now, each of those children should have a child of its own: the entity element has an identifier child and the period element has an instant child. This is getting much more complicated than it seemed this morning.

I have to know the details reported by the context elements to correctly evaluate and operate on the real data. It seems like I have to test each of the children of the context elements. Is there a faster, more efficient way to get those values? Rephrased: is there a way to take some element and create a data structure that contains all of its children, grandchildren, etc., without having to write a fair number of try/else statements?
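One possible shape for such a data structure is a nested dictionary built by recursion, sketched below. The function name `element_to_dict` and the dictionary keys are choices made for this illustration, not part of lxml, and the sketch assumes the subtree contains no comment nodes.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used here
    import xml.etree.ElementTree as etree


def element_to_dict(element):
    """Recursively copy an element's tag, attributes, text and descendants."""
    return {
        "tag": element.tag,
        "attrib": dict(element.attrib),
        "text": (element.text or "").strip(),
        # Iterating over an element yields its direct children.
        "children": [element_to_dict(child) for child in element],
    }


context = etree.fromstring(
    '<context id="eol_PE6050----0910-K0010_STD_0_20091231_0">'
    '<entity><identifier scheme="http://www.sec.gov/CIK">0000063276</identifier></entity>'
    "<period><instant>2009-12-31</instant></period>"
    "</context>"
)

tree = element_to_dict(context)
entity, period = tree["children"]
print(entity["children"][0]["text"], period["children"][0]["text"])
```

Because the recursion simply stops at elements without children, no try/else handling is needed for contexts that are shaped differently.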

Once I have them I can start building a data dictionary and assign data elements to particular entries based on the context. So getting the context elements efficiently and completely is critical to my task.
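A lookup table of this kind could be sketched as a dictionary keyed by context id. The document below is made up for illustration: the ids `ctx_instant` and `ctx_duration` and the helper `leaf_values` are hypothetical, the second context uses `startDate`/`endDate` to show a differently shaped period, and the tags are assumed to carry no namespace.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used here
    import xml.etree.ElementTree as etree

# Illustrative document; ids and layout are assumptions, not taken from a filing.
DOC = """
<xbrl>
  <context id="ctx_instant">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000063276</identifier></entity>
    <period><instant>2009-12-31</instant></period>
  </context>
  <context id="ctx_duration">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000063276</identifier></entity>
    <period><startDate>2009-01-01</startDate><endDate>2009-12-31</endDate></period>
  </context>
</xbrl>
"""


def leaf_values(context):
    """Map the tag of every text-bearing descendant to its stripped text."""
    values = {}
    for el in context.iter():
        text = (el.text or "").strip()
        if isinstance(el.tag, str) and text:  # skip comments and empty nodes
            values[el.tag] = text
    return values


root = etree.fromstring(DOC.strip())
contexts = {c.get("id"): leaf_values(c) for c in root.iter("context")}

for context_id, values in contexts.items():
    print(context_id, values)
```

Each context contributes whatever descendants it happens to have, so instant and duration periods land in the same table without special cases.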

Comments (1)

燕归巢 2024-09-14 12:25:19

Using the element-tree interface (which lxml also supports), getiterator iterates over all the nodes in the subtree rooted at the current element.

So, [list(c.getiterator()) for c in contextlist] gives you the list of lists you want (or you may want to keep c in the resulting list to avoid having to zip it with contextlist later, i.e. directly make a list of tuples [(c, list(c.getiterator())) for c in contextlist], depending on your intended use).
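A runnable sketch of the list-of-tuples form follows. It uses `iter()` instead of `getiterator()`: to the best of my knowledge `getiterator()` is deprecated in lxml and was removed from the standard library's ElementTree in Python 3.9, while `iter()` walks the same subtree. The wrapping `xbrl` root element is an assumption for the example.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used here
    import xml.etree.ElementTree as etree

DOC = (
    "<xbrl>"
    '<context id="eol_PE6050----0910-K0010_STD_0_20091231_0">'
    '<entity><identifier scheme="http://www.sec.gov/CIK">0000063276</identifier></entity>'
    "<period><instant>2009-12-31</instant></period>"
    "</context>"
    "</xbrl>"
)

root = etree.fromstring(DOC)
contextlist = list(root.iter("context"))

# Pair each context with every node in its subtree, in document order.
# Note that iter() yields the context element itself first.
subtrees = [(c, list(c.iter())) for c in contextlist]

for c, nodes in subtrees:
    print(c.get("id"), [n.tag for n in nodes])
```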

Note in passing that a listcomp of the exact form [x for x in whatever] never makes much sense -- use list(whatever), instead, to turn whatever other iterable into a list.
