Parsing oddly structured XML with lxml
I have a number of XML files that I need to parse. I've written some code that works, but is ugly, and I'd like to get some advice from people more experienced with XML than I am.
First of all, I might be using some terms in the wrong context, because my experience with XML is limited. By element, unless specified otherwise, I mean something like this:
<root>
<element>
...
</element>
<element>
...
</element>
</root>
Anyway, each file consists of a number of elements, each with a number of child elements (obviously). What stumps me is that the relevant values need to be accessed in four different ways:
1) Node text:
<tag>value</tag>
2) Attribute:
<tag attribute="value"></tag>
3) A value "hidden" inside a tag ("true" in this case):
<tag><boolean.true/></tag>
4) Values inside tags of the same name ("tagA"), but with "grandparent" tags with different names ("tag1" and "tag2"), all within the same element. "tagA" is of no use to me, instead I will be looking for "tag1" and "tag2".
<element>
<tag1><tagA>value</tagA></tag1>
<tag2><tagA>value</tagA></tag2>
</element>
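A minimal sketch of how the four access patterns above might look in code. It uses the standard library's xml.etree.ElementTree so it runs without extra installs; lxml.etree accepts the same find, findtext and get calls. The sample document and the tag names (name, car, flag) are made up for illustration and are not from the real files.

```python
import xml.etree.ElementTree as ET  # lxml.etree accepts the same calls used here

# Hypothetical sample combining the four cases; tag names are illustrative only.
SAMPLE = """
<root>
  <element>
    <name>value1</name>
    <car color="red"></car>
    <flag><boolean.true/></flag>
    <tag1><tagA>first</tagA></tag1>
    <tag2><tagA>second</tagA></tag2>
  </element>
</root>
"""

root = ET.fromstring(SAMPLE)
element = root.find("element")

# 1) Node text
node_text = element.findtext("name")

# 2) Attribute value (guard against the tag being absent)
car = element.find("car")
attribute = car.get("color") if car is not None else None

# 3) Value encoded in the name of an empty child tag, e.g. <boolean.true/>
flag = element.find("flag")
hidden = None
if flag is not None and len(flag):
    hidden = flag[0].tag.split(".", 1)[-1]  # "boolean.true" -> "true"

# 4) Same inner tag name, distinguished by the enclosing tag
first = element.findtext("tag1/tagA")
second = element.findtext("tag2/tagA")

print(node_text, attribute, hidden, first, second)
```

Addressing case 4 by path ("tag1/tagA") avoids having to inspect the parent of every tagA while iterating.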
At the moment I have a dictionary with each file as a key. The values are dictionaries with the keys "attributes", "node text", "tag" and "parent element".
Example:
{'file1.xml': {'attributes': {'Person': 'Id', 'Car': 'Color'},
               'node text': ['Name', 'Address'],
               }}
Where "Person" and "Car" are tags, and "Id" and "Color" are attribute names.
This makes it easy to iterate over all elements and inspect each tag, and if there is a match in the dictionary (if elem.tag in dict['file1.xml']['attributes']), extract the value.
So as I said, the code works, but I don't like my solution. Also, not all the elements have all the child elements (for example, a Person might not own a car, in which case that tag will be missing altogether), and I need to assign "None" to those values. Right now I get all the tags that should exist for every element in each file, turn them into a set, then check the difference between those and the set of tags that I've actually extracted values from for that element. Again, the code is pretty ugly.
Hopefully this mess makes some sense.
edit:
I used J.F. Sebastian's suggestion of storing the xpath to each value in a dictionary with the field name as the key and xpath as value.
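Roughly, that approach could look like the sketch below. This is an illustration under assumptions, not the exact code: the field names are hypothetical, and it uses the standard library's ElementTree, whose path syntax is only a subset of XPath, so each entry is a (path, attribute name) pair. With lxml, each pair could presumably be replaced by a single XPath string such as Person/@Id passed to element.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical field names. Each value is (path relative to the element,
# attribute name, or None when the node text is wanted).
FIELDS = {
    "name": ("Name", None),
    "person_id": ("Person", "Id"),
    "car_color": ("Car", "Color"),
}

def extract(element, fields=FIELDS):
    """Return one value per field; a field whose tag is absent becomes None."""
    row = {}
    for field, (path, attr) in fields.items():
        node = element.find(path)
        if node is None:
            row[field] = None          # missing tag, e.g. a Person without a Car
        elif attr is None:
            row[field] = node.text     # node text
        else:
            row[field] = node.get(attr)  # attribute value
    return row

doc = ET.fromstring(
    "<root>"
    "<element><Name>Ann</Name><Person Id='7'/><Car Color='red'/></element>"
    "<element><Name>Bob</Name><Person Id='8'/></element>"
    "</root>"
)
rows = [extract(el) for el in doc.iter("element")]
```

Because every field is looked up explicitly, missing tags come back as None without comparing sets of expected and found tags.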
Comments (3)
You could streamline your input code by using xpath expressions relative to your element instead of a complex data structure; that covers cases #1-4.
What output data structure to use depends on how you want to use it in your code later. You could start with a structure that is most convenient for your current code, and evolve it into a more general solution later when you better understand the requirements.
You could use an ordinary dict and csv.DictWriter(fieldnames=xpathdict.keys()), where xpathdict is a mapping between field names and the corresponding xpath expressions. For generality, you could store function objects f(element) -> csv field instead of, or in addition to, xpath exprs.
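As a sketch of the function-object variant combined with csv.DictWriter: the mapping name extractors, the field names and the sample document below are hypothetical, and the standard library's ElementTree stands in for lxml (the find, findtext and get calls are the same in both).

```python
import csv
import io
import xml.etree.ElementTree as ET

def car_color(el):
    """Attribute lookup that tolerates a missing <Car> tag."""
    car = el.find("Car")
    return car.get("Color") if car is not None else None

# Hypothetical mapping: field name -> function f(element) -> csv field.
extractors = {
    "name": lambda el: el.findtext("Name"),
    "car_color": car_color,
    "first": lambda el: el.findtext("tag1/tagA"),
}

doc = ET.fromstring(
    "<root>"
    "<element><Name>Ann</Name><Car Color='red'/><tag1><tagA>x</tagA></tag1></element>"
    "<element><Name>Bob</Name></element>"
    "</root>"
)

out = io.StringIO()  # stands in for an open output file
writer = csv.DictWriter(out, fieldnames=list(extractors.keys()))
writer.writeheader()
for el in doc.iter("element"):
    # One row per element; None values are written as empty fields.
    writer.writerow({field: f(el) for field, f in extractors.items()})

csv_text = out.getvalue()
```

Storing callables keeps unusual cases, such as the value hidden in a tag name, in one small function each instead of in special branches of the main loop.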
I don't think #3 is legal XML, because there's no opening tag associated with it, and even if there is one somewhere else, it wouldn't be properly nested in that example. The expression will be interpreted as a closing tag because of the < character.
I'm assuming that you'd want to take something like this:
And get something like this:
To do this I'd do something like this (untested):