Parsing oddly structured XML with lxml

Posted 2024-12-12 10:03:49

I have a number of XML files that I need to parse. I've written some code that works, but is ugly, and I'd like to get some advice from people more experienced with XML than I am.

First of all, I might be using some terms in the wrong context, because my experience with XML is limited. By element, unless specified otherwise, I mean something like this:

 <root>
  <element>
   ...
  </element>
  <element>
   ...
  </element>
 </root>

Anyway, each file consists of a number of elements, each with a number of child elements (obviously). What stumps me is that the relevant values need to be accessed in four different ways:

1) Node text:

<tag>value</tag>

2) Attribute:

<tag attribute="value"></tag>

3) A value "hidden" inside a tag ("true" in this case):

<tag><boolean.true/></tag>

4) Values inside tags of the same name ("tagA"), but with "grandparent" tags with different names ("tag1" and "tag2"), all within the same element. "tagA" is of no use to me, instead I will be looking for "tag1" and "tag2".

<element>
   <tag1><tagA>value</tagA></tag1>
   <tag2><tagA>value</tagA></tag2>
</element>

At the moment I have a dictionary with each file as a key. The values are dictionaries with the keys "attribute", "node text", "tag" and "parent element".

Example:

{'file1.xml': {'attributes': {'Person': 'Id', 'Car': 'Color'},
               'node text': ['Name', 'Address']}}

Where "Person" and "Car" are tags, and "Id" and "Color" are attribute names.

This makes it easy to iterate over all elements and inspect each tag, and if there is a match in the dictionary (if elem.tag in dict['file1.xml']['attributes']), extract the value.

So as I said, the code works, but I don't like my solution. Also, not all the elements have all the child elements (for example, a Person might not own a car, in which case that tag will be missing altogether), and I need to assign those values "None". Right now I get all the tags that should exist for every element in each file, turn them into a set, then check the difference between that set and the set of tags that I've actually extracted values from for that element. Again, the code is pretty ugly.
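For concreteness, the approach described above might look roughly like the following sketch. The `FIELDS` table, the `extract` helper and the sample XML are hypothetical, written only to illustrate the lookup-table iteration and the set difference used to fill in "None".

```python
from lxml import etree

# Hypothetical lookup table in the shape described above.
FIELDS = {
    "file1.xml": {
        "attributes": {"Person": "Id", "Car": "Color"},
        "node text": ["Name", "Address"],
    }
}

def extract(element, spec):
    """Collect values for one <element>, defaulting missing tags to None."""
    expected = set(spec["attributes"]) | set(spec["node text"])
    found = {}
    for child in element.iter():
        if child.tag in spec["attributes"]:
            found[child.tag] = child.get(spec["attributes"][child.tag])
        elif child.tag in spec["node text"]:
            found[child.tag] = child.text
    # Tags that should exist but were never seen get None.
    for missing in expected - set(found):
        found[missing] = None
    return found

# Made-up input: a person without a car and without an address.
root = etree.fromstring(
    "<root><element><Person Id='7'/><Name>Ann</Name></element></root>"
)
rows = [extract(e, FIELDS["file1.xml"]) for e in root.findall("element")]
print(rows)
```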

Hopefully this mess makes some sense.

edit:

I used J.F. Sebastian's suggestion of storing the xpath to each value in a dictionary with the field name as the key and xpath as value.


遥远的她 2024-12-19 10:03:49

You could streamline your input code by using xpath expressions relative to your element instead of a complex data structure, e.g., for cases #1-4:

tag/text()
tag/@attribute
name(DTBoolean/*[1])
(tag1|tag2)/*/text()
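As a minimal sketch of how these expressions behave in lxml: the sample document below is made up, and the third expression uses a hypothetical `flag` wrapper element in place of `DTBoolean`. Node-set expressions come back as lists, while `name(...)` comes back as a single string.

```python
from lxml import etree

# Hypothetical sample combining the four cases from the question.
SAMPLE = """<root>
  <element>
    <tag attribute="attr value">text value</tag>
    <flag><boolean.true/></flag>
    <tag1><tagA>first</tagA></tag1>
    <tag2><tagA>second</tagA></tag2>
  </element>
</root>"""

root = etree.fromstring(SAMPLE)
element = root.find("element")

node_text = element.xpath("tag/text()")         # case 1: list of text nodes
attribute = element.xpath("tag/@attribute")     # case 2: list of attribute values
hidden = element.xpath("name(flag/*[1])")       # case 3: name of the first child
nested = element.xpath("(tag1|tag2)/*/text()")  # case 4: list, in document order

print(node_text, attribute, hidden, nested)
```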

What output data structure to use depends on how you want to use it in your code later. You could start with the structure that is most convenient for your current code, and evolve it into a more general solution once you understand the requirements better.

"I output it to csv, where each element is one row in the csv file."
...
"I use a defaultdict to store the elements and then store those in a list before I output them to csv."

You could use an ordinary dict and csv.DictWriter(fieldnames=xpathdict.keys()):

# for each element
row_dict = dict.fromkeys(xpathdict.keys())
...
# for each key 
row_dict[key] = element.xpath(xpathdict[key]) or None
...
dictwriter.writerow(row_dict)

Here xpathdict is a mapping between field names and the corresponding xpath expressions. For generality, you could store function objects f(element) -> csv field instead of, or in addition to, xpath expressions.
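A runnable sketch of this idea follows. The field names, the sample XML, and the `first_or_none` and `write_rows` helpers are assumptions made for illustration. Values in `xpathdict` may be either an xpath string or a function of the element, and output goes to an in-memory buffer rather than a real file.

```python
import csv
import io
from lxml import etree

# Hypothetical field-name -> extractor mapping; tag names are made up.
xpathdict = {
    "name": "Name/text()",
    "person_id": "Person/@Id",
    "car_color": "Car/@Color",
    "has_car": lambda e: bool(e.xpath("Car")),  # function object instead of xpath
}

def first_or_none(result):
    """These xpath queries return lists; keep the first hit or None."""
    return result[0] if result else None

def write_rows(root, out):
    writer = csv.DictWriter(out, fieldnames=list(xpathdict))
    writer.writeheader()
    for element in root.xpath("/root/element"):
        row = dict.fromkeys(xpathdict)  # every field starts as None
        for key, expr in xpathdict.items():
            if callable(expr):
                row[key] = expr(element)
            else:
                row[key] = first_or_none(element.xpath(expr))
        writer.writerow(row)

root = etree.fromstring(
    "<root>"
    "<element><Name>Ann</Name><Person Id='7'/><Car Color='red'/></element>"
    "<element><Name>Bob</Name><Person Id='8'/></element>"
    "</root>"
)
buffer = io.StringIO()
write_rows(root, buffer)
print(buffer.getvalue())
```

Fields left as None are written as empty cells by the csv module, so a missing tag shows up as an empty column rather than a shifted row.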


旧人 2024-12-19 10:03:49

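I don't think #3 is legal XML, because there is no opening tag associated with it, and even if there were one elsewhere, it would not be properly nested in that example. Because of the < character, the expression would be interpreted as a closing tag.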
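One way to test this is to hand the snippet from case #3 to a parser. The following is a minimal sketch, assuming lxml is installed. Under the XML specification, `<boolean.true/>` is an empty-element tag (the self-closing form, with the slash before the closing bracket), so the call below is expected to succeed rather than raise.

```python
from lxml import etree

snippet = "<tag><boolean.true/></tag>"

# etree.fromstring raises etree.XMLSyntaxError on malformed input.
tag = etree.fromstring(snippet)
child = tag[0]

# The child is an ordinary element with no children and no text.
print(child.tag, len(child), child.text)
```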


几度春秋 2024-12-19 10:03:49

I'm assuming that you'd want to take something like this:

<root>
  <element>
    <text_attribute>Some Text</text_attribute>
    <attribute var="blah"/>
    <bool_attribute><boolean.true/></bool_attribute>
  </element>
  <element>
    <text_attribute>Some more Text</text_attribute>
    <attribute var="blah again"/>
    <bool_attribute><boolean.false/></bool_attribute>
  </element>
</root>

And get something like this:

[
   { "text_attribute":"Some Text", "attribute":"blah", "bool_attribute":True },
   { "text_attribute":"Some more Text", "attribute":"blah again", "bool_attribute":False }
]

To do this I'd do something like this (untested):

from lxml import etree

# Helper function so we can extract a default from an xpath result if empty
def get_first(x, default_value):
  if len(x) > 0:
    return x[0]
  return default_value

# Parse one element
def process_element(e):
  retval = {}
  retval['text_attribute'] = get_first(e.xpath("text_attribute/text()"), "default text")
  retval['attribute'] = get_first(e.xpath("attribute/@var"), "default attribute")
  # True only if a <boolean.true/> child is present, otherwise False
  retval['bool_attribute'] = bool(e.xpath("bool_attribute/boolean.true"))
  return retval

# Parse all the elements ('input.xml' is a placeholder file name)
xml = etree.parse('input.xml')
elements = []
elements_xml = xml.xpath('/root/element')
for e in elements_xml:
  elements.append(process_element(e))
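To check the idea end to end, here is a self-contained sketch that runs against the sample document from this answer, parsed from a string instead of a file. It is a variation rather than the code above: it is compacted into one function and uses the XPath `boolean()` function, which lxml returns as a Python bool, for the flag field.

```python
from lxml import etree

SAMPLE = """<root>
  <element>
    <text_attribute>Some Text</text_attribute>
    <attribute var="blah"/>
    <bool_attribute><boolean.true/></bool_attribute>
  </element>
  <element>
    <text_attribute>Some more Text</text_attribute>
    <attribute var="blah again"/>
    <bool_attribute><boolean.false/></bool_attribute>
  </element>
</root>"""

def process_element(e):
    """Turn one <element> into a plain dict, with defaults for missing parts."""
    text = e.xpath("text_attribute/text()")
    var = e.xpath("attribute/@var")
    return {
        "text_attribute": text[0] if text else "default text",
        "attribute": var[0] if var else "default attribute",
        # boolean() is True when the node set is non-empty
        "bool_attribute": e.xpath("boolean(bool_attribute/boolean.true)"),
    }

root = etree.fromstring(SAMPLE)
elements = [process_element(e) for e in root.xpath("/root/element")]
print(elements)
```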
