Parsing Large XML Files with Python lxml and iterparse
I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items.
My file is of the format:
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
    <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
    <item>http://www.url2.com</item>
  </url>
</item>
and so far my solution is:
from lxml import etree

context = etree.iterparse(MYFILE, tag='item')
for event, elem in context:
    print(elem.xpath('desc/text()'))
    elem.clear()
    # drop already-processed siblings so memory use stays flat
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context
When I run it, I get something similar to:
[]
['description1']
[]
['description2']
The empty lists appear because iterparse also yields the item tags nested under the url tag, and those obviously have no description field for the XPath to extract. My hope was to parse the items one at a time and then process their child fields as required. I'm just learning the lxml library, so I'm curious whether there is a way to pull out the main items while leaving any nested items alone when they are encountered.
The entire XML document is parsed by the core implementation regardless. etree.iterparse is just a generator-style view that provides simple filtering by tag name (see the docstring: http://lxml.de/api/lxml.etree.iterparse-class.html).
If you want more complex filtering, you have to do it yourself.
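As a rough illustration of filtering by hand, the sketch below keeps the plain tag='item' filter and simply skips any item whose parent is a url element. The <items> root element, the XML sample data, and the function name top_level_descriptions are assumptions made for the example, not part of the question.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample data; the <items> root is assumed so the input is well-formed.
XML = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc><url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc><url><item>http://www.url2.com</item></url></item>
</items>"""

def top_level_descriptions(source):
    """Collect <desc> text from items that are not nested inside <url>."""
    found = []
    # Default events are ("end",), so each item arrives fully parsed.
    for _event, elem in etree.iterparse(source, tag="item"):
        parent = elem.getparent()
        if parent is not None and parent.tag == "url":
            continue  # nested item: leave it in place for its enclosing item
        found.append(elem.findtext("desc"))
        elem.clear()  # free the processed subtree
    return found

descriptions = top_level_descriptions(BytesIO(XML))
print(descriptions)
```

This only works when the nested items can be recognised from their immediate parent; deeper or more varied nesting needs a more general approach.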
One solution: register for the start event as well (events=('start', 'end')) and keep a flag so you know whether an end event closes a top-level "item" or a nested "item/url/item".
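A minimal sketch of that idea follows. It uses a depth counter in place of a single boolean, which is a small substitution that also copes with more than one level of nesting. The <items> root element, the SAMPLE data, and the helper name iter_top_level_items are assumptions for illustration only.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample data; the <items> root is assumed so the input is well-formed.
SAMPLE = b"""<items>
  <item>
    <title>Item 1</title>
    <desc>Description 1</desc>
    <url><item>http://www.url1.com</item></url>
  </item>
  <item>
    <title>Item 2</title>
    <desc>Description 2</desc>
    <url><item>http://www.url2.com</item></url>
  </item>
</items>"""

def iter_top_level_items(source, tag="item"):
    """Yield only the outermost <tag> elements, each once it is fully parsed."""
    depth = 0
    for event, elem in etree.iterparse(source, events=("start", "end"), tag=tag):
        if event == "start":
            depth += 1  # entering an item, nested or not
            continue
        depth -= 1      # leaving an item
        if depth == 0:  # only the outermost item closes at depth 0
            yield elem
            # Runs when the caller asks for the next item: free memory.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

results = []
for item in iter_top_level_items(BytesIO(SAMPLE)):
    # Read what you need inside the loop; the element is cleared afterwards.
    results.append(
        (item.findtext("title"), item.findtext("desc"), item.findtext("url/item"))
    )
print(results)
```

Because each element is cleared as soon as the loop moves on, any values needed later should be copied out inside the loop body rather than by keeping references to the elements.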