Using Python iterparse to process large XML files
I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.
My file is of the format:
<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')
for event, elem in context:
    print(elem.xpath('desc/text()'))
del context
Unfortunately though, this solution is still eating up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up the empty children. Can anyone offer some suggestions on what I might do after processing my data to properly clean up?
Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings. Daly's article is an excellent read, especially if you are processing large XML files.

Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed. The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while the mod_fast_iter does delete it, thus saving more memory.
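For reference, the modified fast_iter pattern can be sketched roughly like this (a sketch of the widely circulated version; the sample XML wrapper and the lambda handler are illustrative, not part of the original article):

```python
from io import BytesIO
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """Apply func to each element, then free the element and its
    already-processed preceding siblings so memory stays bounded."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()                      # drop the element's children and text
        while elem.getprevious() is not None:
            del elem.getparent()[0]       # drop processed preceding siblings too
    del context

# usage with the OP's snippet (wrapped in a root element so it is well-formed)
xml = (b"<root>"
       b"<item><title>Item 1</title><desc>Description 1</desc></item>"
       b"<item><title>Item 2</title><desc>Description 2</desc></item>"
       b"</root>")

descriptions = []
context = etree.iterparse(BytesIO(xml), tag='item')
fast_iter(context, lambda elem: descriptions.append(elem.findtext('desc')))
print(descriptions)  # ['Description 1', 'Description 2']
```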
iterparse() lets you do stuff while building the tree, which means that unless you remove what you no longer need, you'll still end up with the whole tree in the end.

For more information: read this by the author of the original ElementTree implementation (it also applies to lxml).
Why don't you use the "callback" approach of SAX?
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ...  # process record elements
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element
event, root = next(context)
So this is a question of incremental parsing. This link can give you a detailed answer; for a summarized answer you can refer to the above.
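Putting those two fragments together, a complete version of the pattern might look like this (a sketch using the stdlib ElementTree and the OP's snippet wrapped in a root element; variable names are illustrative):

```python
from io import BytesIO
from xml.etree.ElementTree import iterparse

xml = (b"<root>"
       b"<item><title>Item 1</title><desc>Description 1</desc></item>"
       b"<item><title>Item 2</title><desc>Description 2</desc></item>"
       b"</root>")

# get an iterable with start events enabled, and turn it into an iterator
context = iter(iterparse(BytesIO(xml), events=("start", "end")))

# the first start event delivers the root element
event, root = next(context)

descriptions = []
for event, elem in context:
    if event == "end" and elem.tag == "item":
        descriptions.append(elem.findtext("desc"))
        root.clear()   # free the processed item so the tree stays small
print(descriptions)  # ['Description 1', 'Description 2']
```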
In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: it goes well for some time, then suddenly the memory consumption goes through the roof and a memory error occurs or the system crashes. If you encounter the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh or the following example using the OP's XML snippet (plus two umlauts for checking that there are no encoding issues).
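A minimal sketch of the expat approach, using the OP's snippet plus two umlauts (the stdlib xml.parsers.expat module; the handler class and names are illustrative, not the answer's original script):

```python
import xml.parsers.expat

class DescExtractor:
    """Stream-parse with expat callbacks; only desc text is kept."""
    def __init__(self):
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data
        self.in_desc = False
        self.buffer = []
        self.descriptions = []

    def start_element(self, name, attrs):
        if name == "desc":
            self.in_desc = True
            self.buffer = []

    def char_data(self, data):
        # may be called several times per text node, so accumulate
        if self.in_desc:
            self.buffer.append(data)

    def end_element(self, name):
        if name == "desc":
            self.descriptions.append("".join(self.buffer))
            self.in_desc = False

xml_data = ("<root>"
            "<item><title>Item 1</title><desc>Description 1 \u00e4</desc></item>"
            "<item><title>Item 2</title><desc>Description 2 \u00fc</desc></item>"
            "</root>")

extractor = DescExtractor()
extractor.parser.Parse(xml_data, True)
print(extractor.descriptions)  # ['Description 1 ä', 'Description 2 ü']
```

Like SAX, expat never builds a tree, so memory use stays flat no matter how large the input is.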
The only problem with the root.clear() method is that it leaves you with NoneTypes. This means you can't, for instance, edit the data you parse with string methods like replace() or title(). That said, it is an optimum method to use if you're just parsing the data as is.