lxml memory usage when parsing huge XML in Python



I am a Python newbie. I am trying to parse a huge XML file in my Python module using lxml. In spite of clearing the elements at the end of each loop, my memory still shoots up and crashes the application. I am sure I am missing something here. Please help me figure out what that is.

Following are the main functions I am using -

from lxml import etree
def parseXml(context,attribList):
    for _, element in context:
        fieldMap={}
        rowList=[]
        readAttribs(element,fieldMap,attribList)
        readAllChildren(element,fieldMap,attribList,rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element,fieldMap,attribList):
    for attrib in attribList:
        fieldMap[attrib]=element.get(attrib,'')

def readAllChildren(element,fieldMap,attribList,rowList):
    for childElem in element:
        readAttribs(childElem,fieldMap,attribList)
        if len(childElem) > 0:
           readAllChildren(childElem,fieldMap,attribList,rowList)
        rowList.append(fieldMap.copy())
        childElem.clear()

def main():
    attribList=['name','age','id']
    context=etree.iterparse(fullFilePath, events=("start",))
    for row in parseXml(context,attribList):
        print row 

Thanks!!

Example XML and the nested dictionary -

<root xmlns='NS'>
        <Employee Name="Mr.ZZ" Age="30">
            <Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
                    <Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
                            <Project Name="ABC_1" Team="4">
                            </Project>
                    </Employment>
                    <Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
                        <PromotionStatus>Manager</PromotionStatus>
                            <Project Name="XYZ_1" Team="7">
                                <Award>Star Team Member</Award>
                            </Project>
                    </Employment>
            </Experience>
        </Employee>
</root>

ELEMENT_NAME='element_name'
ELEMENTS='elements'
ATTRIBUTES='attributes'
TEXT='text'
xmlDef={ 'namespace' : 'NS',
           'content' :
           { ELEMENT_NAME: 'Employee',
             ELEMENTS: [{ELEMENT_NAME: 'Experience',
                         ELEMENTS: [{ELEMENT_NAME: 'Employment',
                                     ELEMENTS: [{
                                                 ELEMENT_NAME: 'PromotionStatus',
                                                 ELEMENTS: [],
                                                 ATTRIBUTES:[],
                                                 TEXT:['PromotionStatus']
                                               },
                                               {
                                                 ELEMENT_NAME: 'Project',
                                                 ELEMENTS: [{
                                                            ELEMENT_NAME: 'Award',
                                                            ELEMENTS: {},
                                                            ATTRIBUTES:[],
                                                            TEXT:['Award']
                                                            }],
                                                 ATTRIBUTES:['Name','Team'],
                                                 TEXT:[]
                                               }],
                                     ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
                                     TEXT:[]
                                    }],
                         ATTRIBUTES: ['TotalYears','StartDate','EndDate'],
                         TEXT:[]
                         }],
             ATTRIBUTES: ['Name','Age'],
             TEXT:[]
           }
         }

1 Answer

ま昔日黯然 2024-11-23 11:57:15


Welcome to Python and Stack Overflow!

It looks like you've followed some good advice looking at lxml and especially etree.iterparse(..), but I think your implementation is approaching the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead processing tags as they get read in. Your readAllChildren(..) function is saving everything to rowList, which grows and grows to cover the whole document tree. I made a few changes to show what's going on:

from lxml import etree
def parseXml(context,attribList):
    for event, element in context:
        print "%s element %s:" % (event, element)
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib,'')
    print "fieldMap:", fieldMap

def readAllChildren(element, fieldMap, attribList, rowList):
    for childElem in element:
        print "Found child:", childElem
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
           readAllChildren(childElem, fieldMap, attribList, rowList)
        rowList.append(fieldMap.copy())
        print "len(rowList) =", len(rowList)
        childElem.clear()

def process_xml_original(xml_file):
    attribList=['name','age','id']
    context=etree.iterparse(xml_file, events=("start",))
    for row in parseXml(context,attribList):
        print "Row:", row

Running with some dummy data:

>>> from cStringIO import StringIO
>>> test_xml = """\
... <family>
...     <person name="somebody" id="5" />
...     <person age="45" />
...     <person name="Grandma" age="62">
...         <child age="35" id="10" name="Mom">
...             <grandchild age="7 and 3/4" />
...             <grandchild id="12345" />
...         </child>
...     </person>
...     <something-completely-different />
... </family>
... """
>>> process_xml_original(StringIO(test_xml))
start element: <Element family at 0x105ca58>
fieldMap: {'age': '', 'name': '', 'id': ''}
Found child: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
len(rowList) = 1
Found child: <Element person at 0x105c468>
fieldMap: {'age': '45', 'name': '', 'id': ''}
len(rowList) = 2
Found child: <Element person at 0x105c7b0>
fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
Found child: <Element child at 0x106e468>
fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
Found child: <Element grandchild at 0x106e148>
fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
len(rowList) = 3
Found child: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': '12345'}
len(rowList) = 4
len(rowList) = 5
len(rowList) = 6
Found child: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
len(rowList) = 7
Row: {'age': '', 'name': 'somebody', 'id': '5'}
Row: {'age': '45', 'name': '', 'id': ''}
Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c7b0>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element child at 0x106e468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e148>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}

It's a little hard to read, but you can see that on the first pass it walks the whole tree down from the root tag, building up rowList for every element in the entire document. You'll also notice it doesn't even stop there: since the element.clear() call comes after the yield statement in parseXml(..), it isn't executed until the second iteration (i.e. when the generator resumes for the next element in the tree).
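
If you want to keep your generator structure, one small tweak (just a sketch of that single fix; it does not address the rowList growth, which is the bigger issue covered next) is to do the cleanup before yielding, so it no longer depends on when the caller resumes the generator:

def parseXml(context, attribList):
    for _, element in context:
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        # clear *before* yielding, so cleanup doesn't wait for the
        # caller to ask for the next row
        element.clear()
        for row in rowList:
            yield row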

Incremental processing FTW

A simple fix is to let iterparse(..) do its job: parse iteratively! The following will pull the same information and process it incrementally instead:

def do_something_with_data(data):
    """This just prints it out. Yours will probably be more interesting."""
    print "Got data: ", data

def process_xml_iterative(xml_file):
    # by using the default 'end' event, you start at the _bottom_ of the tree
    ATTRS = ('name', 'age', 'id')
    for event, element in etree.iterparse(xml_file):
        print "%s element: %s" % (event, element)
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()
        del element # for extra insurance

Running on the same dummy XML:

>>> print test_xml
<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
            <grandchild id="12345" />
        </child>
    </person>
    <something-completely-different />
</family>
>>> process_xml_iterative(StringIO(test_xml))
end element: <Element person at 0x105cc10>
Got data:  {'age': u'', 'name': 'somebody', 'id': '5'}
end element: <Element person at 0x106e468>
Got data:  {'age': '45', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e148>
Got data:  {'age': '7 and 3/4', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e490>
Got data:  {'age': u'', 'name': u'', 'id': '12345'}
end element: <Element child at 0x106e508>
Got data:  {'age': '35', 'name': 'Mom', 'id': '10'}
end element: <Element person at 0x106e530>
Got data:  {'age': '62', 'name': 'Grandma', 'id': u''}
end element: <Element something-completely-different at 0x106e558>
Got data:  {'age': u'', 'name': u'', 'id': u''}
end element: <Element family at 0x105c6e8>
Got data:  {'age': u'', 'name': u'', 'id': u''}

This should greatly improve both the speed and memory performance of your script. Also, by hooking the 'end' event, you're free to clear and delete elements as you go, rather than waiting until all children have been processed.
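
One more memory-saving trick that pairs well with this (just a sketch, and the function name here is mine): element.clear() empties an element but leaves the now-empty node attached to its parent, so on a very large file the root still accumulates a long list of empty children. Since 'end' events mean every preceding sibling has already been handled, you can safely drop them as you go:

def process_xml_iterative_trimmed(xml_file):
    """Like process_xml_iterative, but also drops already-processed siblings."""
    ATTRS = ('name', 'age', 'id')
    for event, element in etree.iterparse(xml_file):
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()
        # every earlier sibling has already fired its 'end' event,
        # so delete them to keep the root from growing
        while element.getprevious() is not None:
            del element.getparent()[0]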

Depending on your dataset, it might be a good idea to only process certain types of elements. The root element, for one, probably isn't very meaningful, and other nested elements may also fill your dataset with a lot of {'age': u'', 'id': u'', 'name': u''}.
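
lxml can also do that filtering for you: iterparse(..) takes a tag argument, so only matching elements are handed to your loop. A sketch against your sample file (note the sample declares xmlns='NS', so the tag is namespace-qualified, and its attributes are capitalized Name/Age plus lowercase id):

EMPLOYMENT_TAG = '{NS}Employment'   # namespace-qualified because of xmlns='NS'

def process_employments(xml_file):
    attrs = ('id', 'StartDate', 'EndDate')
    for event, element in etree.iterparse(xml_file, tag=EMPLOYMENT_TAG):
        data = {}
        for attr in attrs:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()

The tag filter only changes which elements reach your loop; the rest of the tree is still built up, so combine it with the clear()/sibling-deletion above for really large files.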


Or, with SAX

As an aside, when I read "XML" and "low-memory" my mind always jumps straight to SAX, which is another way you could attack this problem. Using the builtin xml.sax module:

import xml.sax

class AttributeGrabber(xml.sax.handler.ContentHandler):
    """SAX Handler which will store selected attribute values."""
    def __init__(self, target_attrs=()):
        self.target_attrs = target_attrs

    def startElement(self, name, attrs):
        print "Found element: ", name
        data = {}
        for target_attr in self.target_attrs:
            data[target_attr] = attrs.get(target_attr, u"")

        # (no xml trees or elements created at all)
        do_something_with_data(data)

def process_xml_sax(xml_file):
    grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
    xml.sax.parse(xml_file, grabber)

You'll have to evaluate both options based on what works best in your situation (and maybe run a couple benchmarks, if this is something you'll be doing often).
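
If you do benchmark, something as small as this gives a rough comparison (a sketch using the standard library's resource module, which is Unix-only; ru_maxrss is kilobytes on Linux and bytes on OS X, and 'big.xml' is just a placeholder for your file). Run one parser per interpreter invocation so the peak-memory figures don't mix:

import resource
import time

def benchmark(parse_func, xml_path):
    start = time.time()
    parse_func(xml_path)
    elapsed = time.time() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "%.2fs elapsed, peak RSS %d" % (elapsed, peak)

# e.g., in two separate runs:
#   benchmark(process_xml_iterative, 'big.xml')
#   benchmark(process_xml_sax, 'big.xml')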


Be sure to follow up with how things work out!


Edit based on follow-up comments

Implementing either of the above solutions may require some changes to the overall structure of your code, but anything you have should still be doable. For instance, to process "rows" in batches, you could have:

def process_xml_batch(xml_file, batch_size=10):
    ATTRS = ('name', 'age', 'id')
    batch = []
    for event, element in etree.iterparse(xml_file):
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        batch.append(data)
        element.clear()
        del element

        if len(batch) == batch_size:
            do_something_with_batch(batch)
            # Or, if you want this to be a generator:
            # yield batch
            batch = []
    if batch:
        # there are leftover items
        do_something_with_batch(batch) # Or, yield batch
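
If you take the generator route (yield batch instead of calling do_something_with_batch(..) inside the loop), the calling code stays simple; 'big.xml' is again just a placeholder:

# assuming process_xml_batch was changed to `yield batch`:
for batch in process_xml_batch('big.xml', batch_size=500):
    do_something_with_batch(batch)   # e.g. one bulk insert per batch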