使用 Python Iterparse 处理大型 XML 文件

发布于 2024-12-01 06:16:21 字数 659 浏览 4 评论 0原文

我需要用 Python 编写一个解析器，它可以在没有太多内存（只有 2 GB）的计算机上处理一些非常大的文件（> 2 GB）。我想在 lxml 中使用 iterparse 来做到这一点。

我的文件格式为：

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

到目前为止，我的解决方案是：

from lxml import etree

context = etree.iterparse( MYFILE, tag='item' )

for event, elem in context :
      print elem.xpath( 'description/text( )' )

del context

不幸的是，这个解决方案仍然占用了大量内存。我认为问题是在处理每个“ITEM”之后我需要做一些事情来清理空的孩子。谁能就处理数据以正确清理后我可以做什么提供一些建议？

原文

I need to write a parser in Python that can process some extremely large files ( > 2 GB ) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse( MYFILE, tag='item' )

for event, elem in context :
      print elem.xpath( 'description/text( )' )

del context

Unfortunately though, this solution is still eating up a lot of memory. I think the problem is that after dealing with each "ITEM" I need to do something to cleanup empty children. Can anyone offer some suggestions on what I might do after processing my data to properly cleanup?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

剑心龙吟 2024-12-08 06:16:21

尝试 Liza Daly 的 fast_iter 。处理完元素 elem 后，它会调用 elem.clear() 来删除后代，并删除前面的同级元素。

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    print elem.xpath( 'description/text( )' )

context = etree.iterparse( MYFILE, tag='item' )
fast_iter(context,process_element)

Daly 的文章非常值得一读，特别是当您正在处理大型 XML 文件时。

编辑：上面发布的 fast_iter 是 Daly 的 fast_iter 的修改版本。处理完一个元素后，它会更积极地删除不再需要的其他元素。

下面的脚本显示了行为上的差异。请特别注意，orig_fast_iter 不会删除 A1 元素，而 mod_fast_iter 会删除它，从而节省更多内存。

import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    return content


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    print elem.xpath( 'description/text( )' )

context = etree.iterparse( MYFILE, tag='item' )
fast_iter(context,process_element)

Daly's article is an excellent read, especially if you are processing large XML files.

Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while the mod_fast_iter does delete it, thus saving more memory.

import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    return content


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()

回复收藏 0 原文

归途 2024-12-08 06:16:21

iterparse() 允许您在构建树时做一些事情，这意味着除非您删除不再需要的内容，否则您最终仍然会得到整个树到底。

有关更多信息：请阅读 this 由原始 ElementTree 实现的作者编写（但它也适用于 lxml）

回复收藏 0 原文

笑叹一世浮沉 2024-12-08 06:16:21

为什么不使用 sax 的“回调”方法？

回复收藏 0 原文

无敌元气妹 2024-12-08 06:16:21

请注意，iterparse 仍然会构建一棵树，就像 parse 一样，但您可以在解析时安全地重新排列或删除树的部分内容。例如，要解析大文件，您可以在处理完元素后立即删除它们：

for event, elem in iterparse(source)：如果 elem.tag == "记录": ...处理记录元素... elem.clear()
上述模式有一个缺点：它不会清除根元素，因此您最终会得到一个带有许多空子元素的单个元素。如果您的文件很大，而不仅仅是大，这可能是一个问题。要解决这个问题，您需要掌握根元素。最简单的方法是启用启动事件，并保存对变量中第一个元素的引用：

获取可迭代的

context = iterparse(source, events=("start", "end"))

将其转为迭代器

context = iter(context)

获取根元素

event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

所以这是一个增量解析的问题，此链接可以给你详细的答案对于总结的答案你可以参考上面

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source): if elem.tag == "record": ... process record elements ... elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

get an iterable

context = iterparse(source, events=("start", "end"))

turn it into an iterator

context = iter(context)

get the root element

event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

So this is a question of Incremental Parsing , This link can give you detailed answer for summarized answer you can refer the above

回复收藏 0 原文

所谓喜欢 2024-12-08 06:16:21

根据我的经验，使用或不使用 element.clear 进行 iterparse （请参阅 F. Lundh 和 L. Daly）并不总是能够处理非常大的 XML 文件：在一段时间内运行良好，突然内存消耗急剧增加，出现内存错误或系统崩溃。如果您遇到同样的问题，也许您可以使用相同的解决方案：expat 解析器。另请参阅 F. Lundh 或使用 OP 的 XML 片段的以下示例（加上两个元音变音以检查是否存在编码问题）：

import xml.parsers.expat
from collections import deque

def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        nonlocal in_cdata
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding = 'utf-8') as outfile:
            parser.ParseFile(infile)

iter_xml('input.xml', 'output.txt')

input.xml：

<root>
    <item>
    <title>Item 1</title>
    <desc>Description 1ä</desc>
    </item>
    <item>
    <title>Item 2</title>
    <desc>Description 2ü</desc>
    </item>
</root>

output.txt：

Description 1ä
Description 2ü

In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: It goes well for some time, suddenly the memory consumption goes through the roof and a memory error occurs or the system crashes. If you encounter the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh or the following example using OP’s XML snippet (plus two umlaute for checking that there are no encoding issues):

import xml.parsers.expat
from collections import deque

def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        nonlocal in_cdata
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding = 'utf-8') as outfile:
            parser.ParseFile(infile)

iter_xml('input.xml', 'output.txt')

input.xml:

<root>
    <item>
    <title>Item 1</title>
    <desc>Description 1ä</desc>
    </item>
    <item>
    <title>Item 2</title>
    <desc>Description 2ü</desc>
    </item>
</root>

output.txt: