Iterating with SAX

Posted on 2024-12-04 01:22:12


I have an XML file like this (just an example):

<xml>
  <page>
    <lol>
    </lol>
    <lel>
    </lel>
  </page>
  <page>
    <lol>
    </lol>
    <lel>
    </lel>
  </page>
  <page>
    <lol>
    </lol>
    <lel>
    </lel>
  </page>
</xml>

I need a way to do something like this:

#Sax code

for page in something:
  parse(page)

How can I do this with SAX?

The XML file contains 30 GB of data.


淡笑忘祈一世凡恋 2024-12-11 01:22:12

Do not use SAX, use ElementTree instead:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

for event, elem in ET.iterparse("/path/to/your/file"):
    if elem.tag == 'page':
        # do your processing
        elem.clear()

The elem.clear() call is important: without it, every processed element stays in memory and the parse eventually consumes all your RAM. Element objects are lightweight, DOM-like objects, so they are much easier to work with than SAX callbacks.

If the individual page elements are already too large to fit in memory, you will have to fall back to SAX, but I assume from your example that there are many small page elements rather than a few large ones.
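
The loop above can be wrapped in a generator to get the `for page in something` shape the question asks for. This is a minimal sketch, assuming Python 3 and the standard-library `xml.etree.ElementTree`; `iter_pages` and its `tag` parameter are hypothetical names, not part of any library. It also assumes the `page` elements are direct children of the root, as in the example. Because `elem.clear()` alone leaves one empty `page` element attached to the root per processed page, the sketch clears the root instead.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(source, tag="page"):
    """Yield each complete <tag> element from source, freeing it afterwards."""
    # Request "start" events too, so the root element can be captured first.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Detach everything parsed so far from the root; otherwise one
            # empty element per page would accumulate over the whole file.
            root.clear()


# Small in-memory demonstration; pass a file path for real data.
sample = b"<xml><page><lol>a</lol></page><page><lol>b</lol></page></xml>"
texts = [page.findtext("lol") for page in iter_pages(io.BytesIO(sample))]
print(texts)
```

Each yielded element is only valid until the next iteration, so anything needed later should be copied out of it inside the loop body.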


油饼 2024-12-11 01:22:12

The most efficient and Pythonic way to do this with xml.sax is to use the parser.feed() method.

Example:

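import xml.sax

parser = xml.sax.make_parser()
# YourContentHandler is your own xml.sax.ContentHandler subclass; pass an instance.
parser.setContentHandler(YourContentHandler())

# Open in binary mode so the parser handles the encoding declaration itself.
with open('terribly_large.xml', 'rb') as f:
    for line in f:  # xreadlines() no longer exists in Python 3
        parser.feed(line)
parser.close()  # signal end of input so endDocument() is called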

This ensures that you're both incrementally reading the file, and incrementally parsing it.

The resulting memory footprint should be minimal.
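
`YourContentHandler` is left undefined above. As one possible shape for it, here is a minimal sketch, assuming Python 3; `PageHandler` and its `on_page` callback are hypothetical names chosen for illustration. It collects the text of each direct child of a `page` element into a dict and hands that dict to the callback when the page closes, so only one page is held in memory at a time.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collect the text of each child of <page> and report it per page."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page   # called with a dict once per </page>
        self.page = None         # dict for the page being read, else None
        self.field = None        # name of the child element being read
        self.parts = []          # text fragments of the current child

    def startElement(self, name, attrs):
        if name == "page":
            self.page = {}
        elif self.page is not None and self.field is None:
            self.field, self.parts = name, []

    def characters(self, content):
        # May be called several times per text node, so accumulate.
        if self.field is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == "page":
            self.on_page(self.page)
            self.page = None
        elif name == self.field:
            self.page[self.field] = "".join(self.parts)
            self.field = None


pages = []
parser = xml.sax.make_parser()
parser.setContentHandler(PageHandler(pages.append))
# Feed arbitrary chunks, as the file loop above would.
for chunk in (b"<xml><page><lol>a</lol><le", b"l>b</lel></page></xml>"):
    parser.feed(chunk)
parser.close()
print(pages)
```

The callback style inverts control compared with a `for page in ...` loop: the processing code runs inside `on_page` rather than in the loop body. Text from elements nested deeper than one level below `page` is flattened into the enclosing child's entry.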
