What is a good XML stream parser for Python?

Posted on 2024-12-08 21:29:02

Comments (3)

横笛休吹塞上声 2024-12-15 21:29:02

A related answer covers using xml.etree.ElementTree.iterparse on huge XML files; lxml has the same method. The key to stream parsing with iterparse is to manually clear and remove the nodes you have already processed, because otherwise you will eventually run out of memory.
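
As a rough illustration of that clearing step, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. The sample document and the helper name iter_entries are assumptions made for this example, and clearing the root after every record is one common idiom rather than the only way to free memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, used only for this illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<root>
  <entry><a>value 0</a><b foo="bar" /></entry>
  <entry><a>value 1</a><b foo="baz" /></entry>
  <entry><a>value 2</a><b foo="quz" /></entry>
</root>
"""


def iter_entries(source):
    """Yield (text of <a>, foo attribute of <b>) for each <entry>."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "entry":
            yield elem.findtext("a"), elem.find("b").get("foo")
            # Detach everything already attached to the root so the
            # in-memory tree does not keep growing with the input.
            root.clear()


if __name__ == "__main__":
    for a, foo in iter_entries(io.BytesIO(SAMPLE)):
        print(a, foo)
```

Writing it as a generator lets the caller consume one record at a time; any file-like object opened in binary mode can be passed in place of the io.BytesIO wrapper.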

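Another option is xml.sax. I find the official manual too formal and short on examples, so it deserves some clarification alongside the question. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. In other words, xml.sax.make_parser() gives you a suitable stream parser.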

For instance, given an XML stream like this:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <entry><a>value 0</a><b foo='bar' /></entry>
  <entry><a>value 1</a><b foo='baz' /></entry>
  <entry><a>value 2</a><b foo='quz' /></entry>
  ...
</root>

It can be handled in the following way.

#!/usr/bin/env python3

import xml.sax
import xml.sax.handler


class StreamHandler(xml.sax.handler.ContentHandler):

  def __init__(self):
    super().__init__()
    self.lastEntry = None  # dict for the <entry> currently being read
    self.lastName = None   # name of the child element currently open

  def startElement(self, name, attrs):
    if name == 'entry':
      self.lastEntry = {}
    elif self.lastEntry is not None:
      self.lastName = name
      self.lastEntry[name] = {'attrs': attrs, 'content': ''}

  def endElement(self, name):
    if name == 'entry':
      print({
        'a' : self.lastEntry['a']['content'],
        'b' : self.lastEntry['b']['attrs'].getValue('foo')
      })
      self.lastEntry = None
    # text that follows a closing tag belongs to no child element
    self.lastName = None

  def characters(self, content):
    if self.lastEntry is not None and self.lastName is not None:
      self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
  # use default ``xml.sax.expatreader``
  parser = xml.sax.make_parser()
  parser.setContentHandler(StreamHandler())
  # feed the parser with small chunks to simulate a stream;
  # binary mode lets the parser honour the declared encoding
  with open('data.xml', 'rb') as f:
    while True:
      buffer = f.read(16)
      if not buffer:
        break  # end of input, otherwise the loop would never terminate
      parser.feed(buffer)
  parser.close()  # tell the parser that the document is complete
  # if you can provide a file-like object it's as simple as
  with open('data.xml', 'rb') as f:
    parser.parse(f)

浪漫人生路 2024-12-15 21:29:02

Are you looking for xml.sax? It's right in the standard library.
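
For a sense of how little is needed to get started, here is a minimal sketch that relies only on the standard library. The TagCounter handler and the count_tags helper are hypothetical names made up for this illustration; xml.sax.parseString and ContentHandler are the real standard-library pieces.

```python
import xml.sax
import xml.sax.handler


class TagCounter(xml.sax.handler.ContentHandler):
    """Hypothetical handler that counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(xml_bytes):
    handler = TagCounter()
    # parseString() handles an in-memory document; xml.sax.parse()
    # takes a filename or file-like object instead.
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


if __name__ == "__main__":
    print(count_tags(b"<root><entry/><entry/></root>"))
```

The handler only sees events as they arrive, so the whole document is never held in memory as a tree.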

江城子 2024-12-15 21:29:02

Use xml.etree.cElementTree. It's much faster than xml.etree.ElementTree. (On Python 3.3 and later, xml.etree.ElementTree picks up the C accelerator automatically, and xml.etree.cElementTree was removed in Python 3.9.) Neither of them is broken. Your files are broken (see my answer to your other question).
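
For code that has to run on both old and new interpreters, a common pattern is a fallback import. This is only a sketch; the sample document below is made up for the illustration.

```python
try:
    # Only importable on Python versions older than 3.9.
    import xml.etree.cElementTree as ET
except ImportError:
    # Modern Python: the plain module already uses the C accelerator.
    import xml.etree.ElementTree as ET

# Hypothetical sample document, used only to show that either import works.
root = ET.fromstring("<root><entry><a>value 0</a></entry></root>")
print(root.findtext("entry/a"))
```

Either branch exposes the same API under the name ET, so the rest of the program does not need to care which module was loaded.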
