Processing a large XML file chunk by chunk with libxml-ruby

Posted 2024-08-16 17:43:41


I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read
        case dblp.name
        when 'article', 'inproceedings', 'book'
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore these record types for now
          dblp.next
        else
          # not a record element; keep reading
        end
      end
    end

The key here is that dblp.expand reads an entire subtree (like an <article> record), which is then passed as an argument to a factory for further processing. Is this the right approach?

Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?

    def first(root, node)
      x = root.find(node).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages') # node contains the expanded node from dblp.expand
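For reference, the factory implied by these snippets might look roughly like the sketch below; Pub and the fields other than pages are assumptions for illustration, not something stated in the question:

    # Hypothetical shape of the factory; Pub and the fields other than
    # 'pages' are assumptions for illustration only.
    class PubFactory
      def create(node)
        pub = Pub.new
        pub.title = first(node, 'title')  # assumed field
        pub.pages = first(node, 'pages')
        pub.year  = first(node, 'year')   # assumed field
        pub
      end

      private

      # Same helper as above: content of the first matching element, or nil.
      def first(root, xpath)
        x = root.find(xpath).first
        x ? x.content : nil
      end
    end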


3 Answers

俏︾媚 2024-08-23 17:43:41


When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:

  • Push parsers like SAX, where you react to tags as you encounter them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think that push parsers are nice to use if you only want to retrieve a few fields, but they are generally messy to use for complex data extraction and often end up implemented with big case ... when ... constructs.

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
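To illustrate the pull style, here is a rough sketch using REXML's PullParser (REXML rather than the libxml Reader from the question; dblp.xml and the article element come from the question):

    require 'rexml/parsers/pullparser'

    # Pull a cursor through the event stream; only the current event is
    # held in memory. A sketch of the style, not a drop-in replacement.
    parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
    count = 0
    while parser.has_next?
      event = parser.pull
      # For a start_element event, event[0] is the tag name and
      # event[1] is the attribute hash.
      count += 1 if event.start_element? && event[0] == 'article'
    end
    puts "#{count} <article> records"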

花桑 2024-08-23 17:43:41


When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads in the entire XML document and can consume a large amount of memory. The event-based approach uses little additional memory but doesn't do anything unless you write your own handler logic.

The event-based model is employed by SAX-style parsers and derivative implementations.

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
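For a concrete picture, a minimal event-based handler with REXML's stream API might look like this (a sketch; the element names come from the question, and the counting is just an illustration):

    require 'rexml/document'
    require 'rexml/streamlistener'

    # Minimal SAX-style listener: REXML invokes tag_start as it streams
    # through the document, so no tree is ever built in memory.
    class DblpListener
      include REXML::StreamListener

      attr_reader :count

      def initialize
        @count = 0
      end

      def tag_start(name, attrs)
        @count += 1 if %w[article inproceedings book].include?(name)
      end
    end

    listener = DblpListener.new
    REXML::Document.parse_stream(File.new('dblp.xml'), listener)
    puts "#{listener.count} records seen"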

素染倾城色 2024-08-23 17:43:41


I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like

    my_node = dblp.expand
    # ... do what you have to do with my_node ...
    dblp.next
    my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and that you have to call Node#remove! to free it.

Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
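Applied to the loop from the question, that would look roughly like this (a sketch showing only the matching branch):

    when 'article', 'inproceedings', 'book'
      node = dblp.expand         # expand ties the subtree to the reader
      pub  = pubFactory.create(node)
      puts pub
      dblp.next                  # move past the subtree first...
      node.remove!               # ...then detach it so libxml can free it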
