Processing a large XML file chunk by chunk with libxml-ruby

Posted 2024-08-16 17:43:41


I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read
        case dblp.name
        when 'article', 'inproceedings', 'book'
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore these record types for now
          dblp.next
        else
          # not a record element; keep reading
        end
      end
    end

The key here is that dblp.expand reads an entire subtree (like an <article> record), which is then passed as an argument to a factory for further processing. Is this the right approach?

Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?

    def first(root, node)
      x = root.find(node).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages') # node contains the expanded node from dblp.expand
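For reference, the factory implied by these snippets might look roughly like the sketch below; Pub and the fields other than pages are assumptions for illustration, not something stated in the question:

    # Hypothetical shape of the factory; Pub and the fields other than
    # 'pages' are assumptions for illustration only.
    class PubFactory
      def create(node)
        pub = Pub.new
        pub.title = first(node, 'title')  # assumed field
        pub.pages = first(node, 'pages')
        pub.year  = first(node, 'year')   # assumed field
        pub
      end

      private

      # Same helper as above: content of the first matching element, or nil.
      def first(root, xpath)
        x = root.find(xpath).first
        x ? x.content : nil
      end
    end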


3 Answers

俏︾媚 2024-08-23 17:43:41


When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:

  • Push parsers like SAX, where you react to tags as you encounter them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think that push parsers are nice to use if you only want to retrieve a few fields, but they are generally messy to use for complex data extraction and often end up implemented with big case ... when ... constructs.

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
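To illustrate the pull style, here is a rough sketch using REXML's PullParser (REXML rather than the libxml Reader from the question; dblp.xml and the article element come from the question):

    require 'rexml/parsers/pullparser'

    # Pull a cursor through the event stream; only the current event is
    # held in memory. A sketch of the style, not a drop-in replacement.
    parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
    count = 0
    while parser.has_next?
      event = parser.pull
      # For a start_element event, event[0] is the tag name and
      # event[1] is the attribute hash.
      count += 1 if event.start_element? && event[0] == 'article'
    end
    puts "#{count} <article> records"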

花桑 2024-08-23 17:43:41


When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads in the entire XML document and can consume a large amount of memory. The event-based approach uses little additional memory but doesn't do anything unless you write your own handler logic.

The event-based model is employed by SAX-style parsers and derivative implementations.

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
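For a concrete picture, a minimal event-based handler with REXML's stream API might look like this (a sketch; the element names come from the question, and the counting is just an illustration):

    require 'rexml/document'
    require 'rexml/streamlistener'

    # Minimal SAX-style listener: REXML invokes tag_start as it streams
    # through the document, so no tree is ever built in memory.
    class DblpListener
      include REXML::StreamListener

      attr_reader :count

      def initialize
        @count = 0
      end

      def tag_start(name, attrs)
        @count += 1 if %w[article inproceedings book].include?(name)
      end
    end

    listener = DblpListener.new
    REXML::Document.parse_stream(File.new('dblp.xml'), listener)
    puts "#{listener.count} records seen"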

素染倾城色 2024-08-23 17:43:41


I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like

    my_node = dblp.expand
    # ... do what you have to do with my_node ...
    dblp.next
    my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and that you have to call Node#remove! to free it.

Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
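Applied to the loop from the question, that would look roughly like this (a sketch showing only the matching branch):

    when 'article', 'inproceedings', 'book'
      node = dblp.expand         # expand ties the subtree to the reader
      pub  = pubFactory.create(node)
      puts pub
      dblp.next                  # move past the subtree first...
      node.remove!               # ...then detach it so libxml can free it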
