Parsing a very large XML file (1.4GB) with Ruby on Rails: is there a better way than SAXParser?

Posted 2024-09-01 21:08:39

Currently, I'm using LIBXML::SAXParser::Callbacks to parse a large XML file containing data for 140,000 products. I'm using a Rake task to import the data for these products into my Rails app.

My last import took just under 10 hours to complete:

rake asi:import_products --trace  26815.23s user 1393.03s system 80% cpu 9:47:34.09 total

The problem with the current implementation is that the complex dependency structure in the XML means I need to keep track of the entire product node to know how to parse it properly.

Ideally, I'd like to process each product node by itself and be able to use XPath on it. The file size rules out any method that requires loading the entire XML file into memory. I cannot control the format or size of the original XML. I have at most 3GB of memory available for the process.
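To make the goal concrete, here is a minimal sketch of what "one product node at a time, with XPath" could look like. It is not my current code. It uses REXML's stream listener only so that the example has no gem dependencies, and it assumes the repeated element is named `product`; the class `ProductListener`, the helper `each_product`, and the sample XML are all hypothetical. I have not measured how this would perform on the real 1.4GB file.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Builds an in-memory tree for one product element at a time and hands it
# to a block when its closing tag is seen. Nothing outside a product
# element is kept. The element name is an assumption about the feed.
class ProductListener
  include REXML::StreamListener

  def initialize(element_name, &on_product)
    @element_name = element_name
    @on_product = on_product
    @stack = [] # open elements of the current product; empty outside one
  end

  def tag_start(name, attrs)
    # Ignore everything until a product element opens.
    return if @stack.empty? && name != @element_name

    element = REXML::Element.new(name)
    element.add_attributes(attrs)
    @stack.last.add_element(element) unless @stack.empty?
    @stack.push(element)
  end

  def text(value)
    @stack.last.add_text(value) unless @stack.empty?
  end

  def cdata(content)
    # Treat CDATA sections inside a product as ordinary text.
    text(content)
  end

  def tag_end(_name)
    return if @stack.empty?

    element = @stack.pop
    # The stack is empty again once the product element itself has closed.
    @on_product.call(element) if @stack.empty?
  end
end

# Streams the document from io and yields each product subtree in turn.
def each_product(io, element_name: "product", &block)
  REXML::Document.parse_stream(io, ProductListener.new(element_name, &block))
end

# Hypothetical sample data, not taken from the real feed.
sample = <<~XML
  <catalog>
    <product id="1"><name>Mug</name><variants><variant><sku>M-1</sku></variant><variant><sku>M-2</sku></variant></variants></product>
    <product id="2"><name>Pen</name><variants><variant><sku>P-1</sku></variant></variants></product>
  </catalog>
XML

each_product(StringIO.new(sample)) do |product|
  name = REXML::XPath.first(product, "name").text
  skus = REXML::XPath.match(product, "variants/variant/sku").map(&:text)
  puts "#{product.attributes["id"]}: #{name} (#{skus.join(", ")})"
end
```

The XPath expressions are relative to the yielded product element, so the dependency logic would see the whole product node without the rest of the file being held as a tree.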

Is there a better way than this?

Current Rake Task code:

Snippet of the XML file:
