Scaling an application which reads large XML files

Posted 2024-12-03 14:50:20


I have an application which periodically reads a large set of XML files (around 20-30), like once every 10 minutes. Each XML file is at least 40-100 MB in size. Once each XML has been read, a map is created out of the file, and then the map is passed across a processor chain (10-15 processors), each processor using the data, performing some filtering, writing to a database, etc.

Now the application is running in a 32-bit JVM, and there is no intention of moving to a 64-bit JVM right now. The memory footprint, as expected, is very high... nearing the threshold of a 32-bit JVM. For now, when we receive large files, we serialize the generated maps to disk and run at most 3-4 maps through the processor chain concurrently, since if we tried to process all the maps at the same time it would easily go OutOfMemory. Garbage collection overhead is also pretty high.

I have some ideas but wanted to see if there are some options which people have already tried/evaluated. So what are the options here for scaling this kind of application?
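For context, the "at most 3-4 maps in flight" throttle the question describes can be enforced with a fixed-size thread pool. This is only an illustrative sketch, not the asker's actual code; the class name, the limit of 3, and the instrumentation counters are all made up here:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPipeline {
    // Cap how many parsed maps run through the processor chain at once,
    // so total heap usage stays under the 32-bit JVM ceiling.
    private static final int MAX_IN_FLIGHT = 3;
    private final ExecutorService pool = Executors.newFixedThreadPool(MAX_IN_FLIGHT);

    // Counters only to demonstrate that the bound holds.
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    public void process(List<Runnable> chains) throws InterruptedException {
        for (Runnable chain : chains) {
            pool.execute(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // record observed concurrency
                try {
                    chain.run(); // parse one file -> map -> processor chain
                } finally {
                    inFlight.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

The pool size, rather than the number of incoming files, then determines peak memory, which is the trade-off the question already describes.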


Answers (3)

一笔一画续写前缘 2024-12-10 14:50:20


Yea, to parrot @aaray and @MeBigFatGuy, you want to use an event-based parser for this: the dom4j approach mentioned, or SAX or StAX.

As a simple example, that 100MB XML is consuming a minimum of 200MB of RAM if you load it wholesale, as each character is immediately expanded to a 16 bit character.
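As a sketch of what the event-based alternative looks like with the JDK's built-in StAX API: only the current event is held in memory, never the whole document. The `record` element and `id` attribute are made-up stand-ins for whatever the real files actually contain:

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxScan {
    // Pull out only the values we care about; everything else in the
    // file streams past without being materialized as objects.
    public static List<String> recordIds(Reader xml) throws XMLStreamException {
        List<String> ids = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(r.getLocalName())) {
                ids.add(r.getAttributeValue(null, "id"));
            }
        }
        r.close();
        return ids;
    }
}
```

Fed from a `FileReader` (or better, a buffered `InputStream`), memory use stays roughly constant regardless of whether the file is 40 MB or 400 MB.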

Next, any element tags that you're not using are going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes), and it's all wasted. If you're dealing with numbers, converting the raw string to a long will be a net win if the number is longer than 2 digits.

IF (and this is a BIG IF) you are using a lot of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure that if the string already exists in the JVM, it's shared. The downside of this is that it pollutes your PermGen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
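One caveat worth adding to the answer above: since Java 7, interned strings live on the main heap rather than in PermGen, so the pollution concern applies mainly to older JVMs. The canonicalization itself can be shown in a tiny (hypothetical) helper:

```java
public class InternDemo {
    // String.intern() returns one canonical instance per distinct value,
    // so repeated tag/attribute strings parsed from many files can share
    // a single object instead of thousands of equal copies.
    public static boolean canonicalized(String a, String b) {
        // a and b are assumed to be equal in content but distinct objects
        // (e.g. each freshly built by a parser from raw character data).
        return a.equals(b) && a.intern() == b.intern();
    }
}
```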

Have you considered being able to run the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command line XSL processors that you can use to pre-process the files to something perhaps more sane. It really depends on how much of the data that is coming in you're actually using.
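The answer suggests standalone command-line XSL processors, but the JDK also bundles an XSLT engine (`javax.xml.transform`), so the same pre-filtering can be sketched in-process. The `<debug>` element stripped here is purely hypothetical; the stylesheet is an identity transform minus the elements you never read:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltStrip {
    // Identity transform that copies everything except <debug> elements,
    // shrinking the document before the real parse ever sees it.
    public static final String DROP_DEBUG =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='debug'/>"
      + "</xsl:stylesheet>";

    public static String transform(String xml, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

For the 40-100 MB files in the question, an external processor as the answer suggests may still be preferable, since it keeps the transformation work (and its memory) outside the already-constrained 32-bit JVM.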

By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).

The flatter your internal structures, the cheaper they are in terms of memory. You actually have a small advantage running a 32-bit VM, since instance pointers are half the size. But still, when you're talking thousands or millions of nodes, it all adds up, and quickly.

南城旧梦 2024-12-10 14:50:20


We had a similar problem processing large XML files (around 400Mb). We greatly reduced the memory footprint of the application using this:

http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

自由如风 2024-12-10 14:50:20


You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.
