Parallel XML parsing in Java

I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.

The algorithm runs as part of an interactive process where only a few seconds of response time are acceptable, so I need to improve my strategy for handling the XML files.

  1. My process analyses the XML files (extracting only a few nodes).
  2. Extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes).

Now I'm thinking about a multithreaded solution (which would scale better on 16+ core hardware). I have considered the following strategies:

  1. Creating multiple parsers and running them in parallel on the xml sources.
  2. Rewriting my parsing algorithm to be thread-safe, so that only one instance of the parser is used (factories, ...)
  3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce XML - serial)
  4. Optimizing my algorithm (is there a better StAX parser than Woodstox?) / using a parser with built-in concurrency

I want to improve both the overall performance and the per-file performance.

Do you have experience with such problems? What is the best way to go?
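
For concreteness, the per-file work described above (extract a few nodes, then write out a copy of the document with those nodes modified) can be done in a single streaming pass with the StAX Event API. The sketch below is only an illustration of that shape, not the actual application code: the <status> element name and the process() method are made-up placeholders.

    import javax.xml.stream.*;
    import javax.xml.stream.events.XMLEvent;
    import java.io.*;

    public class RewriteCopy {

        // Factories are expensive; create them once and reuse them.
        private static final XMLInputFactory IN = XMLInputFactory.newInstance();
        private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();
        private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

        /** Copies src to dst, rewriting the text content of <status> elements. */
        public static void rewrite(File src, File dst) throws IOException, XMLStreamException {
            XMLEventReader reader = IN.createXMLEventReader(new FileInputStream(src));
            XMLEventWriter writer = OUT.createXMLEventWriter(new FileOutputStream(dst), "UTF-8");
            try {
                boolean inStatus = false;
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()
                            && "status".equals(event.asStartElement().getName().getLocalPart())) {
                        inStatus = true;
                        writer.add(event);
                    } else if (inStatus && event.isCharacters()) {
                        // Replace the extracted node's text with the processed result.
                        writer.add(EVENTS.createCharacters(process(event.asCharacters().getData())));
                    } else {
                        if (event.isEndElement()
                                && "status".equals(event.asEndElement().getName().getLocalPart())) {
                            inStatus = false;
                        }
                        writer.add(event); // everything else is copied through unchanged
                    }
                }
            } finally {
                reader.close();
                writer.close();
            }
        }

        private static String process(String original) {
            return original.trim().toUpperCase(); // placeholder for the real node processing
        }
    }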

Comments (3)

送舟行 2024-10-10 08:52:43

  1. This one is obvious: just create several parsers and run them in parallel in multiple threads.

  2. Take a look at Woodstox Performance (the site is down at the moment, try the Google cache).

  3. This can be done IF the structure of your XML is predictable: that is, if it has a lot of identical top-level elements. For instance:

    <element>
        <more>more elements</more>
    </element> 
    <element>
        <other>other elements</other>
    </element>
    

    In this case you could create a simple splitter that searches for <element> and feeds that part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start/stop points (<element>) and then create a custom FileInputStream that operates on just a part of the file (see the sketch after this list).

  4. Take a look at Aalto. It comes from the same people who created Woodstox. They are experts in this area - don't reinvent the wheel.
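
Following up on point 3: a minimal sketch (my own illustration, not part of the original answer) of the "custom FileInputStream that operates on just a part of the file" idea. Finding the chunk boundaries (e.g. by scanning for "<element" with a RandomAccessFile) is assumed to have happened elsewhere; this class only restricts reading to a byte range.

    import java.io.*;

    /**
     * Exposes only the byte range [start, end) of a file as an InputStream,
     * so each parser thread sees just its own chunk.
     */
    public class FileSliceInputStream extends InputStream {

        private final RandomAccessFile file;
        private long remaining;

        public FileSliceInputStream(File f, long start, long end) throws IOException {
            this.file = new RandomAccessFile(f, "r");
            this.file.seek(start);
            this.remaining = end - start;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;                    // the "file" ends at the chunk boundary
            }
            int b = file.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int n = file.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

Note that each slice is only a document fragment rather than a well-formed document, so the consuming parser would either have to be configured to accept fragments, or the splitter would have to wrap each slice in a synthetic root element.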

就像说晚安 2024-10-10 08:52:43

I agree with Jim. I think that if you want to improve the performance of the overall processing of 1000 files, your plan is good, except for #3, which is irrelevant in this case.
If, however, you want to improve the parsing performance of a single file, you have a problem. I do not know how it is possible to split an XML file without parsing it. Each chunk would be invalid XML, and your parser would fail.

I believe that improving the overall time will be good enough for you. In this case, read this tutorial:
http://download.oracle.com/javase/tutorial/essential/concurrency/index.html
then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then only have to parse about 10 files, which will bring a serious performance benefit in a multi-CPU environment (see the sketch below).
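
A sketch of the pool-plus-queue setup described above, assuming a hypothetical FileProcessor callback that wraps the actual Woodstox parsing; the thread count is left as a parameter (this answer suggests 100, the next one argues for roughly one thread per core).

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelParseDriver {

        /** Stand-in for the per-file Woodstox parse/rewrite step. */
        public interface FileProcessor {
            void process(File xmlFile) throws Exception;
        }

        /** Processes all files with a fixed pool of worker threads and waits for completion. */
        public static void processAll(List<File> xmlFiles, FileProcessor processor, int threads)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (File f : xmlFiles) {
                pool.execute(() -> {
                    try {
                        processor.process(f);
                    } catch (Exception e) {
                        // real code would collect failures instead of just printing them
                        System.err.println("Failed on " + f + ": " + e);
                    }
                });
            }
            pool.shutdown();                       // no new tasks; queued ones still run
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }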

画离情绘悲伤 2024-10-10 08:52:43

In addition to the existing good suggestions, there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).

Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:

  1. Make sure you only create XMLInputFactory and XMLOutputFactory instances once
  2. Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.

The reason I mention this is that while these make no functional difference (the code works as expected either way), they can make a big performance difference, especially when processing smaller files.

Running multiple instances also makes sense, although usually with at most one thread per core. However, you will only benefit as long as your storage I/O can keep up; if the disk is the bottleneck this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.
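
A minimal cursor-API sketch illustrating the two rules above: one shared XMLInputFactory and an explicit close so the parser can recycle its buffers. The element-counting body is just a placeholder for the real extraction logic.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import java.io.*;

    public class CursorScan {

        // Rule 1: create the factory once and reuse it for every file.
        private static final XMLInputFactory INPUT_FACTORY = XMLInputFactory.newInstance();

        /** Counts occurrences of an element using the cursor API (XMLStreamReader). */
        public static int countElements(File xmlFile, String localName)
                throws IOException, XMLStreamException {
            InputStream in = new FileInputStream(xmlFile);
            XMLStreamReader reader = INPUT_FACTORY.createXMLStreamReader(in);
            try {
                int count = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
                return count;
            } finally {
                reader.close();  // Rule 2: closing lets the parser recycle its buffers
                in.close();      // XMLStreamReader.close() does not close the underlying stream
            }
        }
    }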
