Loading a local chunk into a DOM while parsing a large XML file with SAX (Java)
I have an XML file that I would like to avoid loading entirely into memory.
As everyone knows, for such a file I am better off using a SAX parser (which walks through the file and fires events when something relevant is found).
My current problem is that I would like to process the file "by chunk", which means:
- Parse the file and find a relevant tag (node)
- Load this tag entirely into memory (like we would do in DOM)
- Process this entity (the local chunk)
- When I'm done with the chunk, release it and go back to the first step (until the end of the file)
In a perfect world, I'm looking for something like this:
// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to call back the right method once a node is found
p.setHandler(new Handler() {
    // 4. The method called back by the parser when a relevant node is found
    void aNodeIsFound(saxNode aNode)
    {
        // 5. Inflate the current node, i.e. load it (and all its content) in memory
        DomNode d = aNode.expand();
        // 6. Do something with the inflated node (method to be defined somewhere)
        doThingWithNode(d);
    }
});
// 7. Start the parser
p.start();
I'm currently stuck on how to expand a "SAX node" (if you see what I mean…) efficiently.
Is there any Java framework or library relevant to this kind of task?
UPDATE
You could also just use the javax.xml.xpath APIs:
Below is a sample of how it could be done with StAX.
input.xml
Below is some sample XML:
Demo
In this example a StAX XMLStreamReader is used to find the node that will be converted to a DOM. In this example we convert each statement fragment to a DOM, but your navigation algorithm could be more advanced.
Output
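A minimal sketch of the StAX approach described in this answer, assuming the chunks of interest are elements named statement. The class name StaxChunkDemo, the method forEachChunk and the inline sample XML are hypothetical and chosen only for illustration. An identity Transformer copies the element the XMLStreamReader is positioned on into a fresh DOM via StAXSource.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StaxChunkDemo {

    /** Streams through the input and hands each matching element to the handler as a DOM Element. */
    static void forEachChunk(Reader in, String elementName, Consumer<Element> handler)
            throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();

        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && elementName.equals(reader.getLocalName())) {
                // Copy only this element (and its subtree) into a small DOM.
                DOMResult result = new DOMResult();
                identity.transform(new StAXSource(reader), result);
                handler.accept(((Document) result.getNode()).getDocumentElement());
                // Deliberately no reader.next() here: the transformer may already have
                // moved the reader past the end tag, so the current event is re-checked.
            } else {
                reader.next();
            }
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input standing in for a large file.
        String xml = "<statements><statement>a</statement><statement>b</statement></statements>";
        forEachChunk(new StringReader(xml), "statement",
                chunk -> System.out.println(chunk.getTextContent()));
    }
}
```

Each chunk becomes eligible for garbage collection once the handler returns, so memory use should stay proportional to the largest single chunk rather than to the whole file. Inside the handler, the javax.xml.xpath APIs can be applied to the chunk like to any other DOM node.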
It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple /-delimited path) you'd need to maintain a path to your current node by adding entries to a String on new elements and cutting off entries on an end tag. A boolean flag can suffice to track whether you're currently in "relevant mode" or not.
As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.
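The buffering scheme just described could be sketched roughly as follows. The class name PathChunker, the method forEachChunk and the sample XML are hypothetical, and the sketch assumes the chunks do not rely on namespace declarations made on elements outside them. A non-null writer plays the role of the "relevant mode" flag.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PathChunker {

    /** Calls the handler with a small DOM for every element whose /-delimited path equals targetPath. */
    static void forEachChunk(Reader in, String targetPath, Consumer<Document> handler)
            throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        StringBuilder path = new StringBuilder(); // path of the current element, e.g. "/a/b"
        StringWriter buffer = null;               // placeholder that receives the copied events
        XMLEventWriter writer = null;             // non-null means "relevant mode"

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                path.append('/').append(event.asStartElement().getName().getLocalPart());
                if (writer == null && targetPath.contentEquals(path)) {
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                }
            }
            if (writer != null) {
                writer.add(event); // copy the event into the current chunk
            }
            if (event.isEndElement()) {
                if (writer != null && targetPath.contentEquals(path)) {
                    // The chunk is complete: parse the buffered text into a DOM.
                    writer.flush();
                    writer.close();
                    writer = null;
                    handler.accept(builder.parse(
                            new InputSource(new StringReader(buffer.toString()))));
                }
                path.setLength(path.lastIndexOf("/")); // cut the last path entry
            }
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input; only /root/items/item elements are extracted.
        String xml = "<root><items><item id='1'>a</item><item id='2'><sub>b</sub></item></items>"
                + "<other><item id='3'>skip</item></other></root>";
        forEachChunk(new StringReader(xml), "/root/items/item",
                doc -> System.out.println(doc.getDocumentElement().getAttribute("id")));
    }
}
```

Serializing to text and parsing again costs an extra pass over each chunk, which is the price of keeping the selection logic this simple.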
The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.
StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.
There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.
Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.
Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.
When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.
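The plain stylesheet-filter variant (without extension functions or extension elements) could be sketched as below. The class name XsltFilterDemo, the element names, the type attribute and the wrapper element extract are all hypothetical. The sketch only shows the selection step with a real XPath 1.0 predicate; whether the processor streams the input or builds a full tree internally depends on the implementation and is worth measuring on a large file.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XsltFilterDemo {

    // Hypothetical stylesheet: copies only the relevant <item> elements into an <extract> wrapper.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='/'>"
            + "<extract><xsl:copy-of select=\"/root/items/item[@type='relevant']\"/></extract>"
            + "</xsl:template></xsl:stylesheet>";

    /** Runs the stylesheet over the input and returns the selected fragments as DOM nodes. */
    static NodeList filter(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        DOMResult result = new DOMResult();
        transformer.transform(new StreamSource(new StringReader(xml)), result);
        Document extract = (Document) result.getNode();
        return extract.getDocumentElement().getChildNodes();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input.
        String xml = "<root><items><item type='relevant'>a</item><item type='other'>b</item>"
                + "<item type='relevant'>c</item></items></root>";
        NodeList fragments = filter(xml);
        for (int i = 0; i < fragments.getLength(); i++) {
            System.out.println(fragments.item(i).getTextContent());
        }
    }
}
```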
Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.
Good luck!
EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:
And the sample.xml file (should be in the same package):
EDIT 2: Just noticed in Blaise Doughan's answer that there's a StAXSource. That'll be even more efficient. Use that if you're going with StAX; it eliminates the need to keep a buffer. StAX allows you to "peek" at the next event, so you can check whether it's a start element with the right path without consuming it before passing it into the transformer.
OK, thanks to your pieces of code, I finally ended up with my solution:
Usage is quite intuitive:
Note:
All of the code is here:
It's been a little while since I did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then run your process, clear it out and look for the next start tag.
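A rough sketch of that pure SAX variant: a handler that builds a small DOM between the start and end tag of each target element and then drops it. The class name SaxChunkHandler, its static parse helper and the sample XML are hypothetical, and the sketch assumes a parser that is not namespace-aware (the SAXParserFactory default), so elements are matched by qName.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxChunkHandler extends DefaultHandler {
    private final String targetName;
    private final Consumer<Element> callback;
    private final DocumentBuilder builder;
    private Document doc; // non-null only while inside a target element
    private Node current; // DOM node currently being filled

    SaxChunkHandler(String targetName, Consumer<Element> callback) throws Exception {
        this.targetName = targetName;
        this.callback = callback;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(targetName)) {
                return; // outside any chunk and not a start tag of interest
            }
            doc = builder.newDocument(); // start a fresh chunk
            current = doc;
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (doc == null) {
            return;
        }
        current = current.getParentNode();
        if (current == doc) { // the end tag of the chunk root was just seen
            callback.accept(doc.getDocumentElement());
            doc = null;       // clear the chunk and wait for the next start tag
            current = null;
        }
    }

    /** Convenience helper: parses the XML text and reports every targetName element. */
    static void parse(String xml, String targetName, Consumer<Element> callback) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new SaxChunkHandler(targetName, callback));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input.
        String xml = "<root><item id='1'>a<b>x</b></item><skip/><item id='2'>c</item></root>";
        parse(xml, "item", e -> System.out.println(e.getAttribute("id")));
    }
}
```

Comments and processing instructions inside a chunk are ignored by this sketch; a LexicalHandler would be needed to keep them.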