When should I choose SAX over StAX?
Streaming XML parsers like SAX and StAX are faster and more memory-efficient than parsers that build a tree structure, such as DOM parsers. SAX is a push parser, meaning that it's an instance of the observer pattern (also called the listener pattern). SAX was there first, but then came StAX, a pull parser, meaning that it basically works like an iterator.
You can find reasons to prefer StAX over SAX everywhere, but it usually boils down to: "it's easier to use".
In the Java tutorial on JAXP, StAX is vaguely presented as the middle ground between DOM and SAX: "it's easier than SAX and more efficient than DOM". However, I never found any clues that StAX would be slower or less memory-efficient than SAX.
All this made me wonder: are there any reasons to choose SAX instead of StAX?
Overview
XML documents are hierarchical documents, where the same element names and namespaces might occur in several places, with different meanings, and at arbitrary depth (recursively). As usual, the solution to a big problem is to divide it into small problems. In the context of XML parsing, this means parsing specific parts of the XML in methods specific to that XML. For example, one piece of logic would parse an address element.
That is, somewhere in your logic you would have either a method A that takes XML input arguments and returns the parsed object, or a method B that takes XML input arguments and returns nothing (the result of B can be fetched from a field later).
SAX
SAX 'pushes' XML events, leaving it up to you to determine where the XML events belong in your program / data.
In the case of a 'Building' start element, you would need to determine that you are actually parsing an Address and then route the XML event to the method whose job it is to interpret Address.
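A minimal sketch of what that routing can look like with SAX (method B style). The document shape (an `Address` element with a `Building` child), the class `AddressHandler`, and its fields are assumptions made for illustration; only `DefaultHandler` and `SAXParserFactory` come from standard JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: SAX pushes every event here, so the handler itself
// has to remember whether it is currently inside an Address element.
class AddressHandler extends DefaultHandler {
    private boolean inAddress;   // parse state kept in fields between callbacks
    private boolean inBuilding;
    private final StringBuilder text = new StringBuilder();
    String building;             // the result is fetched from a field afterwards

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("Address")) {
            inAddress = true;
        } else if (inAddress && qName.equals("Building")) {
            inBuilding = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        if (inBuilding) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inBuilding && qName.equals("Building")) {
            building = text.toString();
            inBuilding = false;
        } else if (qName.equals("Address")) {
            inAddress = false;
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are wrapped.
    static String parseBuilding(String xml) {
        try {
            AddressHandler handler = new AddressHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.building;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that a `Building` element outside an `Address` is ignored only because the handler tracks `inAddress` itself; the parser gives no help with that.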
StAX
StAX 'pulls' XML events, leaving it up to you to determine where in your program / data to receive the XML events.
Of course, you would always want to receive a 'Building' event in the method whose job it is to interpret Address.
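A minimal sketch of the pull style (method A), assuming an `Address` element whose children (`Street`, `Building`) contain only text. The classes `Address` and `AddressPullParser` are hypothetical; `XMLStreamReader.nextTag()` and `getElementText()` are standard StAX calls.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical data class for the sketch.
class Address {
    String street;
    String building;
}

class AddressPullParser {
    // Called with the reader positioned on the <Address> start tag;
    // returns once the matching </Address> end tag has been read.
    static Address parseAddress(XMLStreamReader reader) throws XMLStreamException {
        Address address = new Address();
        // nextTag() skips whitespace and stops on the next start or end tag.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String name = reader.getLocalName();
            // Reads the text and leaves the reader on the child's end tag.
            // Assumption: children hold text only (a nested element would throw).
            String value = reader.getElementText();
            if (name.equals("Street")) {
                address.street = value;
            } else if (name.equals("Building")) {
                address.building = value;
            }
        }
        return address;
    }

    // Convenience wrapper for the sketch; checked exceptions are wrapped.
    static Address parse(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // move to the root <Address> start tag
                return parseAddress(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here the 'Building' event can only ever be seen inside `parseAddress`, so no state field is needed to know what it belongs to.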
Discussion
The difference between SAX and StAX is that of push and pull. In both cases, the parse state must be handled somehow.
This translates to method B being typical for SAX, and method A for StAX. In addition, SAX must give B individual XML events, while StAX can give A multiple events (by passing an XMLStreamReader instance).
Thus B first checks the previous state of the parsing, then handles each individual XML event, and then stores the state (in a field). Method A can simply handle all the XML events at once by accessing the XMLStreamReader multiple times until satisfied.
Conclusion
StAX lets you structure your parsing (data-binding) code according to the XML structure; so, compared with SAX, the 'state' is implicit in the program flow for StAX, whereas in SAX you always need to preserve some kind of state variable, and route the flow according to that state, for most event calls.
I recommend StAX for all but the simplest documents. Move to SAX later as an optimization if you need to (but you'll probably want to go binary by then).
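When parsing with StAX, follow a pattern in which each parsing method counts the nesting level: increment it on every start element, decrement it on every end element, and return once the element the method was called on has been closed.
The submethods use about the same approach, i.e. counting levels.
Eventually you reach a level at which you read the base types.
This is quite straightforward and leaves no room for misunderstanding. Just remember to decrement the level correctly:
A. after you expected characters but got an END_ELEMENT in a tag that should contain characters (in the above pattern), for example when an element that normally holds text is instead empty.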
The same is true for a missing subtree too, you get the idea.
B. after calling subparsing methods, which are called on start elements and return AFTER the corresponding end element, i.e. the parser is one level lower than before the method call (the above pattern).
Note how this approach totally ignores 'ignorable' whitespace too, for a more robust implementation.
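The level-counting pattern described above can be sketched roughly as follows. This is not the original answer's code: the document shape (`Person` containing `Name` and `Address`/`Building`), the class `LevelCountingParser`, and the use of a `Map` as the result object are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class LevelCountingParser {
    // Top-level method: reads the root tag, then counts nesting levels.
    static Map<String, String> parsePerson(XMLStreamReader reader) throws XMLStreamException {
        Map<String, String> result = new LinkedHashMap<>();
        reader.nextTag();      // root start tag read: we are at level 1
        int level = 1;
        do {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                level++;
                String name = reader.getLocalName();
                if (level == 2 && name.equals("Address")) {
                    result.put("Building", parseAddress(reader));
                    level--;   // case B: the submethod returned after </Address>
                } else if (level == 2 && name.equals("Name")) {
                    result.put("Name", readText(reader));
                    level--;   // case A: readText consumed </Name>, even if empty
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                level--;
            }
            // Whitespace, comments, and other events are simply ignored.
        } while (level > 0);
        return result;
    }

    // Submethod: same approach, counting levels relative to <Address>.
    static String parseAddress(XMLStreamReader reader) throws XMLStreamException {
        String building = null;
        int level = 1;
        do {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                level++;
                if (level == 2 && reader.getLocalName().equals("Building")) {
                    building = readText(reader);
                    level--;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                level--;
            }
        } while (level > 0);
        return building;
    }

    // Base type: collects text until the end tag; an empty element yields "".
    static String readText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                sb.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                return sb.toString();
            } else if (event == XMLStreamConstants.START_ELEMENT) {
                throw new XMLStreamException("unexpected child element", reader.getLocation());
            }
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are wrapped.
    static Map<String, String> parse(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                return parsePerson(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Unknown elements need no special handling: their start tag raises the level and their end tag lowers it again, so they are skipped as a side effect of the counting.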
Parsers
Go with Woodstox for most features or Aalto-xml for speed.
To generalize a bit, I think StAX can be as efficient as SAX. With the improved design of StAX, I can't really find any situation where SAX parsing would be preferred, unless you are working with legacy code.
EDIT: According to this blog, Java SAX vs. StAX, StAX offers no schema validation.
@Rinke: I guess the only time I would prefer SAX over StAX is when you don't need to handle or process the XML content; e.g. the only thing you want to do is check the well-formedness of incoming XML and handle errors if there are any. In that case you can simply call the parse() method on a SAX parser and specify an error handler to handle any parsing problem. So basically StAX is definitely the preferable choice in scenarios where you want to handle content, because a SAX content handler is too difficult to code.
One practical example of this case: if you have a series of SOAP nodes in your enterprise system and an entry-level SOAP node only lets well-formed SOAP XML pass through to the next stage, then I don't see any reason why I would use StAX. I would just use SAX.
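A minimal sketch of such a well-formedness check, assuming the XML arrives as a `String`. The class name `WellFormedCheck` is hypothetical. The sketch relies on the fact that `DefaultHandler.fatalError` rethrows the `SAXParseException` by default, so no content callbacks need to be written at all.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the document parses without a fatal (well-formedness) error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores all content events; its fatalError()
            // throws, which surfaces here as a SAXException.
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For untrusted input you would probably also want to restrict DTD and external entity processing on the factory; that is left out here to keep the sketch short.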
It's all a balance.
You can turn a SAX parser into a pull parser using a blocking queue and some thread trickery so, to me, there is much less difference than it first seems.
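A rough sketch of that idea, not the code this answer's author actually wrote: a background thread runs the SAX parser and pushes events into a bounded `BlockingQueue`, while the consumer pulls them with `next()`. The class `SaxPullAdapter`, the string encoding of events, and the `END` sentinel are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPullAdapter {
    static final String END = "END_DOCUMENT"; // sentinel marking the end of the stream

    // Bounded queue: the SAX thread blocks when the consumer falls behind.
    private final BlockingQueue<String> events = new ArrayBlockingQueue<>(16);

    SaxPullAdapter(String xml) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(xml)),
                        new DefaultHandler() {
                            @Override
                            public void startElement(String uri, String localName,
                                                     String qName, Attributes atts) {
                                put("START " + qName);
                            }

                            @Override
                            public void endElement(String uri, String localName, String qName) {
                                put("END " + qName);
                            }
                        });
            } catch (Exception e) {
                put("ERROR " + e.getMessage()); // hand the failure to the consumer
            } finally {
                put(END);
            }
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer stops early
        producer.start();
    }

    private void put(String event) {
        try {
            events.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Pull side: blocks until the SAX thread has produced the next event.
    // After END has been returned, no further events will arrive.
    String next() {
        try {
            return events.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The cost of this design is an extra thread and a queue hand-off per event, which is part of why a native pull parser is usually the simpler choice when one is available.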
I believe currently StAX needs to be packaged through a third-party jar while SAX comes free in javax.
I recently chose SAX and built a pull parser around it so I did not need to rely on a third-party jar.
Future versions of Java will almost certainly contain a StAX implementation so the problem goes away.
If you really need maximum speed and the XML isn't complicated, just write your own parser. Of course you have to understand a bit about optimizing character and string processing, but we did it with a screen of code, took an extra day plus a few bugs, and got a 20% increase over an out-of-the-box streaming-based solution like SAX / StAX, because those have to be general purpose.
'Write it yourself' isn't the first go-to solution in most cases, but this is why these libraries are out of favor today vs. a full parser. If you really do need speed, a solution wrapped tightly around the specific requirements (we even counted on the XML file having a certain structure: we'd never need another, and it sped things up) is sometimes required.
Fortunately ... in our line of business (financial) ... getting the best speed is worth the cost like 1% of the time :-) This time the savings were worth it, and the only other time I was involved in something like this it turned out to be best to use a custom serialization format tailored to the requirements and performance.
All the other times, we just reach for the regular parser.
StAX enables you to create bidirectional XML parsers that are fast. It proves to be a better alternative to other methods, such as DOM and SAX, both in terms of performance and usability.
You can read more about StAX in Java StAX Tutorials
Most of the information provided by those answers is somewhat outdated... there has been a comprehensive study of all XML parsing libraries in this 2013 research paper... read it and you will easily see the clear winner (hint: there is only one true winner)...
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf