Why is SAX parsing faster than DOM parsing? And how does StAX work?
Somewhat related: libxml2 from Java
Yes, this question is rather long-winded, sorry. I kept it as dense as I felt possible. I bolded the questions to make them easier to skim before reading the whole thing.
Why is SAX parsing faster than DOM parsing? The only thing I can come up with is that with SAX you're probably ignoring the majority of the incoming data, and thus not wasting time processing parts of the XML you don't care about. In other words, after parsing with SAX, you can't recreate the original input. If you wrote your SAX parser so that it accounted for each and every XML node (and could thus recreate the original), then it wouldn't be any faster than DOM, would it?
The reason I'm asking is that I'm trying to parse XML documents more quickly. I need to have access to the entire XML tree AFTER parsing. I am writing a platform for third-party services to plug into, so I can't anticipate which parts of the XML document will be needed and which won't. I don't even know the structure of the incoming document. This is why I can't use JAXB or SAX. Memory footprint isn't an issue for me because the XML documents are small and I only need one in memory at a time. It's the time it takes to parse this relatively small XML document that is killing me. I haven't used StAX before, but perhaps I need to investigate further because it might be the middle ground? If I understand correctly, StAX keeps the original XML structure and processes the parts that I ask for on demand? In this way, the original parse time might be quick, but each time I ask it to traverse part of the tree it hasn't yet traversed, that's when the processing takes place?
If you provide a link that answers most of the questions, I will accept your answer (you don't have to directly answer my questions if they're already answered elsewhere).
Update: I rewrote it in SAX and it parses documents in 2.1 ms on average. This is an improvement (16% faster) over the 2.5 ms that DOM was taking; however, it is not the magnitude that I (et al.) would have guessed.
Thanks
Assuming you do nothing but parse the document, the ranking of the different parser standards is as follows:
1. StAX is the fastest
2. SAX is next
3. DOM is last
Your Use Case
DOM parsing requires you to load the entire document into memory and then traverse a tree to find the information you want.
SAX only requires as much memory as you need to do basic I/O, and you can extract the information that you need as the document is being read. Because SAX is stream-oriented, you can even process a file that is still being written by another process.
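To make the contrast concrete, here is a minimal sketch that pulls the same values out of a document both ways. The class name `DomVsSax`, the sample document, and the `item` element name are made up for illustration; only the standard JAXP classes are real. The DOM version builds the whole tree and then searches it, while the SAX version collects values as the events arrive and never holds a tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSax {
    // Hypothetical sample document, used only for this illustration.
    static final String XML = "<order><item>apple</item><item>pear</item></order>";

    // DOM: parse everything into a tree first, then walk the tree.
    static List<String> itemsWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("item");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    // SAX: react to events while the document is being read; no tree is kept.
    static List<String> itemsWithSax(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an <item>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("item".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text run, so append.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("item".equals(qName)) {
                    out.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return out;
    }
}
```

Both methods return the same list for the sample input; the difference is only in what gets allocated along the way.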
SAX is faster because DOM parsers often use a SAX parser to parse a document internally, then do the extra work of creating and manipulating objects to represent each and every node, even if the application doesn't care about them.
An application that uses SAX directly is likely to utilize the information set more efficiently than a DOM "parser" does.
StAX is a happy medium where an application gets a more convenient API than SAX's event-driven approach, yet doesn't suffer the inefficiency of creating a complete DOM.
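As a rough illustration of the pull style, the sketch below uses the standard `javax.xml.stream` cursor API. The class name `StaxPull` and the `item` element name are assumptions for the example. Note that the cursor only moves forward: StAX does not keep the document structure around, so anything you want to revisit later has to be stored by your own code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxPull {
    // Collects the text of every <item> element (hypothetical element name).
    static List<String> items(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            // The application asks for the next event instead of being called back.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // Reads the text content and leaves the cursor on the end tag.
                    out.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }
}
```

Compared with a SAX handler, the loop keeps its state in ordinary local variables, which is most of what makes the API feel more convenient.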
SAX is faster than DOM (usually noticeable when reading large XML documents) because SAX gives you information as a sequence of events (usually accessed through a handler), while DOM creates nodes and maintains the tree structure until the full DOM tree (as represented in the XML document) has been built.
For relatively small files, you won't feel the effect (except possibly for the extra processing DOM does to create Node elements and/or Node lists).
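The extra work DOM does can be sketched by building a tree on top of SAX events yourself. This is only an illustration: `SaxTreeBuilder` and its `Node` type are hypothetical, attributes, comments, and namespaces are ignored, and real DOM implementations do considerably more. Still, it shows that once a handler allocates an object per element, most of SAX's advantage over DOM goes away.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxTreeBuilder {
    // Hypothetical minimal node: element name, direct text, child elements.
    static final class Node {
        final String name;
        final StringBuilder text = new StringBuilder();
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    static Node parse(String xml) throws Exception {
        Deque<Node> open = new ArrayDeque<>(); // elements whose end tag is still pending
        Node[] root = new Node[1];             // holder so the handler can set the result

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                Node node = new Node(qName);   // one allocation per element, as a DOM builder would do
                if (open.isEmpty()) {
                    root[0] = node;
                } else {
                    open.peek().children.add(node);
                }
                open.push(node);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (!open.isEmpty()) open.peek().text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return root[0];
    }
}
```

A handler like this keeps the whole document reachable after parsing, which is what a platform with unknown consumers needs, but it pays a per-node cost similar to the one DOM pays.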
I can't really comment on StAX since I've never played with it.