When should I choose SAX over StAX?

Posted on 2024-12-06 13:32:58

Streaming XML parsers such as SAX and StAX are faster and more memory-efficient than parsers that build a tree structure, such as DOM parsers. SAX is a push parser, meaning that it is an instance of the observer pattern (also called the listener pattern). SAX came first, but then came StAX, a pull parser, meaning that it basically works like an iterator.

You can find reasons to prefer StAX over SAX everywhere, but they usually boil down to: "it's easier to use".

In the Java tutorial on JAXP, StAX is vaguely presented as the middle ground between DOM and SAX: "it's easier than SAX and more efficient than DOM". However, I have never found any indication that StAX would be slower or less memory-efficient than SAX.

All this made me wonder: are there any reasons to choose SAX instead of StAX?

Comments (7)

淡墨 2024-12-13 13:32:58

Overview
XML documents are hierarchical documents in which the same element names and namespaces can occur in several places, with different meanings, and at arbitrary (recursive) depth. As usual, the solution to a big problem is to divide it into small problems. In the context of XML parsing, this means parsing specific parts of the XML in methods specific to that XML. For example, one piece of logic would parse an address:

<Address>
    <Street>Odins vei</Street>    
    <Building>4</Building>
    <Door>b</Door>
</Address>

i.e. you would have a method

AddressType parseAddress(...); // A

or

void parseAddress(...); // B

somewhere in your logic, taking XML input as arguments and returning an object (the result of B can be fetched from a field later).

SAX
SAX 'pushes' XML events, leaving it up to you to determine where the XML events belong in your program / data.

// method in stock SAX handler
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    // .. your logic here for start element
}

In the case of a 'Building' start element, you would need to determine that you are actually parsing an Address and then route the XML event to the method whose job it is to interpret Address.
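To make the routing problem concrete, here is a minimal sketch of a SAX handler for the Address fragment above. The class names (SaxAddressExample, Address, AddressHandler) and the stack-of-element-names approach are assumptions chosen for illustration, not something the answer prescribes. The handler has to remember where it is in the document, and the result is fetched from a field afterwards (method B style).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxAddressExample {

    // Hypothetical value holder, filled in by the handler.
    static class Address {
        String street, building, door;
    }

    static class AddressHandler extends DefaultHandler {
        // Explicit parse state: the names of the currently open elements.
        private final Deque<String> path = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();
        Address address; // result, read from this field after parsing

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("Address")) {
                address = new Address();
            }
            path.push(qName);
            text.setLength(0); // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.pop();
            // Route the event: only interpret it if the parent element is Address.
            if (address != null && "Address".equals(path.peek())) {
                switch (qName) {
                    case "Street": address.street = text.toString(); break;
                    case "Building": address.building = text.toString(); break;
                    case "Door": address.door = text.toString(); break;
                    default: break; // unknown children are ignored in this sketch
                }
            }
        }
    }

    static Address parse(String xml) throws Exception {
        AddressHandler handler = new AddressHandler();
        // The default factory is not namespace-aware, so qName carries the element name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.address;
    }

    public static void main(String[] args) throws Exception {
        Address a = parse("<Address><Street>Odins vei</Street>"
                + "<Building>4</Building><Door>b</Door></Address>");
        System.out.println(a.street + " " + a.building + a.door);
    }
}
```

Note that the state (the path stack and the text buffer) lives in fields because each callback sees only one event.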

StAX
StAX 'pulls' XML events, leaving it up to you to determine where in your program / data to receive the XML events.

// method in standard StAX reader
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
    // .. your logic here for start element
}

Of course, you would always want to receive a 'Building' event in the method whose job it is to interpret Address.
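For comparison, a minimal sketch of method A with StAX for the same Address fragment. The names StaxAddressExample, Address and parseAddress are hypothetical, and rejecting unknown child elements is a simplification made for this sketch. Here the 'Building' event can only ever arrive inside parseAddress, so no separate state variable is needed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxAddressExample {

    // Hypothetical value holder returned by the parse method.
    static class Address {
        String street, building, door;
    }

    // Method A: called with the reader positioned on <Address>,
    // returns with the reader positioned on the matching </Address>.
    static Address parseAddress(XMLStreamReader reader) throws XMLStreamException {
        Address address = new Address();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // getElementText() also consumes the child's end element.
                switch (reader.getLocalName()) {
                    case "Street": address.street = reader.getElementText(); break;
                    case "Building": address.building = reader.getElementText(); break;
                    case "Door": address.door = reader.getElementText(); break;
                    default:
                        throw new XMLStreamException("Unexpected element " + reader.getLocalName());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                return address; // every child end tag was consumed above, so this is </Address>
            }
            // whitespace, comments and other events are simply skipped
        }
        throw new XMLStreamException("Document ended inside Address");
    }

    static Address parse(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Address".equals(reader.getLocalName())) {
                    return parseAddress(reader);
                }
            }
            throw new XMLStreamException("No Address element found");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        Address a = parse("<Address><Street>Odins vei</Street>"
                + "<Building>4</Building><Door>b</Door></Address>");
        System.out.println(a.street + " " + a.building + a.door);
    }
}
```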

Discussion
The difference between SAX and StAX is that of push and pull. In both cases, the parse state must be handled somehow.

This translates to method B as typical for SAX, and method A for StAX. In addition, SAX must give B individual XML events, while StAX can give A multiple events (by passing an XMLStreamReader instance).

Thus B first checks the previous parse state, then handles each individual XML event, and then stores the state (in a field). Method A can handle the XML events all at once by accessing the XMLStreamReader multiple times until it is satisfied.

Conclusion
StAX lets you structure your parsing (data-binding) code according to the XML structure. Compared with SAX, the 'state' in StAX is implicit in the program flow, whereas in SAX you always need to preserve some kind of state variable and route the flow according to that state for most event calls.

I recommend StAX for all but the simplest documents. Move to SAX later only as an optimization (but you'll probably want to go binary by then).

Follow this pattern when parsing using StAX:

public MyDataBindingObject parse(..) throws XMLStreamException { // provide input stream, reader, etc

        // set up parser
        // read the root tag to get to level 1
        XMLStreamReader reader = ....;

        do {
            int event = reader.next();
            if(event == XMLStreamConstants.START_ELEMENT) {
              // check if correct root tag
              break;
            }

            // add check for document end if you want to

        } while(reader.hasNext());

        MyDataBindingObject object = new MyDataBindingObject();
        // read root attributes if any

        int level = 1; // we are at level 1, since we have read the root start element

        do {
            int event = reader.next();
            if(event == XMLStreamConstants.START_ELEMENT) {
                level++;
                // do stateful stuff here

                // for child logic:
                if(reader.getLocalName().equals("Whatever1")) {
                    WhateverObject child = parseSubTreeForWhatever(reader);
                    level --; // read from level 1 to 0 in submethod.

                    // do something with the result of subtree
                    object.setWhatever(child);
                }

                // alternatively, faster
                if(level == 2) {
                    WhateverObject child = parseSubTreeForWhateverAtRelativeLevel2(reader);
                    level --; // read from level 1 to 0 in submethod.

                    // do something with the result of subtree
                    object.setWhatever(child);
                }


            } else if(event == XMLStreamConstants.END_ELEMENT) {
                level--;
                // do stateful stuff here, too
            }

        } while(level > 0);

        return object;
}

So the submethod uses about the same approach, i.e. counting level:

private MySubTreeObject parseSubTree(XMLStreamReader reader) throws XMLStreamException {

    MySubTreeObject object = new MySubTreeObject();
    // read element attributes if any

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;
            // do stateful stuff here

            // for child logic:
            if(reader.getLocalName().equals("Whatever2")) {
                MyWhateverObject child = parseMySubelementTree(reader);
                level --; // read from level 1 to 0 in submethod.

                // use subtree object somehow
                object.setWhatever(child);
            }

            // alternatively, faster, but less strict
            if(level == 2) {
                MyWhateverObject child = parseMySubelementTree(reader);
                level --; // read from level 1 to 0 in submethod.

                // use subtree object somehow
                object.setWhatever(child);
            }


        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
            // do stateful stuff here, too
        }

    } while(level > 0);

    return object;
}

And then eventually you reach a level in which you will read the base types.

private MySetterGetterObject parseSubTree(XMLStreamReader reader) throws XMLStreamException {

    MySetterGetterObject myObject = new MySetterGetterObject();
    // read element attributes if any

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;

            // assume <FirstName>Thomas</FirstName>:
            if(reader.getLocalName().equals("FirstName")) {
               // read tag contents
               String text = reader.getElementText();
               if(text.length() > 0) {
                    myObject.setName(text);
               }
               level--;

            } else if(reader.getLocalName().equals("LastName")) {
               // etc ..
            } 


        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
            // do stateful stuff here, too
        }

    } while(level > 0);

    // verify that all required fields in myObject are present

    return myObject;
}

This is quite straightforward and there is no room for misunderstandings. Just remember to decrement level correctly:

A. After you expected characters but got an END_ELEMENT in a tag that should contain text (as in the pattern above), for example when:

<Name>Thomas</Name>

was instead

<Name></Name>

The same applies to a missing subtree.

B. After calling sub-parsing methods, which are called on start elements and return AFTER the corresponding end element, i.e. the parser is one level lower than before the method call (as in the pattern above).

Note that this approach also ignores 'ignorable' whitespace entirely, which makes the implementation more robust.
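Case A can be checked with a small sketch. The class and method names (ElementTextExample, readName) are made up for illustration. It shows that XMLStreamReader.getElementText() leaves the reader on the END_ELEMENT whether the tag contains text or is empty, so the level has to be decremented in both situations.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementTextExample {

    // Reads the text of the root element and verifies where the reader ends up.
    static String readName(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag();                      // move from START_DOCUMENT to the root start tag
            String text = reader.getElementText(); // consumes text up to and including the end tag
            if (reader.getEventType() != XMLStreamConstants.END_ELEMENT) {
                throw new IllegalStateException("reader is not on END_ELEMENT");
            }
            return text;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("[" + readName("<Name>Thomas</Name>") + "]");
        System.out.println("[" + readName("<Name></Name>") + "]");
    }
}
```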

Parsers
Go with Woodstox for most features or Aalto-xml for speed.

烦人精 2024-12-13 13:32:58

To generalize a bit, I think StAX can be as efficient as SAX. With the improved design of StAX I can't really find any situation where SAX parsing would be preferred, unless working with legacy code.

EDIT: According to the blog post "Java SAX vs. StAX", StAX offers no schema validation.
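The StAX reader interfaces themselves have no validation hooks, but one possible workaround is the separate JAXP validation API (javax.xml.validation), whose Validator accepts a StAXSource. The sketch below assumes that approach; the class name, the inline schema and the isValid helper are invented for illustration and are not taken from the blog post.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class StaxValidationExample {

    // Illustrative schema: an Address with Street, Building and Door, in that order.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='Address'><xs:complexType><xs:sequence>"
          + "<xs:element name='Street' type='xs:string'/>"
          + "<xs:element name='Building' type='xs:string'/>"
          + "<xs:element name='Door' type='xs:string'/>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        // A freshly created reader is at START_DOCUMENT, which StAXSource requires.
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StAXSource(reader));
            return true;
        } catch (SAXException e) {
            return false;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(
                "<Address><Street>Odins vei</Street><Building>4</Building><Door>b</Door></Address>"));
        System.out.println(isValid("<Address><Street>Odins vei</Street></Address>"));
    }
}
```

This validates the document as a separate pass; it does not interleave validation with your own pull loop.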

撑一把青伞 2024-12-13 13:32:58

@Rinke: I guess the only time I would prefer SAX over StAX is when you don't need to handle or process the XML content. For example, if all you want to do is check that incoming XML is well-formed and handle any errors, you can simply call the parse() method on a SAX parser and specify an error handler to deal with any parsing problems. So StAX is definitely the preferable choice in scenarios where you want to handle content, because a SAX content handler is too difficult to code.

One practical example: if you have a series of SOAP nodes in your enterprise system and an entry-level SOAP node only lets well-formed SOAP XML pass through to the next stage, then I don't see any reason to use StAX. I would just use SAX.
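The well-formedness check described above can be sketched in a few lines. The class and method names (WellFormedCheck, isWellFormed) are hypothetical, and returning a boolean instead of reporting the error details is a simplification.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if the XML parses without a fatal error, false otherwise.
    static boolean isWellFormed(String xml) throws IOException, ParserConfigurationException {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler ignores all content events; its fatalError() rethrows
            // the SAXParseException, which is all this check needs.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<a><b/></a>"));
        System.out.println(isWellFormed("<a><b></a>"));
    }
}
```

To report line and column numbers, one could instead override fatalError() in the handler and read them from the SAXParseException it receives.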

女皇必胜 2024-12-13 13:32:58

It's all a balance.

You can turn a SAX parser into a pull parser using a blocking queue and some thread trickery so, to me, there is much less difference than there first seems.
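A rough sketch of that idea, under several assumptions: the adapter class SaxPullAdapter, its next() method and the sentinel string are invented here, and only start-element names are forwarded to keep it short. A background thread runs the SAX parse and blocks on a bounded queue, so the consumer effectively pulls events at its own pace.

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPullAdapter {

    // Sentinel that cannot be a real element name; marks the end of the stream.
    private static final String END = "</end-of-document>";

    // Small bounded queue: the SAX thread blocks until the consumer catches up.
    private final BlockingQueue<String> events = new ArrayBlockingQueue<>(16);

    SaxPullAdapter(String xml) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(xml)),
                        new DefaultHandler() {
                            @Override
                            public void startElement(String uri, String localName,
                                                     String qName, Attributes atts) {
                                put(qName); // push event becomes a queued item
                            }
                        });
            } catch (Exception e) {
                // A fuller adapter would pass the failure on to the consumer;
                // this sketch simply ends the stream.
            } finally {
                put(END);
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    private void put(String event) {
        try {
            events.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Pull side: returns the next start-element name, or null at the end of the document.
    // Calling it again after null would block, since nothing more is produced.
    String next() throws InterruptedException {
        String event = events.take();
        return END.equals(event) ? null : event;
    }

    public static void main(String[] args) throws InterruptedException {
        SaxPullAdapter pull = new SaxPullAdapter("<a><b/><c/></a>");
        for (String name = pull.next(); name != null; name = pull.next()) {
            System.out.println(name);
        }
    }
}
```

The cost of this design is an extra thread and a hand-off per event, which is worth keeping in mind when comparing it with a native pull parser.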

I believe currently StAX needs to be packaged through a third-party jar while SAX comes free in javax.

I recently chose SAX and built a pull parser around it so I did not need to rely on a third-party jar.

Future versions of Java will almost certainly contain a StAX implementation so the problem goes away.

牵你的手,一向走下去 2024-12-13 13:32:58

If you really need maximum speed and the XML isn't complicated, just write your own parser. Of course you have to understand a bit about optimizing character and string processing, but we did it in about a screenful of code, spent an extra day plus a few bugs, and got a 20% speedup over an OOTB streaming-based solution like SAX / StAX, because those have to be general purpose.

'Write it yourself' isn't the first go-to solution in most cases, but this is why these libraries are out of favor today vs. a full parser. If you really do need speed, a solution wrapped tightly around the specific requirements is sometimes required (we even counted on the XML file having a certain structure: we'd never need another, and it sped things up).

Fortunately ... in our line of business (financial) ... getting the best speed is worth the cost about 1% of the time :-) This time the savings were worth it, and the only other time I was involved in something like this, it turned out to be best to use a custom serialization format tailored to the requirements and performance.

All the other times, we just reach for the regular parser.

坠似风落 2024-12-13 13:32:58

StAX enables you to create bidirectional XML parsers that are fast. It proves a better alternative to other methods, such as DOM and SAX, both in terms of performance and usability.

You can read more about StAX in Java StAX Tutorials

风吹短裙飘 2024-12-13 13:32:58

Most of the information provided by these answers is somewhat outdated. There is a comprehensive study of all XML parsing libraries in this 2013 research paper. Read it and you will easily see the clear winner (hint: there is only one true winner).

http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
