Performance difference between StAX and DOM parsing

Posted 2024-08-26 06:16:22

I have been using DOM for a long time, and its parsing performance has been pretty good. Even with XML files of about 4-7 MB, parsing has been fast. The issue we face with DOM is the memory footprint, which becomes huge as soon as we start dealing with large XML files.

Lately I tried moving to StAX (the Streaming API for XML), which is supposed to be a second-generation parser (what I read about StAX said it is the fastest parser now). When I tried a StAX parser on a large XML file of about 4 MB, the memory footprint definitely dropped drastically, but the time taken to parse the entire XML and create Java objects out of it increased almost 5 times compared with DOM.

I used the sjsxp.jar implementation of StAX.

I can deduce logically, to some extent, that performance may not be extremely good due to the streaming nature of the parser, but a 5x slowdown (e.g. DOM takes about 8 seconds to build objects for this XML, whereas StAX parsing took about 40 seconds on average) is definitely not going to be acceptable.

Am I missing some point here completely? I am not able to come to terms with these performance numbers.

Comments (4)

她说她爱他 2024-09-02 06:16:22
package parsers;

/**
 *
 * @author Arthur Kushman
 */

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Element;


public class DOMTest {

  public static void main(String[] args) {
  long time1 = System.currentTimeMillis();
   try {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(new File("/Users/macpro/Desktop/myxml.xml"));
    doc.getDocumentElement().normalize();
    // System.out.println("Root Element: "+doc.getDocumentElement().getNodeName());
    NodeList nodeList = doc.getElementsByTagName("input");
    // System.out.println("Information of all elements in input");

    for (int s=0;s<nodeList.getLength();s++) {
      Node firstNode = nodeList.item(s);
      if (firstNode.getNodeType() == Node.ELEMENT_NODE) {
        Element firstElement = (Element)firstNode;
        NodeList hrefElements = firstElement.getElementsByTagName("href");
        if (hrefElements.getLength() > 0) {
          // Read the text of the first <href>. The original indexed the child
          // nodes with the loop counter s, which yields null once s exceeds
          // the number of child nodes.
          System.out.println("First Name: " + hrefElements.item(0).getTextContent());
        }
      }
    }


   } catch (Exception ex) {
    System.out.println(ex.getMessage());
    System.exit(1);
   }
  long time2 = System.currentTimeMillis() - time1;
  System.out.println(time2);
  }

}

45 ms

package parsers;

/**
 *
 * @author Arthur Kushman
 */
import javax.xml.stream.*;
import java.io.*;
import javax.xml.namespace.QName;

public class StAXTest {

  public static void main(String[] args) throws Exception {
  long time1 = System.currentTimeMillis();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // factory.setXMLReporter(myXMLReporter);
    XMLStreamReader reader = factory.createXMLStreamReader(
            new FileInputStream(
            new File("/Users/macpro/Desktop/myxml.xml")));

    /*String encoding = reader.getEncoding();

    System.out.println("Encoding: "+encoding);

    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        QName element = reader.getName();
        // String text = reader.getText();
        System.out.println("Element: "+element);
        // while (event != XMLStreamConstants.END_ELEMENT) {
          System.out.println("Text: "+reader.getLocalName());
        // }
      }
    }*/

  try {
    int inElement = 0;
    for (int event = reader.next();event != XMLStreamConstants.END_DOCUMENT;
    event = reader.next()) {
      switch (event) {
        case XMLStreamConstants.START_ELEMENT:
          if (isElement(reader.getLocalName(), "href")) {
            inElement++;
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          if (isElement(reader.getLocalName(), "href")) {
            inElement--;
            if (inElement == 0) System.out.println();
          }
          break;
        case XMLStreamConstants.CHARACTERS:
          if (inElement>0) System.out.println(reader.getText());
          break;
        case XMLStreamConstants.CDATA:
          if (inElement>0)  System.out.println(reader.getText());
          break;
      }
    }
    reader.close();
  } catch (XMLStreamException ex) {
    System.out.println(ex.getMessage());
    System.exit(1);
  }
    // System.out.println(System.currentTimeMillis());
    long time2 = System.currentTimeMillis() - time1;
    System.out.println(time2);
 }

  public static boolean isElement(String name, String element) {
    return name.equals(element);
  }

}

23 ms

StAX wins =)
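
A single run timed with System.currentTimeMillis() also includes class loading and JIT warm-up, so two figures this small are hard to compare. The following is a minimal sketch, not the benchmark used above: it parses the same in-memory document several times with each API and reports an average. The class name ParserTimingSketch, the synthetic document produced by sampleXml, and the round counts are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;

public class ParserTimingSketch {

  /** Builds a synthetic document with `count` <input><href>...</href></input> records. */
  public static byte[] sampleXml(int count) {
    StringBuilder sb = new StringBuilder("<inputs>");
    for (int i = 0; i < count; i++) {
      sb.append("<input><href>http://example.com/").append(i).append("</href></input>");
    }
    return sb.append("</inputs>").toString().getBytes(StandardCharsets.UTF_8);
  }

  /** Builds the full DOM tree, then counts <href> elements. */
  public static int countWithDom(byte[] xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml));
    return doc.getElementsByTagName("href").getLength();
  }

  /** Streams through the document and counts <href> start tags. */
  public static int countWithStax(byte[] xml) throws Exception {
    XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new ByteArrayInputStream(xml));
    int count = 0;
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "href".equals(reader.getLocalName())) {
          count++;
        }
      }
    } finally {
      reader.close();
    }
    return count;
  }

  public static void main(String[] args) throws Exception {
    byte[] xml = sampleXml(50_000);

    // Warm-up rounds, so class loading and most JIT compilation
    // happen before the timed rounds.
    for (int i = 0; i < 5; i++) {
      countWithDom(xml);
      countWithStax(xml);
    }

    int rounds = 10;
    long domNanos = 0;
    long staxNanos = 0;
    for (int i = 0; i < rounds; i++) {
      long t0 = System.nanoTime();
      countWithDom(xml);
      long t1 = System.nanoTime();
      countWithStax(xml);
      long t2 = System.nanoTime();
      domNanos += t1 - t0;
      staxNanos += t2 - t1;
    }
    System.out.printf("DOM  avg: %.2f ms%n", domNanos / 1e6 / rounds);
    System.out.printf("StAX avg: %.2f ms%n", staxNanos / 1e6 / rounds);
  }
}
```

Neither method prints inside the timed region, which keeps console I/O out of the measurement. The numbers will still vary by JVM and parser implementation.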

微暖i 2024-09-02 06:16:22

Although the question lacks some details, I am pretty sure the answer is that parsing is not what is slow in either case (DOM is not a parser; DOM trees are typically built using SAX or StAX parsers), but rather the code on top of it that creates objects.

There are efficient automatic data binders, including JAXB (and, with proper settings, XStream), which could help. They are faster than DOM because the main performance problem with DOM (and JDOM, Dom4j and XOM) is that tree models are inherently expensive compared to POJOs, especially in memory usage: they are basically glorified HashMaps, with lots of pointers for convenient untyped traversal.

As for parsers, Woodstox is a faster StAX parser than Sjsxp, and Aalto is even faster if raw speed is of the essence. But I doubt the main issue here is parser speed.
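
One way to check where the time goes is to time the parser alone and then time the same pass with object creation added. The following is a minimal sketch under stated assumptions, not the asker's code: the Link class, the sampleXml document, and the class name StaxBindingSketch are hypothetical, since the question does not show its object model. It binds by hand with the standard XMLStreamReader rather than with JAXB.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxBindingSketch {

  /** Hypothetical POJO standing in for the application's own objects. */
  public static final class Link {
    public final String href;

    public Link(String href) {
      this.href = href;
    }
  }

  /** Builds a synthetic document with `count` <input><href>...</href></input> records. */
  public static String sampleXml(int count) {
    StringBuilder sb = new StringBuilder("<inputs>");
    for (int i = 0; i < count; i++) {
      sb.append("<input><href>http://example.com/").append(i).append("</href></input>");
    }
    return sb.append("</inputs>").toString();
  }

  /** Pass 1: pull every event and discard it, to time the parser by itself. */
  public static int countEvents(String xml) throws XMLStreamException {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    int events = 0;
    try {
      while (reader.hasNext()) {
        reader.next();
        events++;
      }
    } finally {
      reader.close();
    }
    return events;
  }

  /** Pass 2: the same streaming loop, but each <href> becomes a Link object. */
  public static List<Link> readLinks(String xml) throws XMLStreamException {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    List<Link> links = new ArrayList<>();
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "href".equals(reader.getLocalName())) {
          // getElementText() reads the text and leaves the reader on </href>.
          links.add(new Link(reader.getElementText()));
        }
      }
    } finally {
      reader.close();
    }
    return links;
  }

  public static void main(String[] args) throws Exception {
    String xml = sampleXml(100_000);
    long t0 = System.nanoTime();
    int events = countEvents(xml);
    long t1 = System.nanoTime();
    List<Link> links = readLinks(xml);
    long t2 = System.nanoTime();
    System.out.printf("parse only: %d events in %.1f ms%n", events, (t1 - t0) / 1e6);
    System.out.printf("parse+bind: %d objects in %.1f ms%n", links.size(), (t2 - t1) / 1e6);
  }
}
```

If the second figure is far larger than the first in a real application, the binding code is the place to look, not the parser.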

我早已燃尽 2024-09-02 06:16:22

A classic case of the speed/memory tradeoff, in my humble opinion. There is not much you can do apart from trying SAX (or JDOM) as well and measuring again.
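
For anyone who wants to try the SAX route, here is a minimal sketch using the JDK's built-in SAX parser. The element name href is taken from the test code earlier in this thread; the class names SaxHrefSketch and HrefHandler are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHrefSketch {

  /** Collects the text content of every <href> element. */
  static final class HrefHandler extends DefaultHandler {
    final List<String> hrefs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <href>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      // The default factory is not namespace-aware, so match on qName.
      if ("href".equals(qName)) {
        current = new StringBuilder();
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      // SAX may deliver one text run in several chunks, so append.
      if (current != null) {
        current.append(ch, start, length);
      }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      if ("href".equals(qName) && current != null) {
        hrefs.add(current.toString());
        current = null;
      }
    }
  }

  /** Parses the given XML text and returns the href values in document order. */
  public static List<String> readHrefs(String xml) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    HrefHandler handler = new HrefHandler();
    parser.parse(new InputSource(new StringReader(xml)), handler);
    return handler.hrefs;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<inputs><input><href>a</href></input><input><href>b</href></input></inputs>";
    long start = System.nanoTime();
    List<String> hrefs = readHrefs(xml);
    long elapsed = System.nanoTime() - start;
    System.out.println(hrefs + " in " + elapsed / 1_000_000.0 + " ms");
  }
}
```

SAX pushes events into the handler, whereas StAX lets the caller pull them. Memory use is similar, so the choice mostly affects how the binding code is written.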

抱着落日 2024-09-02 06:16:22

Try creating an XML file of 2000 MB and then compare the numbers. I guess a DOM-based approach will work faster on smaller data. StAX (or any SAX-based approach) will be the option as the data gets larger.

(We deal with files of 3 GB or larger. With DOM, the application does not even start.)
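
A rough way to run that experiment is to generate a synthetic file of a chosen size and stream through it. This is a sketch under assumptions: the record layout, the class name LargeXmlSketch, and the 50 MB default are invented for illustration, and the target size should be raised to match the real data.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LargeXmlSketch {

  /** Writes records until the file reaches at least targetBytes; returns the record count. */
  public static long writeSample(Path file, long targetBytes) throws IOException {
    long records = 0;
    try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      String open = "<inputs>";
      w.write(open);
      long written = open.length();
      while (written < targetBytes) {
        String rec = "<input><href>http://example.com/" + records + "</href></input>";
        w.write(rec);
        written += rec.length(); // ASCII only, so chars equal bytes in UTF-8
        records++;
      }
      w.write("</inputs>");
    }
    return records;
  }

  /** Streams the file with StAX and counts <href> elements without building a tree. */
  public static long countHrefs(Path file) throws IOException, XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
      XMLStreamReader reader = factory.createXMLStreamReader(in);
      long count = 0;
      try {
        while (reader.hasNext()) {
          if (reader.next() == XMLStreamConstants.START_ELEMENT
              && "href".equals(reader.getLocalName())) {
            count++;
          }
        }
      } finally {
        reader.close();
      }
      return count;
    }
  }

  public static void main(String[] args) throws Exception {
    // Optional first argument: target size in megabytes (defaults to 50).
    long megabytes = args.length > 0 ? Long.parseLong(args[0]) : 50;
    Path file = Files.createTempFile("large-sample", ".xml");
    try {
      long records = writeSample(file, megabytes * 1024 * 1024);
      long start = System.nanoTime();
      long hrefs = countHrefs(file);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println(records + " records written, " + hrefs
          + " hrefs read in " + elapsedMs + " ms");
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

Running the same file through a DOM parser with a fixed heap size (for example via the JVM's -Xmx option) would show where the tree-based approach stops being workable.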
