Loading XML is very slow

Posted on 2024-09-14 17:09:42

I inherited a data store that used simple text files to save documents.

Documents had a few attributes (date, title, and text), which were encoded in the filename: <date>-<title>.txt, with the body of the file being the text.

In reality, however, documents in the system have many more attributes, and still more have been proposed.

It seemed logical to switch to an XML format, and I have done so, with each document now encoded in its own XML file.

However, reading the files in from XML is now RIDICULOUSLY slow! (Where 2000 articles in the .txt format took seconds, 2000 articles in the .xml format now take more than 10 minutes.)

I WAS using a DOM parser, and after I discovered how slow the reading was, I switched to a SAX parser; however, it's STILL just as slow (well, faster, but still 10 minutes).

Is XML JUST THAT slow, or am I doing something strange? Any thoughts would be appreciated.

The system is written in Java SE 1.6.
The parser is created like this:


/*
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
*/
  SAXParserFactory factory = SAXParserFactory.newInstance();
  SAXParser saxParser;
  try {
    saxParser = factory.newSAXParser();
    ArticleSaxHandler handler = new ArticleSaxHandler();
    saxParser.parse(is, handler);
    return handler.getArticle();
  } catch (ParserConfigurationException e) {
    throw new IOException(e);
  } catch (SAXException e) {
    throw new IOException(e);
  } finally { 
    if (is != null) {
      try {
        is.close();
      } catch (IOException e) {
        logger.error(e);
      }
    }
  }
}

private class ArticleSaxHandler extends DefaultHandler {
        private URI uri = null;
        private String source = null;
        private String author = null;
        private DateTime articleDatetime = null;
        private DateTime processedDatetime = null;
        private String title = null;
        private String text = null;
        private ArticleElement currentElement;
        private final StringBuilder builder = new StringBuilder();

        public Article getArticle() {
            return new Article(uri, source, author, articleDatetime, processedDatetime, title, text);
        }

        /** Receive notification of the start of an element. */
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (builder.length() != 0) {
                throw new RuntimeException(new SAXParseException(currentElement + " was not finished before " + qName + " was started", null));
            }
            currentElement = ArticleElement.getElement(qName);
        }

        public void endElement(String uri, String localName, String qName) {
            final String elementText = builder.toString();
            builder.delete(0, builder.length());
            if (currentElement == null) {
                return;
            }
            switch (currentElement) {
                case ARTICLE:
                    break;
                case URI:
                    try {
                        this.uri = new URI(elementText);
                    } catch (URISyntaxException e) {
                        throw new RuntimeException(e);
                    }
                    break;
                case SOURCE:
                    source = elementText;
                    break;
                case AUTHOR:
                    author = elementText;
                    break;
                case ARTICLE_DATE_TIME:
                    articleDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case PROCESSED_DATE_TIME:
                    processedDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case TITLE:
                    title = elementText;
                    break;
                case TEXT:
                    this.text = elementText;
                    break;
                default:
                    throw new IllegalStateException("Unexpected ArticleElement: " + currentElement);
            }
            currentElement = null;
        }

        /** Receive notification of character data inside an element. */
        public void characters(char[] ch, int start, int length) {
            builder.append(ch, start, length);
        }

        public void error(SAXParseException e) {
            fatalError(e);
        }

        public void fatalError(SAXParseException e) {
            logger.error("currentElement: " + currentElement + " ||builder: " + builder.toString() + "\n\n" + e.getMessage(), e);
        }
    }

    private enum ArticleElement {
        ARTICLE(ARTICLE_ELEMENT_NAME), URI(URI_ELEMENT_NAME), SOURCE(SOURCE_ELEMENT_NAME), AUTHOR(AUTHOR_ELEMENT_NAME), ARTICLE_DATE_TIME(
                ARTICLE_DATETIME_ELEMENT_NAME), PROCESSED_DATE_TIME(PROCESSED_DATETIME_ELEMENT_NAME), TITLE(TITLE_ELEMENT_NAME), TEXT(TEXT_ELEMENT_NAME);
        private String name;

        private ArticleElement(String name) {
            this.name = name;
        }

        public static ArticleElement getElement(String qName) {
            for (ArticleElement element : ArticleElement.values()) {
                if (element.name.equals(qName)) {
                    return element;
                }
            }
            return null;
        }
    }

Comments (2)

笔芯 2024-09-21 17:09:42

Reading data from an unbuffered stream could explain these performance problems. This is not directly related to the change from text to XML, but perhaps your new implementation happens not to use a BufferedInputStream anymore.

Following that path, check in detail whether the is stream passed in this call is buffered:

saxParser.parse(is, handler);
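
A minimal sketch of that check, assuming the stream is opened elsewhere and handed to the parsing method: wrap it in a BufferedInputStream unless it already is one. The class name BufferedSaxExample, the TextHandler stand-in for the question's ArticleSaxHandler, and the parseString helper are illustrative names, not part of the original post.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BufferedSaxExample {

    /** Collects all character data; a stand-in for the question's ArticleSaxHandler. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the stream, adding buffering if the caller did not provide it. */
    static String parseBuffered(InputStream raw) throws IOException {
        InputStream is = (raw instanceof BufferedInputStream)
                ? raw
                : new BufferedInputStream(raw);
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(is, handler);
            return handler.text.toString();
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        } catch (SAXException e) {
            throw new IOException(e);
        } finally {
            is.close(); // closing the wrapper also closes the underlying stream
        }
    }

    /** Convenience for quick experiments: parse an in-memory document. */
    static String parseString(String xml) {
        try {
            return parseBuffered(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseString("<article><title>Hello</title></article>"));
    }
}
```

For files on disk, the same method would typically be called as parseBuffered(new FileInputStream(file)). Whether buffering is the actual bottleneck here is something only a profiler or a timing run can confirm.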

梦里兽 2024-09-21 17:09:42

I ran into this problem too, with slow loading using a SAX parser. The issue was actually related to my XML file, which had a DTD reference from the W3C:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" xml:lang="en"
      lang="en">

An excerpt from Chapter 2 of "Core Java, Volume II" about SAX and XML describes what's going on and how to address it:

An XHTML file starts with a tag that contains a DTD reference, and
the parser will want to load it. Understandably, the W3C isn’t too
happy to serve billions of copies of files such as
www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. At one point, they refused
altogether, but at the time of this writing, they serve the DTD at a
glacial pace. If you don’t need to validate the document, just call

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

This fixed it for me. In addition, I used IntelliJ IDE to show that my XML file had an extra (unnecessary) <HTML> tag and an extra <meta charset="UTF-8"/>. That helped rid me of some SAX exceptions.
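
A slightly fuller sketch of the same idea, under a few assumptions: the feature URI is specific to Xerces-based parsers (including the one bundled with the JDK), so other parsers may reject it. The resolveEntity override is an addition of mine, not from the book excerpt; it is a parser-independent fallback that hands back an empty DTD instead of fetching one. The class and method names, and the example.invalid URL, are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalDtdExample {

    // Xerces-specific feature; an assumption that the active parser recognizes it.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Counts elements and never resolves external entities over the network. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            elements++;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Fallback: supply an empty DTD rather than downloading the real one.
            return new InputSource(new StringReader(""));
        }
    }

    /** Parses the document without fetching its DTD and returns the element count. */
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            SAXParser parser = factory.newSAXParser();
            CountingHandler handler = new CountingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.elements;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://example.invalid/xhtml1-strict.dtd\">"
                + "<html><body><p>hi</p></body></html>";
        System.out.println(countElements(xml));
    }
}
```

Note that skipping the DTD also means entities it defines (such as &nbsp; in XHTML) are no longer declared, so documents that rely on them may fail to parse.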
