Loading XML is very slow
I inherited a data store that used simple text files to save documents.
Documents had a few attributes (date, title, and text), which were encoded in the filename, <date>-<title>.txt, with the body of the file being the text.
In reality, however, documents in the system have many more attributes, and still more have been proposed.
It seemed logical to switch to an XML format, and I have done so, with each document now encoded in its own XML file.
However, reading the files in from XML is now ridiculously slow! (Where 2000 articles in the .txt format took seconds, 2000 articles in the .xml format now take more than 10 minutes.)
I was using a DOM parser, and after I discovered how slow the reading was, I switched to a SAX parser, but it is still nearly as slow (well, faster, but it still takes 10 minutes).
Is XML just that slow, or am I doing something strange? Any thoughts would be appreciated.
The system is written in JavaSE 1.6.
The parser is created like this:
    /*
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    */
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser;
        try {
            saxParser = factory.newSAXParser();
            ArticleSaxHandler handler = new ArticleSaxHandler();
            saxParser.parse(is, handler);
            return handler.getArticle();
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        } catch (SAXException e) {
            throw new IOException(e);
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    logger.error(e);
                }
            }
        }
    }

    private class ArticleSaxHandler extends DefaultHandler {
        private URI uri = null;
        private String source = null;
        private String author = null;
        private DateTime articleDatetime = null;
        private DateTime processedDatetime = null;
        private String title = null;
        private String text = null;
        private ArticleElement currentElement;
        private final StringBuilder builder = new StringBuilder();

        public Article getArticle() {
            return new Article(uri, source, author, articleDatetime, processedDatetime, title, text);
        }

        /** Receive notification of the start of an element. */
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (builder.length() != 0) {
                throw new RuntimeException(new SAXParseException(
                        currentElement + " was not finished before " + qName + " was started", null));
            }
            currentElement = ArticleElement.getElement(qName);
        }

        public void endElement(String uri, String localName, String qName) {
            final String elementText = builder.toString();
            builder.delete(0, builder.length());
            if (currentElement == null) {
                return;
            }
            switch (currentElement) {
                case ARTICLE:
                    break;
                case URI:
                    try {
                        this.uri = new URI(elementText);
                    } catch (URISyntaxException e) {
                        throw new RuntimeException(e);
                    }
                    break;
                case SOURCE:
                    source = elementText;
                    break;
                case AUTHOR:
                    author = elementText;
                    break;
                case ARTICLE_DATE_TIME:
                    articleDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case PROCESSED_DATE_TIME:
                    processedDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case TITLE:
                    title = elementText;
                    break;
                case TEXT:
                    this.text = elementText;
                    break;
                default:
                    throw new IllegalStateException("Unexpected ArticleElement: " + currentElement);
            }
            currentElement = null;
        }

        /** Receive notification of character data inside an element. */
        public void characters(char[] ch, int start, int length) {
            builder.append(ch, start, length);
        }

        public void error(SAXParseException e) {
            fatalError(e);
        }

        public void fatalError(SAXParseException e) {
            logger.error("currentElement: " + currentElement + " ||builder: " + builder.toString()
                    + "\n\n" + e.getMessage(), e);
        }
    }

    private enum ArticleElement {
        ARTICLE(ARTICLE_ELEMENT_NAME),
        URI(URI_ELEMENT_NAME),
        SOURCE(SOURCE_ELEMENT_NAME),
        AUTHOR(AUTHOR_ELEMENT_NAME),
        ARTICLE_DATE_TIME(ARTICLE_DATETIME_ELEMENT_NAME),
        PROCESSED_DATE_TIME(PROCESSED_DATETIME_ELEMENT_NAME),
        TITLE(TITLE_ELEMENT_NAME),
        TEXT(TEXT_ELEMENT_NAME);

        private String name;

        private ArticleElement(String name) {
            this.name = name;
        }

        public static ArticleElement getElement(String qName) {
            for (ArticleElement element : ArticleElement.values()) {
                if (element.name.equals(qName)) {
                    return element;
                }
            }
            return null;
        }
    }
Comments (2)
Reading data from an unbuffered stream could explain these performance problems. This is not directly related to the change from text to XML, but maybe by chance your new implementation no longer uses a BufferedInputStream. Following that path, check in detail whether the is passed to saxParser.parse(is, handler) is buffered.

I ran into this problem too, with slow loading using a SAX parser. The issue was actually related to my XML file, which has a DTD reference from the W3C.
An excerpt from Chapter 2 of "Core Java, Volume II" about SAX and XML describes what's going on and how to address it.
Following it fixed the problem for me. In addition, I used the IntelliJ IDE to find that my XML file had an extra (unnecessary) <HTML> tag and an extra <meta charset="UTF-8"/>. That helped rid me of some SAX exceptions.
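
The two suggestions in the comments can be sketched together in Java. This is a minimal illustration, not the asker's code and not the fix quoted from the book: FastArticleReader, TitleHandler and readTitle are hypothetical names, the sample document and its DTD URL are made up, and the code uses Java 7+ syntax (try-with-resources, multi-catch) although the question targets JavaSE 1.6. It wraps the incoming stream in a BufferedInputStream and overrides resolveEntity so that the parser receives an empty DTD instead of downloading the one named in the DOCTYPE.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class FastArticleReader {

    /** Hypothetical handler: keeps the text of <title> and never fetches a DTD. */
    static final class TitleHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private String title;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Assumption: the documents do not rely on anything declared in the DTD
            // (for example named entities), so an empty replacement is acceptable.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                title = text.toString();
            }
        }
    }

    /** Parses the stream through a buffer and returns the title, or null if absent. */
    static String readTitle(InputStream raw) {
        // Buffering costs nothing if the stream is already buffered and avoids
        // many small reads if it is, say, a bare FileInputStream.
        try (InputStream in = new BufferedInputStream(raw)) {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.title;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse article", e);
        }
    }

    public static void main(String[] args) {
        // Made-up document whose DOCTYPE points at a host that does not exist.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE article SYSTEM \"http://invalid.invalid/article.dtd\">\n"
                + "<article><title>Hello</title><text>Body</text></article>";
        InputStream raw = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTitle(raw));
    }
}
```

If the files do depend on declarations from the DTD, an alternative is to return a local copy of the DTD from resolveEntity rather than an empty source; either way, timing a run before and after each change shows which of the two causes actually applies.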