Xerces DOM parser incredibly slow?
Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:
public class HeaderBasedNewsProvider implements INewsProvider {
    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        // Let JTidy clean the raw HTML and buffer its output in memory
        Tidy tidy = new Tidy();
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        // Feed the cleaned markup to the JAXP DOM parser
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}
However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>
Note that - in contrast to the DOM parser - JTidy finishes its work within a second. Therefore, I suspect that I'm somehow misusing the DOM API.
Thanks in advance for any suggestions on this one!

Even when not validating, an XML parser needs to fetch the DTD, for example to support named character entities. That fetch is the likely source of the delay here. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.
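A minimal sketch of that idea, not a definitive implementation: the class name LocalDtdExample, the helper localResolver and the in-memory map keyed by public ID are all made up for illustration. In a real project the DTD text would more likely come from a file or classpath resource.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class LocalDtdExample {

    // Hypothetical helper: serves DTD text from a local map keyed by public ID.
    static EntityResolver localResolver(Map<String, String> dtdByPublicId) {
        return (publicId, systemId) -> {
            String dtd = publicId == null ? null : dtdByPublicId.get(publicId);
            if (dtd == null) {
                // Returning null lets the parser fall back to its default
                // resolution, which may go to the network.
                return null;
            }
            InputSource source = new InputSource(new StringReader(dtd));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        };
    }

    // Parses the given markup with the resolver installed on the DocumentBuilder.
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In getCleanedDocument() the equivalent change would be to keep the DocumentBuilder in a variable and call setEntityResolver on it before parse. Note that the local copy has to declare any named entities the document uses (such as &nbsp;), otherwise the parse fails on the first undeclared entity.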
HTML DTDs are huge and pull in further files via includes, so fetching them takes forever. Use an XML catalog: it lets you store the DTDs locally and map them by their system ID.
If you use a build tool such as Maven, you will find plenty of pointers.
The advantage over intercepting entities, as the accepted answer suggests, is that you receive the correct characters.
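As a rough sketch of the catalog approach, assuming Java 9 or later where the javax.xml.catalog API is available: the class name CatalogParseExample and the catalog layout in the comment are illustrative assumptions, not something taken from the question.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class CatalogParseExample {

    // Assumed catalog file (OASIS XML Catalog format), with the DTD stored next to it:
    //
    //   <?xml version="1.0"?>
    //   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    //     <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    //             uri="xhtml1-transitional.dtd"/>
    //   </catalog>
    //
    // catalogFile must be an absolute URI pointing at that file.
    static Document parse(String xml, URI catalogFile) throws Exception {
        // CatalogResolver implements org.xml.sax.EntityResolver, so it can be
        // plugged straight into the DocumentBuilder.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With the default catalog features, a system ID that has no entry in the catalog makes the resolver throw instead of silently going to the network, so a missing mapping shows up immediately. Any files the DTD itself includes need their own catalog entries or local copies as well.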