Xerces DOM 解析器非常慢?

发布于 2024-12-13 10:38:44 字数 1927 浏览 1 评论 0原文

目前,我正在尝试使用 JTidy 清理 HTML 文件,将其转换为 XHTML 并将结果提供给 DOM 解析器。以下代码是这些努力的结果:

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
            Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}

但是,我的系统上的 DOM 解析器实现 (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) 似乎非常慢。即使对于如下所示的单行文档,解析也需要 2-3 分钟:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>

请注意,与 DOM 解析器相比,JTidy 会在一秒钟内完成其工作。因此,我怀疑我在某种程度上滥用了 DOM API。

预先感谢您对此提出的任何建议!

Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
            Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}

However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>

Note that - in contrast to the DOM parser - JTidy finishes its work within a second. Therefore, I suspect that I'm somehow misusing the DOM API.

Thanks in advance for any suggestions on this one!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

甜警司 2024-12-20 10:38:44

即使不进行验证,XML 解析器也需要获取 DTD,例如支持命名字符实体。您应该考虑实现 EntityResolver它将 DTD 请求解析为本地副本。

Even when not validating, a XML parser needs to fetch the DTD, for example to support named character entities. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.

久夏青 2024-12-20 10:38:44

HTML dtd 很大,使用包含。他们需要永远。使用XML 目录。可以在本地存储 dtd,并通过系统 ID 映射它们。

如果你使用像maven这样的工具,你会找到足够的指针。

正如已接受的答案所示,拦截实体的优点是您收到正确的字符。

HTML dtd's are huge, using includes. They take forever. Use an XML catalog. There one can store the dtds locally and map them by their system ID.

If you use a tool, like maven, you will find sufficient pointers.

The advantage i.o. intercepting entities as the accepted answer suggests, is that you receive the correct characters.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文