如果输入文件中未指定 DTD,如何强制 SAX 解析器使用 DTD?

发布于 2024-08-24 03:07:29 字数 578 浏览 8 评论 0原文

如何强制 SAX 解析器(特别是 Java 中的 Xerces)在解析文档时使用 DTD,而输入文档中没有任何 doctype?这可能吗?

以下是我的场景的更多详细信息:

我们有一堆符合相同 DTD 的 XML 文档,这些文档是由多个不同系统生成的(我无法更改其中任何一个)。其中一些系统将文档类型添加到其输出文档中,而其他系统则没有。有些使用命名字符实体,有些则不使用。 有些使用命名字符实体而不声明文档类型。我知道这不合规矩,但这是我必须使用的。

我正在开发需要用 Java 解析这些文件的系统。目前,它处理上述情况的方法是,首先以流的形式读取 XML 文档,尝试检测它是否定义了 doctype,然后添加 doctype 声明(如果尚不存在)。问题是这段代码有错误,我想用更干净的东西替换它。

文件很大,因此我无法使用基于 DOM 的解决方案。我还尝试解析字符实体,因此使用 XML 架构没有帮助

如果您有解决方案,请直接发布而不是链接到它吗?如果将来有一个带有死链接的正确解决方案,这对 Stack Overflow 没有多大好处。

How can I force a SAX parser (specifically, Xerces in Java) to use a DTD when parsing a document without having any doctype in the input document? Is this even possible?

Here are some more details of my scenario:

We have a bunch of XML documents that conform to the same DTD that are generated by multiple different systems (none of which I can change). Some of these systems add a doctype to their output documents, others do not. Some use named character entities, some do not. Some use named character entities without declaring a doctype. I know that's not kosher, but it's what I have to work with.

I'm working on system that needs to parse these files in Java. Currently, it's handling the above cases by first reading in the XML document as a stream, attempting to detect if it has a doctype defined, and adding a doctype declaration if one isn't already present. The problem is that this code is buggy, and I'd like to replace it with something cleaner.

The files are large, so I can't use a DOM-based solution. I'm also trying get character entities resolved, so it doesn't help to use an XML Schema.

If you have a solution, could you please post it directly instead of linking to it? It doesn't do Stack Overflow much good if in a the future there's a correct solution with a dead link.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

欲拥i 2024-08-31 03:07:29

我认为如果文档没有 DOCTYPE,那么设置 DOCTYPE 并不是一种明智的方法。可能的解决方案是写一个假的,就像你已经做的那样。如果您使用 SAX,则可以使用这个假的 InputStream 和假的 DefaultHandler 实现。 (仅适用于 latin1 一字节编码)

我知道这个解决方案也很丑陋,但它只有一个适用于大数据流。

这是一些代码。

private enum State {readXmlDec, readXmlDecEnd, writeFakeDoctipe,  writeEnd};

private class MyInputStream extends InputStream{

    private final InputStream is;
    private StringBuilder sb = new StringBuilder();
    private int pos = 0;
    private String doctype = "<!DOCTYPE register SYSTEM \"fake.dtd\">";
    private State state = State.readXmlDec;

    private MyInputStream(InputStream source) {
        is = source;
    }
    @Override
    public int read() throws IOException {
        int bit;

        switch (state){ 
            case readXmlDec:
                bit = is.read();
                sb.append(Character.toChars(bit));
                if(sb.toString().equals("<?xml")){
                    state = State.readXmlDecEnd;
                }
                break;
            case readXmlDecEnd:
                bit = is.read();
                if(Character.toChars(bit)[0] == '>'){
                    state = State.writeFakeDoctipe;
                }
                break;
            case writeFakeDoctipe:
                bit =  doctype.charAt(pos++);
                if(doctype.length() == pos){
                    state = State.writeEnd;
                }
                break;
            default:
                bit = is.read();
                break;
        }
        return bit;
    }

    @Override
    public void close() throws IOException {
        super.close();
        is.close();
    }
}

private static class MyHandler extends DefaultHandler {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
        System.out.println("resolve "+ systemId);
        // get real dtd
        InputStream is = ClassLoader.class.getResourceAsStream("/register.dtd");
        return new InputSource(is);
    }

 ... // rest of code
}

I think it is no sane way to set DOCTYPE, if document hasn't one. Possible solution is write fake one, as you already do. If you're using SAX, you can use this fake InputStream and fake DefaultHandler implementation. (will work only for latin1 one-byte encoding)

I know this solution also ugly, but it only one works well with big data streams.

Here is some code.

private enum State {readXmlDec, readXmlDecEnd, writeFakeDoctipe,  writeEnd};

private class MyInputStream extends InputStream{

    private final InputStream is;
    private StringBuilder sb = new StringBuilder();
    private int pos = 0;
    private String doctype = "<!DOCTYPE register SYSTEM \"fake.dtd\">";
    private State state = State.readXmlDec;

    private MyInputStream(InputStream source) {
        is = source;
    }
    @Override
    public int read() throws IOException {
        int bit;

        switch (state){ 
            case readXmlDec:
                bit = is.read();
                sb.append(Character.toChars(bit));
                if(sb.toString().equals("<?xml")){
                    state = State.readXmlDecEnd;
                }
                break;
            case readXmlDecEnd:
                bit = is.read();
                if(Character.toChars(bit)[0] == '>'){
                    state = State.writeFakeDoctipe;
                }
                break;
            case writeFakeDoctipe:
                bit =  doctype.charAt(pos++);
                if(doctype.length() == pos){
                    state = State.writeEnd;
                }
                break;
            default:
                bit = is.read();
                break;
        }
        return bit;
    }

    @Override
    public void close() throws IOException {
        super.close();
        is.close();
    }
}

private static class MyHandler extends DefaultHandler {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
        System.out.println("resolve "+ systemId);
        // get real dtd
        InputStream is = ClassLoader.class.getResourceAsStream("/register.dtd");
        return new InputSource(is);
    }

 ... // rest of code
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文