如何跳过 java DOM 解析器的格式良好

发布于 2024-10-31 15:45:02 字数 1826 浏览 4 评论 0原文

我知道这个问题已经被问过多次了，但我有一个不同的问题来处理它。就我而言，应用程序接收到作为字符串传递的格式不正确的 dom 结构。这是一个示例：

<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>

如您所见，内容格式不正确。现在，如果我尝试使用普通的 SAX 或 DOM 解析进行解析，它将抛出一个可以理解的异常。

org.xml.sax.SAXParseException：对实体“feature”的引用必须以“;”结尾

根据要求，我需要阅读此文档，添加一些额外的 div 标签并将内容作为字符串发送回来。通过使用 DOM 解析器，这非常有效，因为我可以读取输入结构并在所需位置添加其他标签。

我尝试使用 JTidy 等工具进行预处理，然后进行解析，但这会导致将文档转换为完整的 html，这是我不想要的。这是一个示例代码：


StringWriter writer = new StringWriter();
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);
tidy.parse(new ByteArrayInputStream(content.getBytes()), writer);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(writer.toString().getBytes()));
// Traverse thru the content and add new tags
....
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);

这将输入完全转换为格式良好的 html 文档。这样手动删除 html 标签就变得很困难。我尝试的另一个选择是使用 SAX2DOM，它也会创建 HTML 文档。这是示例代码。


ByteArrayInputStream is = new ByteArrayInputStream(content.getBytes());     
Parser p = new Parser();
p.setFeature(IContentExtractionConstant.SAX_NAMESPACE,true);
SAX2DOM sax2dom = new SAX2DOM();
p.setContentHandler(sax2dom);
p.parse(new InputSource(is));
Document doc = (Document)sax2dom.getDOM();

如果有人可以分享他们的想法，我将不胜感激。

谢谢

原文

I know this has been asked multiple times here, but I've a different issue dealing with it. In my case, the app receives a non well-formed dom structure passed as a string. Here's a sample :

<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>

As you can see, the content is not well-formed. Now, if I try to parse using a normal SAX or DOM parse it'll throw an exception which is understood.

org.xml.sax.SAXParseException: The reference to entity "feature" must end with the ';' delimiter.

As per the requirement, I need to read this document,add few additional div tags and send the content back as a string. This works great by using a DOM parser as I can read through the input structure and add additional tags at their required position.

I tried using tools like JTidy to do a pre-processing and then parse, but that results in converting the document to a fully-blown html, which I don't want. Here's a sample code :


StringWriter writer = new StringWriter();
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);
tidy.parse(new ByteArrayInputStream(content.getBytes()), writer);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(writer.toString().getBytes()));
// Traverse thru the content and add new tags
....
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);

This completely converts the input to a well-formed html document. It then becomes hard to remove html tags manually. The other option I tried was to use SAX2DOM, which too creates a HTML doc. Here's a sample code .


ByteArrayInputStream is = new ByteArrayInputStream(content.getBytes());     
Parser p = new Parser();
p.setFeature(IContentExtractionConstant.SAX_NAMESPACE,true);
SAX2DOM sax2dom = new SAX2DOM();
p.setContentHandler(sax2dom);
p.parse(new InputSource(is));
Document doc = (Document)sax2dom.getDOM();

I'll appreciate if someone can share their ideas.

Thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

余生共白头 2024-11-07 15:45:02

最简单的方法是将 xml 保留字符替换为相应的 xml 实体。您可以手动执行此操作：

content.replaceAll("&", "&");

如果您不想在解析字符串之前修改字符串，我可以建议您使用 SaxParser 的另一种方法，但此解决方案更为复杂。基本上你必须：

编写一个 LexicalHandler
与 ContentHandler 组合
告诉解析器继续它的
致命错误后执行（
ErrorHandler 还不够）
将未声明的实体视为简单
文本

更新
根据您的评论，我将添加有关第二个解决方案的一些详细信息。我编写了一个扩展 DefaulHandler 的类（EntityResolver、DTDHandler、ContentHandler 和 的默认实现ErrorHandler）并实现 LexicalHandler。我扩展了 ErrorHandler 的 fatalError 方法（我的实现不执行任何操作，而是抛出异常）和 ContentHandler 的 characters< /code> 方法与 LexicalHandler 的 startEntity 方法结合使用。

public class MyHandler extends DefaultHandler implements LexicalHandler {

    private String currentEntity = null;

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String content = new String(ch, start, length);
        if (currentEntity != null) {
            content = "&" + currentEntity + content;
            currentEntity = null;
        }
        System.out.print(content);
    }

    @Override
    public void startEntity(String name) throws SAXException {
        currentEntity = name;
    }

    @Override
    public void endEntity(String name) throws SAXException {
    }

    @Override
    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
    }

    @Override
    public void endDTD() throws SAXException {
    }

    @Override
    public void startCDATA() throws SAXException {
    }

    @Override
    public void endCDATA() throws SAXException {
    }

    @Override
    public void comment(char[] ch, int start, int length) throws SAXException {
    }
}

这是我的主要内容，它解析格式不正确的 xml。 setFeature 非常重要，因为如果没有它，解析器就会抛出 SaxParseException，尽管 ErrorHandler 实现为空。

public static void main(String[] args) throws ParserConfigurationException,
        SAXException, IOException {
    String xml = "<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>";
    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();
    MyHandler myHandler = new MyHandler();
    xmlReader.setContentHandler(myHandler);
    xmlReader.setErrorHandler(myHandler);
    xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler",
            myHandler);
    xmlReader.setFeature(
            "http://apache.org/xml/features/continue-after-fatal-error",
            true);
    xmlReader.parse(new InputSource(new StringReader(xml)));
}

这个 main 打印出包含错误的 div 元素的内容：

http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata

请记住，这是一个适用于您的输入的示例，也许您必须完成它......例如，如果您正确地转义了一些字符应该添加一些代码行来处理这种情况等。

希望这会有所帮助。

The simplest way is replacing xml reserved characters with the corresponding xml entities. You can do this manually:

content.replaceAll("&", "&");

If you don't want to modify your string before parsing it, I could propose you another way using SaxParser, but this solution is more complicated. Basically you have to:

write a LexicalHandler in
combination with ContentHandler
tell the parser to continue its
execution after fatal error (the
ErrorHandler isn't enough)
treat undeclared entities as simple
text

UPDATE
According to your comment, I'm going to add some details regarding the second solution. I've writed a class which extends DefaulHandler (default implementation of EntityResolver, DTDHandler, ContentHandler and ErrorHandler) and implements LexicalHandler. I've extended ErrorHandler's fatalError method (my implementations does nothing instead of throwing the exception) and ContentHandler's characters method which works in combination with startEntity method of LexicalHandler.

public class MyHandler extends DefaultHandler implements LexicalHandler {

    private String currentEntity = null;

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String content = new String(ch, start, length);
        if (currentEntity != null) {
            content = "&" + currentEntity + content;
            currentEntity = null;
        }
        System.out.print(content);
    }

    @Override
    public void startEntity(String name) throws SAXException {
        currentEntity = name;
    }

    @Override
    public void endEntity(String name) throws SAXException {
    }

    @Override
    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
    }

    @Override
    public void endDTD() throws SAXException {
    }

    @Override
    public void startCDATA() throws SAXException {
    }

    @Override
    public void endCDATA() throws SAXException {
    }

    @Override
    public void comment(char[] ch, int start, int length) throws SAXException {
    }
}

This is my main which parses your xml not well formed. It's very important the setFeature, because without it the parser throws the SaxParseException despite of the ErrorHandler empty implementation.

public static void main(String[] args) throws ParserConfigurationException,
        SAXException, IOException {
    String xml = "<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>";
    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();
    MyHandler myHandler = new MyHandler();
    xmlReader.setContentHandler(myHandler);
    xmlReader.setErrorHandler(myHandler);
    xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler",
            myHandler);
    xmlReader.setFeature(
            "http://apache.org/xml/features/continue-after-fatal-error",
            true);
    xmlReader.parse(new InputSource(new StringReader(xml)));
}

This main prints out the content of your div element which contains the error:

http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata

Keep in mind that this is an example which works with your input, maybe you'll have to complete it...for instance if you have some characters correctly escaped you should add some lines of code to handle this situation etc.

Hope this helps.

回复收藏 0 原文

~没有更多了~