How to skip the well-formedness check in the Java DOM parser
I know this has been asked multiple times here, but my case is a bit different. The app receives a DOM structure that is not well-formed, passed as a string. Here's a sample:
<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>
As you can see, the content is not well-formed: the & in the URL's query string is not escaped as &amp;. If I try to parse it with a normal SAX or DOM parser, it throws an exception, which is understandable:
org.xml.sax.SAXParseException: The reference to entity "feature" must end with the ';' delimiter.
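The failure can be reproduced with nothing but the JDK's built-in parser. The following is a minimal sketch; the class and method names (`ReproduceParseError`, `tryParse`) are made up for illustration and are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, used only to reproduce the error from the question.
class ReproduceParseError {
    static final String SAMPLE = "<div class='video yt'><div class='yt_url'>"
            + "http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>";

    /** Returns the parser's exception, or null if the input parsed cleanly. */
    static SAXParseException tryParse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e; // well-formedness violations end up here
        } catch (Exception other) {
            // Configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException(other);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = tryParse(SAMPLE);
        System.out.println(e == null
                ? "parsed without errors"
                : "line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }
}
```

The wording of the message depends on the JVM's locale, so it is safer to rely on the exception type than on the text.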
As per the requirement, I need to read this document, add a few additional div tags and send the content back as a string. A DOM parser works great for this, because I can walk the input structure and add tags at the required positions.
I tried pre-processing the input with tools like JTidy and then parsing it, but that converts the fragment into a full-blown HTML document, which I don't want. Here's the sample code:
StringWriter writer = new StringWriter();
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);
tidy.parse(new ByteArrayInputStream(content.getBytes()), writer);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(writer.toString().getBytes()));
// Traverse through the content and add new tags
....
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
This converts the input into a complete, well-formed HTML document, and it then becomes hard to strip the extra HTML tags manually. The other option I tried was SAX2DOM, which also produces an HTML document. Here's the sample code:
ByteArrayInputStream is = new ByteArrayInputStream(content.getBytes());
Parser p = new Parser();
p.setFeature(IContentExtractionConstant.SAX_NAMESPACE,true);
SAX2DOM sax2dom = new SAX2DOM();
p.setContentHandler(sax2dom);
p.parse(new InputSource(is));
Document doc = (Document)sax2dom.getDOM();
I'd appreciate it if someone could share their ideas.
Thanks
Comments (1)
The simplest way is to replace the XML reserved characters with the corresponding XML entities. You can do this manually.
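One way to do that replacement, as a sketch rather than the answer's original code: escape every `&` that does not already start one of the five predefined XML entities or a numeric character reference, then parse with the regular DOM parser. The names `AmpersandFix`, `escapeBareAmpersands` and `addDiv`, and the `extra` class value, are hypothetical. It also assumes that an HTML entity such as `&nbsp;` should be treated as plain text, since a bare XML parser would reject it anyway.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; not the code from the original answer.
class AmpersandFix {
    // An '&' that is NOT followed by a predefined entity name or a character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    /** Parses the repaired fragment, appends an empty div to the root, and serializes it. */
    static String addDiv(String fragment) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(escapeBareAmpersands(fragment))));

            Element extra = doc.createElement("div");
            extra.setAttribute("class", "extra"); // placeholder for the tags the app needs to add
            doc.getDocumentElement().appendChild(extra);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process fragment", e);
        }
    }
}
```

Because only the fragment itself is parsed, no html, head or body wrapper is added. The serializer writes the ampersand back as `&amp;`, so the returned string is well-formed; if the consumer needs the raw `&` again, that has to be undone on the way out.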
If you don't want to modify your string before parsing it, I can propose another way using SAXParser, but this solution is more complicated. Basically you have to: use a LexicalHandler in combination with a ContentHandler; make the parser continue execution after a fatal error (the ErrorHandler alone isn't enough); and treat the unresolved entity as text.
UPDATE
According to your comment, I'm going to add some details regarding the second solution. I've written a class which extends DefaultHandler (the default implementation of EntityResolver, DTDHandler, ContentHandler and ErrorHandler) and implements LexicalHandler. I've overridden ErrorHandler's fatalError method (my implementation does nothing instead of throwing the exception) and ContentHandler's characters method, which works in combination with the startEntity method of LexicalHandler.
This is my main, which parses your not well-formed XML. The setFeature call is very important, because without it the parser throws the SAXParseException despite the empty ErrorHandler implementation.
This main prints out the content of your div element that contains the error.
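The description above can be sketched roughly as follows. This is a reconstruction under assumptions, not the answer's original code: the names `LenientTextHandler` and `parseLeniently` are hypothetical, the feature URI is the Xerces `continue-after-fatal-error` feature (which the JDK's bundled parser is assumed to recognize), and `skippedEntity` is overridden as well because some parser versions may report an undeclared entity there instead of through `startEntity`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text and re-emits an unresolved entity name as "&name".
class LenientTextHandler extends DefaultHandler implements LexicalHandler {
    private final StringBuilder text = new StringBuilder();
    private String pendingEntity; // entity name reported by the parser, not yet written out

    @Override public void fatalError(SAXParseException e) { /* do nothing, so parsing can go on */ }

    @Override public void startEntity(String name) { pendingEntity = name; }

    // Assumption: an undeclared entity may be reported here rather than via startEntity.
    @Override public void skippedEntity(String name) { pendingEntity = name; }

    @Override public void characters(char[] ch, int start, int length) {
        if (pendingEntity != null) {
            text.append('&').append(pendingEntity); // put the "entity" back as plain text
            pendingEntity = null;
        }
        text.append(ch, start, length);
    }

    // The remaining LexicalHandler callbacks are not needed for this sketch.
    @Override public void endEntity(String name) { }
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }
    @Override public void comment(char[] ch, int start, int length) { }

    String text() { return text.toString(); }

    /** Wires the handler into a SAX parser that is told to keep going after fatal errors. */
    static String parseLeniently(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        LenientTextHandler handler = new LenientTextHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.text();
    }
}
```

What a parser reports after a fatal error is implementation-dependent, so the result of `parseLeniently` should be checked on the JDK or Xerces version actually in use. The handler also does not distinguish declared entities from undeclared ones.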
Keep in mind that this is an example that works with your input; you may have to complete it. For instance, if some characters are already correctly escaped, you should add a few lines of code to handle that case.
Hope this helps.