I need to parse non-well-formed XML data (HTML)
I have some non-well-formed XML (HTML) data in Java. I tried the JAXP DOM parser, but it complains about the input.
The question is: is there any way to use JAXP to parse such documents?
I have a file containing data such as:
<employee>
<name value="ahmed" > <!-- note: this element is not closed, so this is not well-formed XML -->
</employee>
Comments (3)
You could try running your document through the JTidy API first. It can convert HTML into valid XHTML: http://jtidy.sourceforge.net/howto.html
You could use TagSoup. I have used it with great success. It is completely compatible with the Java XML APIs, including SAX, DOM, XSLT, and StAX. For example, here is how I used it to apply XSLT transforms to particularly poor HTML:
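The poster's own snippet is not reproduced in this thread. As a minimal sketch of the general idea (not the poster's code), the helper below feeds any SAX `XMLReader` into a JAXP identity transform to obtain a DOM. TagSoup's `org.ccil.cowan.tagsoup.Parser` implements `XMLReader`, so it could be passed in place of the JDK reader; the class name `SaxToDom` and the method `toDom` are hypothetical names chosen for illustration, and the demo uses the JDK's strict reader so that it runs without extra libraries.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM from whatever SAX events the given reader emits.
final class SaxToDom {

    static Document toDom(XMLReader reader, String markup) throws Exception {
        // The reader decides how lenient parsing is; JAXP only consumes its SAX events.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        DOMResult result = new DOMResult();
        // A Transformer created without a stylesheet performs an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(source, result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Demo with the JDK's strict reader, which requires well-formed input.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader strictReader = factory.newSAXParser().getXMLReader();
        Document doc = toDom(strictReader, "<employee><name value=\"ahmed\"/></employee>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

With the TagSoup jar on the classpath, the call would presumably become `toDom(new org.ccil.cowan.tagsoup.Parser(), brokenMarkup)`. To apply a real stylesheet instead of the identity transform, `newTransformer(new StreamSource(...))` can be used in the same place.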
Not really. JAXP wants well-formed markup. Have you considered the Cyberneko HTML Parser? We've been very successful with it at our shop.
EDIT: I see you want to parse XML too. Hmm... Cyberneko works well for HTML, but I don't know about other markup. It has a tag balancer that would close some tags off, but I don't know whether you can train it to recognize tags that are not HTML.
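To illustrate the point that JAXP wants well-formed markup, here is a small sketch using only the JDK's built-in parser. It tries to parse the sample from the question and reports whether the parser accepted it. The class name `StrictJaxpDemo` and the method `isWellFormed` are hypothetical names used for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: checks whether the stock JAXP DOM parser accepts a piece of markup.
final class StrictJaxpDemo {

    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            // Raised for well-formedness violations such as an unclosed element.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String broken = "<employee><name value=\"ahmed\" ></employee>";
        String fixed = "<employee><name value=\"ahmed\" /></employee>";
        System.out.println("broken accepted: " + isWellFormed(broken));
        System.out.println("fixed accepted: " + isWellFormed(fixed));
    }
}
```

The unclosed `name` element makes the parser stop with a fatal error, which is why a lenient front end (or a cleanup step before parsing) is needed for input like this.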