Java library for cleaning up HTML like a browser does
So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.
Doesn't sound so tough, right? After all, every browser on the market deals with malformed HTML and turns it into something renderable on nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem... for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?
One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on solving the problem of converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently fail to capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.
I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job... at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.
Can anyone help me in my search for the perfect HTML parser? Thanks in advance!
Comments (3)
jsoup, I would say.
See Also
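For what it's worth, a minimal sketch of what the jsoup route might look like for the "read a page in, get well-formed HTML back" goal from the question. It assumes the jsoup jar is on the classpath; the class name `CleanHtmlSketch`, the helper `clean`, and the sample input are made up for illustration and are not from the original post. Whether the result renders closely enough to the original page is something you would still need to check against your own inputs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CleanHtmlSketch {

    // Hypothetical helper: parse possibly malformed HTML with jsoup and
    // serialize the repaired document tree back to an HTML string.
    static String clean(String dirtyHtml) {
        Document doc = Jsoup.parse(dirtyHtml);
        // Serialize as HTML rather than XML. This is jsoup's default;
        // it is set here only to make the intent explicit.
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.html);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // Illustrative input: unclosed <p> and <b> tags, no <html> or <body>.
        String dirty = "<p>First paragraph<p>Second <b>bold";
        System.out.println(clean(dirty));
    }
}
```

jsoup is pure Java, so it should not run into the native-dependency restriction mentioned in the question. To clean a live page rather than a string, `Jsoup.connect(url).get()` returns the same kind of `Document`.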
I have used HTML Tidy in the past.
TagSoup?