Managed (.NET) library with HTML Tidy-like functionality?
Is there an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine friendly format such as XHTML?
I've tried the HTML Agility Pack, but that fails to correctly parse even fairly simple examples.
To give an example of HTML that should be parsed correctly:
<html><title>test</title>
<body>
<ul><li>TestElem1
<li>TestElem2
<li>TestElem3 List:
<ul><li>Nested1
<li>Nested2</li>
<li>Nested3
</ul>
<li>TestElem4
</ul>
<p>paragraph 1
<p>paragraph 2
<p>paragraph 3
</body></html>
li tags don't need to be closed (see specification), and neither do P tags. In other words, the above sample should be parsed as:
<html><title>test</title>
<body>
<ul><li>TestElem1</li>
<li>TestElem2</li>
<li>TestElem3 List:
<ul><li>Nested1</li>
<li>Nested2</li>
<li>Nested3</li>
</ul></li>
<li>TestElem4</li>
</ul>
<p>paragraph 1</p>
<p>paragraph 2</p>
<p>paragraph 3</p>
</body></html>
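To make the optional-end-tag rule concrete, here is a minimal sketch of how a cleaner could close the implied li and p elements. It is written in Python purely as a language-neutral illustration; it is not a .NET library and not how HTML Tidy or the HTML Agility Pack work internally. The names `ImpliedEndTagCloser` and `tidy` are hypothetical, and the rule set is an assumed, heavily simplified subset of HTML 4 (only li and p are handled; comments and doctypes are dropped).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumption: a deliberately small subset of the HTML 4 rules.
LIST_CONTAINERS = frozenset({"ul", "ol"})
# Start tags that imply the end of an open <p>.
CLOSES_P = frozenset({"p", "ul", "ol", "li", "div", "table", "pre",
                      "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"})
# Elements that never have content or an end tag.
VOID = frozenset({"br", "hr", "img", "input", "meta", "link"})


class ImpliedEndTagCloser(HTMLParser):
    """Hypothetical helper: re-emits HTML with implied end tags made explicit."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.out = []        # output fragments

    def _close_through(self, tag, barriers=frozenset()):
        # Walk down from the innermost open element. If `tag` is found before
        # any barrier, close everything up to and including it.
        for depth in range(len(self.open_tags) - 1, -1, -1):
            name = self.open_tags[depth]
            if name in barriers:
                return
            if name == tag:
                while len(self.open_tags) > depth:
                    self.out.append(f"</{self.open_tags.pop()}>")
                return

    def _format_attrs(self, attrs):
        # Valueless attributes (e.g. "checked") are expanded to name="name".
        return "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P:
            self._close_through("p")
        if tag == "li":
            # A new <li> ends the previous <li> of the same list only.
            self._close_through("li", barriers=LIST_CONTAINERS)
        if tag in VOID:
            self.out.append(f"<{tag}{self._format_attrs(attrs)} />")
            return
        self.out.append(f"<{tag}{self._format_attrs(attrs)}>")
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is passed through as-is.
        self.out.append(f"<{tag}{self._format_attrs(attrs)} />")

    def handle_endtag(self, tag):
        # Closes any elements still open inside `tag`; stray end tags are ignored.
        self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def tidy(html):
    """Return `html` with implied </li> and </p> tags written out."""
    parser = ImpliedEndTagCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = """<html><title>test</title>
<body>
<ul><li>TestElem1
<li>TestElem2
<li>TestElem3 List:
<ul><li>Nested1
<li>Nested2</li>
<li>Nested3
</ul>
<li>TestElem4
</ul>
<p>paragraph 1
<p>paragraph 2
<p>paragraph 3
</body></html>"""

result = tidy(sample)
```

For the sample above, `result` has the same element structure as the expected output, although whitespace placement differs because each closing tag is emitted at the point where the next start tag implies it. A real cleaner needs far more rules than this (table sections, the implied head element, misnested inline tags, and so on).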
Since the aim is to use the library on various machines, falling back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it would require extra deployment hassle and sacrifice platform independence, and it is not possible at all in sandboxed scenarios.
Any suggestions?
To recap, I'm looking for:
- An HTML cleaner à la HTML Tidy
- Must be able to deal with real world HTML, not just XHTML, at the very least correctly reading valid HTML 4
- Must be able to convert to a more easily processable XML format
- Should be a purely managed application.
Comments (1)
Try TidyManaged.