Managed (.NET) library with functionality similar to HTML Tidy?

Posted 2024-08-30 14:26:07


Is there an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine-friendly format such as XHTML?

I've tried the HTML Agility Pack, but it fails to parse even fairly simple examples correctly.

To give an example of HTML that should be parsed correctly:

<html><title>test</title>
<body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>

In HTML 4, the end tag of an li element is optional (see specification), and so is the end tag of a p element. In other words, the sample above should be parsed as:

<html><title>test</title>
<body>
    <ul><li>TestElem1</li>
        <li>TestElem2</li>
        <li>TestElem3 List:
            <ul><li>Nested1</li>
                <li>Nested2</li>
                <li>Nested3</li>
            </ul></li>
        <li>TestElem4</li>
    </ul>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
</body></html>
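
To make the expected behaviour concrete, here is a minimal sketch of the implied-end-tag rule for li and p. It is written in Python purely as a stdlib-only illustration and is not a .NET solution or a replacement for HTML Tidy. The names `ImpliedEndTagCloser` and `close_implied_tags` are hypothetical, the tag sets are deliberately incomplete, and comments and doctypes are simply dropped.

```python
import html
from html.parser import HTMLParser

# Start tags that implicitly end an open <p> (a small, incomplete subset).
CLOSES_P = {"p", "ul", "ol", "li", "div", "table", "pre", "h1", "h2", "h3"}
# List containers: an implied </li> never reaches past the enclosing list.
LIST_SCOPE = {"ul", "ol"}
# Elements that never have content or an end tag (subset).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ImpliedEndTagCloser(HTMLParser):
    """Re-serializes HTML, writing out the end tags that were left implied."""

    def __init__(self):
        super().__init__()   # convert_charrefs=True: text arrives unescaped
        self.out = []        # output fragments
        self.stack = []      # names of currently open elements

    def _pop(self):
        self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends a <p> that is still open.
        if tag in CLOSES_P and self.stack and self.stack[-1] == "p":
            self._pop()
        # A new <li> ends the previous <li> of the same list, if there is one.
        if tag == "li":
            for open_tag in reversed(self.stack):
                if open_tag in LIST_SCOPE:
                    break
                if open_tag == "li":
                    while self.stack[-1] != "li":
                        self._pop()
                    self._pop()
                    break
        text = "".join(
            ' %s="%s"' % (name, html.escape(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, text))
        else:
            self.out.append("<%s%s>" % (tag, text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything still open inside the element being ended.
        while self.stack[-1] != tag:
            self._pop()
        self._pop()

    def handle_data(self, data):
        # Re-escape text so the output stays well-formed.
        self.out.append(html.escape(data, quote=False))


def close_implied_tags(markup):
    parser = ImpliedEndTagCloser()
    parser.feed(markup)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser._pop()
    return "".join(parser.out)


print(close_implied_tags("<ul><li>a<li>b<ul><li>c</ul><li>d</ul><p>x<p>y"))
```

Run on the sample from the question, this sketch should produce the same element nesting as the expected output above, although whitespace ends up before the inserted end tags rather than after them. A real cleaner would also need the rest of the optional-tag rules (td, tr, option, and so on), script and style handling, and recovery from genuinely malformed markup.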

Since the aim is to use the library on a variety of machines, having to fall back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it adds deployment hassle, sacrifices platform independence, and is impossible in sandboxed scenarios.

Any suggestions?
To recap, I'm looking for:

An HTML cleaner à la HTML Tidy
Must be able to deal with real-world HTML, not just XHTML; at the very least it must correctly read valid HTML 4
Must be able to convert to a more easily processable XML format
Should be purely managed code
