How to format a DOM structure in PHP?

Posted on 2024-12-13 07:08:07

My first guess was the PHP DOM classes (with the formatOutput parameter). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

Comments (2)

夕嗳→ 2024-12-20 07:08:07

Here are some improvements over @hijarian's answer:

LibXML Errors

If you don't call libxml_use_internal_errors(true), PHP will output all HTML errors found. However, if you call that function, the errors won't be suppressed; instead, they will go onto a pile that you can inspect by calling libxml_get_errors(). The problem with this is that it eats memory, and DOMDocument is known to be very picky. If you're processing lots of files in batch, you will eventually run out of memory. There are two solutions for this:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).
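As a minimal sketch of the batch scenario described above: the loop below inspects the buffered errors for each document and then empties the buffer, so it cannot grow across the batch. The $documents array and the $errorCounts bookkeeping are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical batch of HTML strings (illustrative only).
$documents = array(
    '<p>Unclosed paragraph<div>Block inside</p>',
    '<html><body><p>Fine</p></body></html>',
);

// Buffer libxml errors instead of printing them; remember the old setting.
$previous = libxml_use_internal_errors(true);
$errorCounts = array();

foreach ($documents as $index => $source) {
    $dom = new DOMDocument();
    $dom->loadHTML($source);

    // Inspect whatever libxml buffered for this document...
    $errorCounts[$index] = count(libxml_get_errors());

    // ...then empty the buffer so it cannot grow across the batch.
    libxml_clear_errors();
}

// Restore the setting that was active before this script changed it.
libxml_use_internal_errors($previous);
```

Clearing after every document, rather than only on repeated calls, trades a few extra function calls for a buffer that never holds more than one document's errors.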

The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
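One way to combine the flags with a guard against blank input might look like the sketch below. The helper name load_html_quietly() is hypothetical, and the exact behaviour for blank input varies by PHP version (a warning on older versions, while PHP 8 is understood to throw a ValueError for an empty string), so the guard simply avoids the call.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null for blank or unparsable input.
function load_html_quietly($html)
{
    // Blank input makes loadHTML() complain regardless of the flags below,
    // so bail out before calling it.
    if (trim($html) === '') {
        return null;
    }

    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    // Drop anything that still reached the buffer, then restore the setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $loaded ? $dom : null;
}
```

Calling load_html_quietly("  \n") returns null without touching libxml, while load_html_quietly('<p>Hi</p>') returns a populated DOMDocument.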

Regex

The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), doesn't give back what it matched (the possessive ++, which is faster), and drops the case-insensitive overhead (since whitespace has no case).

You may also want to normalize newlines to \n, along with other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.
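The two normalization steps can be sketched on a small sample string. The sample markup is an assumption for illustration; the patterns are the ones discussed above.

```php
<?php
$html = "<div>\r\n    <p>One</p>\r\n\t<p>Two</p>\r</div>";

// Step 1: turn every newline sequence (\r\n, \r, \v, ...) into "\n".
$html = preg_replace('~\R~u', "\n", $html);

// Step 2: drop whitespace runs that sit between two tags.
$html = preg_replace('~>[[:space:]]++<~', '><', $html);

echo $html; // <div><p>One</p><p>Two</p></div>
```

Note that step 2 also strips whitespace between tags inside pre blocks, which is the trade-off this answer accepts.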

DOMDocument::$preserveWhiteSpace is unnecessary after running the above regex.

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.

Additional Flags for loadHTML()

  • LIBXML_COMPACT - "this may speed up your application without needing to change the code"
  • LIBXML_NOBLANKS - need to run more tests on this one
  • LIBXML_NOCDATA - need to run more tests on this one
  • LIBXML_NOXMLDECL - documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On saveXML()

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.

Another option, which seems to have an added benefit, is simply doing:

$dom->saveXML($dom->documentElement);
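To make the two approaches concrete, here is a rough sketch that applies both to the same document. The sample markup and variable names are assumptions for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');

// Whole document: the output begins with an XML declaration.
$full = $dom->saveXML();

// Option 1: cut everything up to and including the first line break.
$withoutDeclaration = substr($full, strpos($full, "\n") + 1);

// Option 2: serialize only the root element, skipping the declaration entirely.
$rootOnly = $dom->saveXML($dom->documentElement);
```

Option 2 also leaves out anything else that sits before the root element, which matters later when the DOCTYPE has to be added back by hand.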

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option will convert empty (aka self-closing) tags to inline tags and allow empty inline tags as well.
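A small comparison may help show what the flag changes. This is a sketch with assumed sample markup; the exact serialization can vary with the libxml version in use.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<ul><li class="divider"></li><li><i class="icon-dashboard"></i> Dashboard</li></ul>');

$root = $dom->documentElement;

// Default: empty elements are written in self-closing form, e.g. <i class="..."/>.
$selfClosing = $dom->saveXML($root);

// With the flag: empty elements keep an explicit closing tag, e.g. <i class="..."></i>.
$expanded = $dom->saveXML($root, LIBXML_NOEMPTYTAG);
```

Browsers parsing the first form as HTML treat the self-closing i as an unclosed opening tag, which is the mangling described above.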

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

  1. no DOCTYPE (it was stripped when we used $dom->documentElement)
  2. empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

  • area
  • base
  • basefont (deprecated in HTML5)
  • br
  • col
  • command
  • embed
  • frame (deprecated in HTML5)
  • hr
  • img
  • input
  • keygen
  • link
  • meta
  • param
  • source
  • track
  • wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven't benchmarked them, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I'm not sure if the case-insensitive version is needed. I'm under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
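As a quick check of the idea, the sketch below runs the case-sensitive form of the pattern on a hand-written sample of what saveXML() with LIBXML_NOEMPTYTAG might produce. The sample string and the $voidTags variable are assumptions for illustration.

```php
<?php
// Output as it might look after saveXML(..., LIBXML_NOEMPTYTAG).
$html = '<p>Line one<br></br>Line two</p><hr></hr><b class="carret"></b>';

$voidTags = 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr';

// Collapse "></tag>" into " />" for void elements only.
$html = preg_replace('~></(?:' . $voidTags . ')>~', ' />', $html);

echo $html; // <p>Line one<br />Line two</p><hr /><b class="carret"></b>
```

The empty b element keeps its closing tag because b is not in the list, even though br and base start with the same letter.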

On <script> and <style> Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.
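A minimal sketch of that cleanup follows. It uses str_replace() rather than a regular expression, since the markers are fixed strings, and the sample document is an assumption. Whether the CDATA wrapper appears at all may depend on the libxml version, and a script that itself contains "]]>" would be altered by this blunt replacement.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><head><script>if (a < b) { run(); }</script></head><body></body></html>');

// The serialized script body is expected to be wrapped in <![CDATA[ ... ]]>.
$xml = $dom->saveXML($dom->documentElement);

// Remove the CDATA markers and keep the content between them.
$clean = str_replace(array('<![CDATA[', ']]>'), '', $xml);
```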

Implementation

function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    // Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}

看海 2024-12-20 07:08:07

Here's the comment on php.net: https://www.php.net/manual/en/domdocument.save.php#88630

It looks like when you load HTML from a string (like you did), DOMDocument becomes lazy and does not format anything in it.

Here's a working solution to your problem:

// Clean your HTML by hand first
$html = preg_replace('/>\s*</im', '><', $html);
$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false;
// Use saveXML(), not saveHTML()
print $dom->saveXML();

Basically, you throw out the whitespace between tags and use saveXML() instead of saveHTML().
saveHTML() just does not work in this situation. However, you get an XML declaration on the first line of the output.
