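How to format a DOM structure in PHP?
My first guess was the PHP DOM classes (with the formatOutput parameter). However, I cannot get this block of HTML formatted and output correctly. As you can see, the indentation and alignment are not correct.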
$html = '
<html>
<body>
<div>
<div>
<div>
<p>My Last paragraph</p>
<div>
This is another text block and some other stuff.<br><br>
Again we will start a new paragraph
and some other stuff
<br>
</div>
</div>
<div>
<div>
<h1>Another Title</h1>
</div>
<p>Some text again <b>for sure</b></p>
</div>
</div>
<div>
<pre><code>
<span>&lt;html&gt;</span>
<span>&lt;head&gt;</span>
<span>&lt;title&gt;</span>
Page Title
<span>&lt;/title&gt;</span>
<span>&lt;/head&gt;</span>
<span>&lt;/html&gt;</span>
</code></pre>
</div>
</div>
</body>
</html>';
header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();
Update: I added a preformatted code block to the example.
Here are some improvements over @hijarian's answer:
LibXML Errors
If you don't call libxml_use_internal_errors(true), PHP will output every HTML error it finds. If you do call it, the errors are not suppressed; they go to a pile that you can inspect by calling libxml_get_errors(). The problem is that this pile eats memory, and DOMDocument is known to be very picky. If you're processing lots of files in a batch, you will eventually run out of memory. There are two solutions for this.
The first is to call libxml_clear_errors() whenever libxml_use_internal_errors(true) returns true. Since the function returns the previous value of the setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).
The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.
Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
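The error handling described above could be sketched as follows. This is a minimal illustration, not the original answer's code; the helper name load_html_quietly() and the sample inputs are assumptions made for the example.

```php
<?php
// Hypothetical helper (name invented for this sketch): parse one HTML string
// without letting the libxml error buffer grow across a batch.
function load_html_quietly(string $html): DOMDocument
{
    // libxml_use_internal_errors(true) returns the PREVIOUS setting, so a
    // true result means an earlier call may have left errors in the buffer.
    if (libxml_use_internal_errors(true) === true) {
        libxml_clear_errors();
    }

    $dom = new DOMDocument();
    // Second option from the text: ask libxml not to report errors or warnings.
    $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    return $dom;
}

// Sample batch of slightly broken HTML (made up for the example).
$documents = ['<p>first<p>unclosed', '<div><span>second</div>'];
foreach ($documents as $html) {
    $dom = load_html_quietly($html);
    echo $dom->getElementsByTagName('body')->item(0)->textContent, "\n";
}
```

Both options are combined here only for brevity; either one can be used on its own. Empty input is not guarded against in this sketch.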
Regex
The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), does not give back what it has matched (the possessive ++, which is faster), and drops the case-insensitive overhead (since whitespace has no case).
You may also want to normalize newlines to \n, along with other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.
DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the above regex.
Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.
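As a rough sketch of the whitespace step, the suggested pattern can be combined with newline normalization like this. The function name and the '><' replacement string are assumptions for the example; only the pattern itself comes from the text above.

```php
<?php
// Hypothetical helper name; the pattern is the one suggested above.
function strip_inter_tag_whitespace(string $html): string
{
    // Normalize \r\n and lone \r to \n so no carriage return reaches saveXML().
    $html = str_replace(["\r\n", "\r"], "\n", $html);

    // Remove whitespace-only runs between a closing ">" and the next "<".
    // preg_replace() returns null on failure, so fall back to the input.
    return preg_replace('~>[[:space:]]++<~m', '><', $html) ?? $html;
}

echo strip_inter_tag_whitespace("<div>\r\n  <p>Hello  world</p>\n</div>"), "\n";
// prints: <div><p>Hello  world</p></div>
```

Whitespace inside text content ("Hello  world") is left alone; only runs that sit directly between two tags are removed.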
Additional Flags for loadHTML()
LIBXML_COMPACT - "this may speed up your application without needing to change the code"
LIBXML_NOBLANKS - need to run more tests on this one
LIBXML_NOCDATA - need to run more tests on this one
LIBXML_NOXMLDECL - documented, but not implemented =(
UPDATE: Setting any of these options will have the effect of not formatting the output.
On saveXML()
The DOMDocument::saveXML() method will output the XML declaration. We need to purge it manually (since LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break, or even use a regex to clean it up.
Another option, which seems to have an added benefit, is simply to serialize only the root element with saveXML($dom->documentElement).
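A small sketch of the difference between serializing the whole document and serializing only the root element. The sample markup and variable names are made up for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');

// Serializing the whole document starts with an XML declaration ...
$whole = $dom->saveXML();

// ... while serializing only the root element leaves the declaration out.
$rootOnly = $dom->saveXML($dom->documentElement);

echo $whole, "\n";
echo $rootOnly, "\n";
```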
Another thing: if you have inline tags that are empty, such as an empty b, i or li, the saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy has a similar problem, except that it just drops the node.
To fix that, you can pass the LIBXML_NOEMPTYTAG flag to saveXML(). This option writes empty (aka self-closing) tags as open/close pairs (inline tags), which allows empty inline tags as well.
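To illustrate the effect of LIBXML_NOEMPTYTAG, here is a short sketch. The markup and class names are invented for the example, since the original sample is not available.

```php
<?php
$dom = new DOMDocument();
// Sample markup with empty li and i elements (hypothetical class names).
$dom->loadHTML('<html><body><ul><li class="divider"></li><li><i class="icon"></i> Dashboard</li></ul></body></html>');

// Default serialization collapses empty elements into self-closing form,
// which an HTML parser later reads as an unclosed opening tag.
$collapsed = $dom->saveXML($dom->documentElement);

// With LIBXML_NOEMPTYTAG every empty element gets an explicit closing tag.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n";
echo $expanded, "\n";
```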
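Fixing HTML[5]
With all the stuff we did so far, our HTML output now has two major problems:
the DOCTYPE is gone (it was stripped when we serialized only $dom->documentElement)
empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on
Fixing the first one is fairly easy, since HTML5 is pretty permissive: prepend <!DOCTYPE html> to the output.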
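Restoring the DOCTYPE can be as simple as string concatenation. A minimal sketch, with sample markup made up for the example:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');

// The HTML5 doctype is short enough to prepend by hand.
$html = "<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $html, "\n";
```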
To get our empty tags back, which are the following:
area
base
basefont (deprecated in HTML5)
br
col
command
embed
frame (deprecated in HTML5)
hr
img
input
keygen
link
meta
param
source
track
wbr
We can either use str_[i]replace in a loop or a regular expression.
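Both variants could look roughly like this. The sample input, the variable names, and the ' />' replacement string are assumptions for the sketch; the tag list is the one given above.

```php
<?php
// The empty (void) tags listed above.
$voidTags = ['area', 'base', 'basefont', 'br', 'col', 'command', 'embed', 'frame',
    'hr', 'img', 'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Sample of what saveXML() with LIBXML_NOEMPTYTAG might produce.
$html = '<p>One<br></br>Two<img src="a.png"></img></p><b></b>';

// Variant 1: plain string replacement in a loop.
$viaLoop = $html;
foreach ($voidTags as $tag) {
    $viaLoop = str_replace('></' . $tag . '>', ' />', $viaLoop);
}

// Variant 2: one regular expression built from the same list.
$pattern = '~></(?:' . implode('|', $voidTags) . ')>~';
$viaRegex = preg_replace($pattern, ' />', $html);

echo $viaLoop, "\n";
// prints: <p>One<br />Two<img src="a.png" /></p><b></b>
```

The empty b element keeps its closing tag because b is not in the list. The case-sensitive str_replace() is used here on the assumption that serialized tag names are lowercase.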
This is a costly operation. I haven't benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn't sure whether the case-insensitive version is needed, being under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
On <script> and <style> Tags
These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.
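One way such a replacement might look is sketched below. The pattern is an assumption, not the original answer's regex; it only removes CDATA markers that directly wrap the content of a script or style element, and the sample string is made up.

```php
<?php
// Sample of serialized output with a CDATA-wrapped script (invented for the example).
$html = "<script><![CDATA[\nif (a < b) { run(); }\n]]></script>";

// Drop the CDATA markers only where they directly wrap script/style content;
// \2 makes sure the closing tag matches the opening one.
$pattern = '~(<(script|style)\b[^>]*>)\s*<!\[CDATA\[(.*?)\]\]>\s*(</\2>)~s';
$clean = preg_replace($pattern, '$1$3$4', $html);

echo $clean, "\n";
```

Content that was split into several CDATA sections would need a more careful pattern than this one.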
Implementation
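The original implementation is not reproduced here. What follows is only a sketch of how the steps described in this answer might be combined, under stated assumptions: the function name format_html() is hypothetical, formatOutput is taken from the question, and the script/style CDATA handling is left out.

```php
<?php
// Hypothetical function name; a sketch of the steps above, not the original code.
function format_html(string $html): string
{
    // Keep the libxml error buffer from growing across repeated calls.
    if (libxml_use_internal_errors(true) === true) {
        libxml_clear_errors();
    }

    // Normalize newlines, then drop whitespace-only runs between tags.
    $html = str_replace(["\r\n", "\r"], "\n", $html);
    $html = preg_replace('~>[[:space:]]++<~m', '><', $html) ?? $html;

    // The load*() methods complain about empty or blank input, so bail out early.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();
    $dom->formatOutput = true;
    $dom->loadHTML($html);

    // Serialize only the root element (no XML declaration), keep empty
    // elements as open/close pairs, and put an HTML5 doctype back on top.
    $out = "<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

    // Turn the open/close pairs of void elements back into single tags.
    $void = 'area|base|basefont|br|col|command|embed|frame|hr|img|input'
          . '|keygen|link|meta|param|source|track|wbr';

    return preg_replace('~></(?:' . $void . ')>~', ' />', $out) ?? $out;
}

echo format_html("<html>\n<body>\n<div>\n<p>One<br>Two</p>\n</div>\n</body>\n</html>"), "\n";
```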
Here's the comment at php.net: https://www.php.net/manual/en/domdocument.save.php#88630
It looks like when you load HTML from a string (like you did), DOMDocument becomes lazy and does not format anything in it.
Here's a working solution to your problem: basically, you throw out the spaces between tags and use saveXML() instead of saveHTML().
saveHTML() just does not work in this situation. However, you get an XML declaration in the first line of the text.
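The approach described in this answer might look like the sketch below. The regex is the one quoted in the other answer as coming from this one; the shortened sample markup and the variable names are assumptions for the example.

```php
<?php
// Shortened version of the markup from the question.
$html = "<html>\n<body>\n<div>\n<p>My Last paragraph</p>\n</div>\n</body>\n</html>";

libxml_use_internal_errors(true);

// Throw out the spaces between tags.
$html = preg_replace('/>\s*</im', '><', $html);

$dom = new DOMDocument();
$dom->formatOutput = true;
$dom->loadHTML($html);

// saveXML() instead of saveHTML(); note the XML declaration on the first line.
$formatted = $dom->saveXML();
echo $formatted;
```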