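Converting spaces between PRE tags via a DOM parser
Regex was my original idea for a solution, although it quickly became clear that a DOM parser would be more appropriate. I want to convert the spaces between PRE tags in a string of HTML text to &nbsp;. For example: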
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc 123
<span class="abc">abc 123</span>
</pre>
<pre>123 123</pre>
into (note that the space inside the span tag itself, before its attribute, is preserved):
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc&nbsp;123
<span class="abc">abc&nbsp;123</span>
</pre>
<pre>123&nbsp;123</pre>
The result needs to be serialised back into string format, for use elsewhere.
Comments (2)
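This is somewhat tricky when you want to insert &nbsp; entities without DOM converting the ampersand to &amp;, because entities are nodes and spaces are just character data. Here is how to do it.
Fetching all the pre elements and working with their nodeValue does not work here, because the nodeValue property contains the combined DOMText values of all the children, e.g. it would include the text of the span children. Setting nodeValue on the pre element would delete those child elements.
So instead of fetching the pre nodes, we fetch all the DOMText nodes that have a pre element somewhere up on their ancestor axis.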
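As a minimal sketch of that selection, assuming the markup is loaded with DOMDocument::loadHTML and queried through DOMXPath (the sample string and variable names are illustrative, not taken from the original answer):

```php
<?php
// Illustrative sample: one pre element with a nested span, plus a paragraph
// that should not be selected.
$html = '<pre class="abc" id="abc">abc 123 <span class="abc">abc 123</span></pre>'
      . '<p>not selected</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Every text node that has a pre element among its ancestors, at any depth.
$textNodes = $xpath->query('//text()[ancestor::pre]');

foreach ($textNodes as $node) {
    // var_export makes leading and trailing whitespace visible.
    echo var_export($node->nodeValue, true), "\n";
}
```

The text inside the span is returned as its own node, which is what allows the span element itself to stay untouched later on.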
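We then go through each of those DOMText nodes and split them into separate DOMText nodes at each space. We remove the space and insert an nbsp entity node before the split-off node, so in the end the tree alternates DOMText nodes with nbsp entity nodes.
Because we only work with the DOMText nodes, any DOMElement is left untouched, so the span elements inside the pre element are preserved.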
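The splitting step could look roughly like the following. This is a sketch under assumptions: the sample markup is made up, and strpos is used for the offset, which is only safe for ASCII text (multi-byte text would need a character-based offset such as mb_strpos).

```php
<?php
// Illustrative sample markup.
$html = '<pre>abc 123<span class="abc">abc 123</span></pre>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Copy the result into an array, since the loop below modifies the tree.
$textNodes = iterator_to_array($xpath->query('//text()[ancestor::pre]'));

foreach ($textNodes as $text) {
    // Cut at the first space, then continue with the remainder.
    // Assumption: ASCII content, so byte offset equals character offset.
    while (($pos = strpos($text->nodeValue, ' ')) !== false) {
        $rest = $text->splitText($pos);  // $rest now starts with the space
        $rest->deleteData(0, 1);         // remove that space
        // Insert an entity reference node, not the literal string "&nbsp;",
        // so the ampersand is not escaped to &amp; on output.
        $rest->parentNode->insertBefore($dom->createEntityReference('nbsp'), $rest);
        $text = $rest;
    }
}

// Serialise only the pre element (node argument needs PHP >= 5.3.6).
$result = $dom->saveHTML($dom->getElementsByTagName('pre')->item(0));
echo $result, "\n";
```

Only text nodes are split, so the span element and its class attribute pass through unchanged.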
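Caveat:
Your snippet is not valid because it does not have a root element. When using loadHTML, libxml adds any missing structure to the DOM, which means you get your snippet back wrapped in a DOCTYPE, html and body tag.
If you want the original snippet back, you have to fetch the body node with getElementsByTagName and serialise all of its children to get the innerHTML. Unfortunately, there is no innerHTML function or property in PHP's DOM implementation, so this has to be done manually.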
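One way to do this manually is sketched below. The innerHtml function is a hypothetical helper written for illustration and is not part of PHP's DOM extension. It relies on saveHTML() accepting a node argument (PHP >= 5.3.6), and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: concatenates the serialised children of a node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Serialise each child on its own, skipping the wrapper node itself.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
// A fragment without a root element; libxml adds html and body around it.
$dom->loadHTML('<pre>123 123</pre><p>text</p>');

$body = $dom->getElementsByTagName('body')->item(0);
$fragment = innerHtml($body);
echo $fragment, "\n";
```

The exact whitespace between serialised siblings may vary with the libxml version, so the result is best compared loosely rather than byte for byte.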
I see the shortcoming of my previous answer: it did not preserve tags inside the <pre> tag, so a workaround is needed.
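Note #1
The DOMDocument::saveHTML() method in PHP >= 5.3.6 allows you to specify the node to output. Otherwise you can use str_replace() or preg_replace() to eliminate the doctype, html and body tags.
Note #2
This trick seems to work and results in one less line of code, but I am not sure if it is guaranteed to work: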