Converting spaces to &nbsp; between PRE tags via a DOM parser

Posted 2024-11-24 02:17:35

Regex was my original idea for a solution, although it soon became apparent that a DOM parser would be more appropriate. I'd like to convert spaces to &nbsp; between PRE tags within a string of HTML text. For example:

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc 123
<span class="abc">abc 123</span>
</pre>
<pre>123 123</pre>

into (note the space in the span tag attribute is preserved):

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc&nbsp;123
<span class="abc">abc&nbsp;123</span>
</pre>
<pre>123&nbsp;123</pre>

The result needs to be serialised back into string format, for use elsewhere.

Comments (2)

我不咬妳我踢妳 2024-12-01 02:17:35

This is somewhat tricky when you want to insert &nbsp; entities without DOM converting the ampersand to &amp;, because entities are nodes and spaces are just character data. Here is how to do it:

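$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// Select every text node that has a pre element among its ancestors.
foreach ($xp->query('//text()[ancestor::pre]') as $textNode)
{
    $remaining = $textNode;
    // nodeValue is this node's own text; after each split it holds what is left to scan.
    while (($nextSpace = strpos($remaining->nodeValue, ' ')) !== false) {
        // Split at the space; the returned node starts with that space.
        $remaining = $remaining->splitText($nextSpace);
        // Drop the leading space ...
        $remaining->nodeValue = substr($remaining->nodeValue, 1);
        // ... and put an nbsp entity reference node in its place.
        $remaining->parentNode->insertBefore(
            $dom->createEntityReference('nbsp'),
            $remaining
        );
    }
}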

Fetching all the pre elements and working with their nodeValue doesn't work here, because the nodeValue property contains the combined DOMText values of all the descendants, e.g. it would include the text of the span children. Setting the nodeValue on the pre element would delete those child elements.
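To make that concrete, here is a minimal sketch of what goes wrong. The markup and variable names are made up for illustration; only the standard DOMDocument calls are used.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123 <span class="abc">abc 123</span></pre>');
$pre = $dom->getElementsByTagName('pre')->item(0);

// Reading nodeValue flattens the subtree: the span's text is included,
// but nothing in the string says where the span started or ended.
$combined    = $pre->nodeValue;
$spansBefore = $pre->getElementsByTagName('span')->length;

// Writing nodeValue replaces every child with a single text node,
// so the span element is gone afterwards.
$pre->nodeValue = str_replace(' ', '_', $combined);
$spansAfter = $pre->getElementsByTagName('span')->length;

var_dump($combined, $spansBefore, $spansAfter);
```

The span count drops from 1 to 0, which is why the text nodes have to be edited individually.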

So instead of fetching the pre nodes, we fetch all the DOMText nodes that have a pre element somewhere among their ancestors:

DOMElement pre
    DOMText "abc 123"         <-- picking this
    DOMElement span
       DOMText "abc 123"      <-- and this one
DOMElement pre
    DOMText "123 123"         <-- and this one

We then go through each of those DOMText nodes and split them into separate DOMText nodes at each space. We remove the space and insert an nbsp entity reference node before the split-off node, so in the end you get a tree like

DOMElement pre
    DOMText "abc"
    DOMEntityReference nbsp
    DOMText "123"
    DOMElement span
       DOMText "abc"
       DOMEntityReference nbsp
       DOMText "123"
DOMElement pre
    DOMText "123"
    DOMEntityReference nbsp
    DOMText "123"

Because we only worked with the DOMText nodes, any DOMElements are left untouched, so the span elements inside the pre element are preserved.
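If you want to check the resulting tree yourself, something like the following sketch should do. describeChildren() is a hypothetical helper written for this example, not part of the DOM extension, and the input is a shortened version of the snippet from the question.

```php
<?php
// Hypothetical helper: lists the direct children of a node as "kind:value" strings.
function describeChildren(DOMNode $node): array
{
    $out = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ENTITY_REF_NODE) {
            $out[] = 'entity:' . $child->nodeName;
        } elseif ($child->nodeType === XML_TEXT_NODE) {
            $out[] = 'text:' . $child->nodeValue;
        } else {
            $out[] = 'element:' . $child->nodeName;
        }
    }
    return $out;
}

$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123<span class="abc">abc 123</span></pre>');
$xp = new DOMXPath($dom);

// Same split-and-insert loop as in the answer.
foreach ($xp->query('//text()[ancestor::pre]') as $textNode) {
    $remaining = $textNode;
    while (($pos = strpos($remaining->nodeValue, ' ')) !== false) {
        $remaining = $remaining->splitText($pos);
        $remaining->nodeValue = substr($remaining->nodeValue, 1);
        $remaining->parentNode->insertBefore(
            $dom->createEntityReference('nbsp'),
            $remaining
        );
    }
}

$pre  = $dom->getElementsByTagName('pre')->item(0);
$span = $dom->getElementsByTagName('span')->item(0);
$preShape  = describeChildren($pre);
$spanShape = describeChildren($span);
print_r($preShape);
print_r($spanShape);
```

Both lists should show a text node, an nbsp entity reference and another text node, with the span still in place as the last child of the pre.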

Caveat:

Your snippet is not a complete document because it doesn't have a root element. When using loadHTML, libxml adds any missing structure to the DOM, which means you get your snippet back wrapped in a DOCTYPE, html and body tags.

If you want the original snippet back, you have to fetch the body node with getElementsByTagName and serialize all of its children to get the innerHTML. Unfortunately, there is no innerHTML function or property in PHP's DOM implementation, so we have to do that manually:

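$innerHtml = '';
// Walk the children of the body element that libxml added.
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    // Copy each child into a throwaway document so it can be serialized on its own.
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child, true));
    $innerHtml .= $tmp_doc->saveHTML();
}
echo $innerHtml;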

夜吻♂芭芘 2024-12-01 02:17:35

I see the shortcoming of my previous answer. Here is a workaround that preserves tags inside the <pre> tag:

<?php
$test = file_get_contents('input.html');
$dom = new DOMDocument('1.0');
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$pre = $xpath->query('//pre//text()');
// manipulate nodes of type XML_TEXT_NODE
foreach($pre as $e) {
    $e->nodeValue = str_replace(' ', '__REPLACEMELATER__', $e->nodeValue);
    // when you attempt to write &nbsp; in a dom node
    // the & will be converted to &amp; :(
}
$temp = $dom->saveHTML();
// strip the wrapper that loadHTML() added around the fragment
$temp = str_replace('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '', $temp);
$temp = str_replace('<html>', '', $temp);
$temp = str_replace('<body>', '', $temp);
$temp = str_replace('</body>', '', $temp);
$temp = str_replace('</html>', '', $temp);
// swap the placeholder for the real entity in the serialized string
$temp = str_replace('__REPLACEMELATER__', '&nbsp;', $temp);
echo $temp;
?>

Input

<p>paragraph 1 remains untouched</p>
<pre>preformatted 1</pre>
<div>
    <pre>preformatted 2</pre>
</div>
<div>
    <pre>preformatted 3 <span class="foo">span text</span> preformatted 3</pre>
</div>
<div>
    <pre>preformatted 4 <span class="foo">span <b class="bla">bold test</b> text</span> preformatted 3</pre>
</div>

Output

<p>paragraph 1 remains untouched</p>
<pre>preformatted&nbsp;1</pre>
<div>
    <pre>preformatted&nbsp;2</pre>
</div>
<div>
    <pre>preformatted&nbsp;3&nbsp;<span class="foo">span&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
<div>
    <pre>preformatted&nbsp;4&nbsp;<span class="foo">span&nbsp;<b class="bla">bold&nbsp;test</b>&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
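For anyone wondering why the placeholder in the workaround is needed at all, the sketch below compares the two ways of getting "&nbsp;" into the tree. The one-line input is made up for the example.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc</pre>');
$pre = $dom->getElementsByTagName('pre')->item(0);

// A text node holding the six literal characters "&nbsp;" ...
$pre->appendChild($dom->createTextNode('&nbsp;'));
// ... followed by a real entity reference node.
$pre->appendChild($dom->createEntityReference('nbsp'));

$html = $dom->saveHTML();
echo $html;
```

In the serialized markup the text node's ampersand comes out escaped as &amp;nbsp;, while the entity reference comes out as &nbsp;. Character data is always escaped on output, so the entity has to be either a node of its own or patched into the string after serializing.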

Note #1

The DOMDocument::saveHTML() method in PHP >= 5.3.6 allows you to specify the node to output. Otherwise you can use str_replace() or preg_replace() to eliminate the doctype, html and body tags.
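As a rough sketch of the first option in Note #1, assuming PHP >= 5.3.6 and a made-up two-element fragment, you can serialize the children of the body one by one instead of stripping tags from the full output:

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<p>one</p><pre>two <b>three</b></pre>');

// loadHTML() wrapped the fragment in html/body; pick up the body it added.
$body = $dom->getElementsByTagName('body')->item(0);

$fragment = '';
foreach ($body->childNodes as $child) {
    // saveHTML($node) serializes only that node and its descendants.
    $fragment .= $dom->saveHTML($child);
}
echo $fragment;
```

The result contains the original elements without the doctype, html or body wrappers. The serializer may add line breaks between block elements, so don't rely on the output being byte-for-byte identical to the input.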

Note #2

This trick seems to work and results in one less line of code but I am not sure if it is guaranteed to work:

$e->nodeValue = utf8_encode(str_replace(' ', "\xA0", $e->nodeValue));
// dom library will attempt to convert 0xA0 to &nbsp;
// nodeValue expects utf-8 encoded data but 0xA0 is not valid in this encoding
// hence replaced string must be utf-8 encoded
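
A possible variation on Note #2, offered as a sketch and not as a guaranteed fix: utf8_encode() is deprecated as of PHP 8.2, and since nodeValue already returns UTF-8, running it over the whole string would also re-encode any non-ASCII characters that are already there. Writing the UTF-8 bytes of U+00A0 directly avoids both issues. The input markup is made up for the example.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123 <span class="abc">abc 123</span></pre>');
$xpath = new DOMXPath($dom);

// U+00A0 NO-BREAK SPACE, written as its two UTF-8 bytes.
$nbsp = "\xC2\xA0";

foreach ($xpath->query('//pre//text()') as $text) {
    // nodeValue is read and written as UTF-8, so no extra encoding step is needed.
    $text->nodeValue = str_replace(' ', $nbsp, $text->nodeValue);
}

$pre = $dom->getElementsByTagName('pre')->item(0);
var_dump($pre->textContent);
```

This only puts the no-break space character into the tree. Whether it is then written out as the &nbsp; entity or as a raw character depends on how the document is serialized, so check the output of your own saveHTML() call before relying on it.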