Preventing DOMDocument::loadHTML() from converting entities

Posted on 2024-12-03 15:08:06

I have an HTML string from which I'm trying to extract list items. I'd like to extract the text and any child nodes; however, DOMDocument converts entities to their characters instead of leaving them in their original form.

I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. Note that I'm running PHP 5.2.17 on Win7.

Example code is:

$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}

&frac12; ends up getting converted to ½ (the single UTF-8 character, not the entity), which is not the desired format.

Comments (4)

紫﹏色ふ单纯 2024-12-10 15:08:06

A solution for PHP versions earlier than 5.3.6:

$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
  echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
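
As a hedged alternative to the iconv() round trip above, the sketch below passes the UTF-8 charset to htmlentities() directly. It assumes that DOM properties such as nodeValue return UTF-8, and it should also cover characters that have no ISO-8859-1 equivalent. The helper name encode_text_entities() is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: re-encode text that DOMDocument has already decoded.
function encode_text_entities($text)
{
    // Assumption: $text is UTF-8, as returned by DOM node properties.
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

$html = '<ul><li>text</li>'
      . '<li>&frac12; of this is <strong>strong</strong></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$items = array();
foreach ($doc->getElementsByTagName('li') as $node) {
    // nodeValue is the concatenated text content, so child markup is dropped.
    $items[] = encode_text_entities($node->nodeValue);
}

echo implode("\n", $items), "\n";
```

Like the answer it follows, this keeps only the text of each list item; the <strong> tags are lost because nodeValue returns text content only.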
蓝眼睛不忧郁 2024-12-10 15:08:06

Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.

It's probably not the best answer, since it makes some assumptions about the elements (such as their having no attributes). But my particular needs don't require attributes to be carried across yet (I'm sure my sample data will throw that at me later on), so this solution works for me.

$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));

    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;

}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        echo 'Node type is '.$child->nodeType.PHP_EOL;
        switch ($child->nodeType) {
        case XML_TEXT_NODE: // node type 3: re-encode entities in text
            $innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
            break;
        default:
            echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
            echo 'Node name '.$child->nodeName.PHP_EOL;
            $innerHTML .= '<'.$child->nodeName.'>';
            $innerHTML .= _get_inner_html( $child );
            $innerHTML .= '</'.$child->nodeName.'>';
            break;
        }
    }

    return $innerHTML;
}
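
The attribute limitation mentioned above could be lifted along the following lines. This is only a sketch under stated assumptions, not the answer author's code: the function name inner_html_with_entities() is hypothetical, only text and element nodes are handled (comments and other node types are skipped), and void elements such as <br> would come out as <br></br>.

```php
<?php
// Hypothetical variant of _get_inner_html() that also copies attributes.
function inner_html_with_entities($node)
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // Re-encode characters such as U+00BD back into named entities.
            $out .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            $out .= '<' . $child->nodeName;
            foreach ($child->attributes as $attr) {
                $out .= ' ' . $attr->nodeName . '="'
                      . htmlentities($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            $out .= '>' . inner_html_with_entities($child)
                  . '</' . $child->nodeName . '>';
        }
        // Other node types are ignored in this sketch.
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<ul><li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li></ul>'
);

$li = $doc->getElementsByTagName('li')->item(0);
$result = inner_html_with_entities($li);
echo $result, "\n";
```

Branching on XML_ELEMENT_NODE rather than using a catch-all default keeps comments and processing instructions from being written out as if they were tags.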
空宴 2024-12-10 15:08:06

I'm a bit late and maybe it's not exactly your case, but I hate hacks and I found the cleanest way to avoid the conversions you're talking about:

$d = new DOMDocument('1.0', 'UTF-8');
$d->loadXML('<?xml version="1.0" encoding="UTF-8"?><t>Hello ½ World</t>');
print_r($d->saveXML());

output: <t>Hello ½ World</t>
$d = new DOMDocument('1.0', 'UTF-8');
$d->loadXML('<?xml version="1.0"?><t>Hello ½ World</t>');
print_r($d->saveXML());

output: <t>Hello ½ World</t>
想挽留 2024-12-10 15:08:06

There is no need to iterate over the child nodes:

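function innerHTML($node)
{
    // Serialize the node itself, then strip its own opening and closing tags.
    $html = $node->ownerDocument->saveXML($node);
    $name = preg_quote($node->nodeName, '%');
    return preg_replace("%^<{$name}[^>]*>|</{$name}>$%", '', $html);
}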