How do I prevent PHP's DOMDocument from encoding HTML entities?

Posted on 2024-07-17 14:44:18

I have a function that replaces the href attribute of anchors in a string using PHP's DOMDocument. Here's a snippet:

$doc        = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) surrounds the $text in doctype, html, body, etc. tags. I tried working around this by doing this instead of loadHTML():

$doc        = new DOMDocument('1.0', 'UTF-8');
$node       = $doc->createTextNode($text);
$doc->appendChild($node);
...

Unfortunately, this encodes all the entities (anchors included). Does anyone know how to turn this off? I've already thoroughly looked through the docs and tried hacking it, but can't figure it out.

Thanks! :)

Comments (5)

·深蓝 2024-07-24 14:44:18

$text is a translated string with place-holder anchor tags

If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick.
I don't recommend processing HTML documents with regular expressions in general, but for a small, well-defined subset they are suitable.
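
As a minimal sketch of the preg_replace_callback idea, assuming every placeholder is a bare `<a>` opening tag with no attributes (the thread does not show the real placeholder format), the function name `fillPlaceholderAnchors` and the sample string below are hypothetical:

```php
<?php
// Hypothetical helper, not from the thread.
// Assumption: placeholders are bare "<a>" tags without attributes.
function fillPlaceholderAnchors(string $text, array $hrefs): string
{
    if ($hrefs === []) {
        return $text; // nothing to fill in
    }

    $i = 0;
    return preg_replace_callback(
        '/<a>/',
        function (array $match) use (&$i, $hrefs): string {
            // Reuse the last URL when there are more anchors than URLs.
            $href = $hrefs[min($i, count($hrefs) - 1)];
            $i++;
            // Escape the value so it is safe inside a quoted attribute.
            return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">';
        },
        $text
    );
}

$result = fillPlaceholderAnchors(
    'You can <a>click here</a> to visit Google.',
    ['http://google.com']
);
echo $result, "\n";
```

Because the pattern only matches the exact placeholder form, the rest of the string is left untouched and no document wrapper is ever generated.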

仙女 2024-07-24 14:44:18

XML has only a few predefined entities. All the other HTML entities are defined elsewhere. When you use loadHTML(), these entity definitions are loaded automatically; with loadXML() (or no load call at all) they are not.
createTextNode() does exactly what its name suggests. Everything you pass as the value is treated as text content, not as markup. That is, if you pass characters that have a special meaning in markup (<, >, ...), they are encoded so that a parser can distinguish the text from actual markup (&lt;, &gt;, ...).

Where does $text come from? Can't you do the replacement within the actual html document?
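
The difference between the two calls can be seen in a small experiment. This is only an illustrative sketch; the sample string and variable names are made up, and the text node is placed inside a `<p>` element purely to keep the output easy to read:

```php
<?php
$text = 'You can <a>click here</a>.';

// 1) createTextNode(): the string becomes character data,
//    so "<" and ">" are escaped when the node is serialized.
$asText = new DOMDocument('1.0', 'UTF-8');
$p = $asText->createElement('p');
$p->appendChild($asText->createTextNode($text));
$asText->appendChild($p);
$escaped = $asText->saveHTML($p);
$anchorsInText = $asText->getElementsByTagName('a')->length;

// 2) loadHTML(): the string is parsed as markup,
//    so the anchor becomes a real element in the tree.
$asMarkup = new DOMDocument('1.0', 'UTF-8');
$asMarkup->loadHTML($text);
$anchorsInMarkup = $asMarkup->getElementsByTagName('a')->length;

echo $escaped, "\n";
echo "anchors via createTextNode: $anchorsInText\n";
echo "anchors via loadHTML: $anchorsInMarkup\n";
```

In other words, the escaping is not an option that can be switched off; it follows from the string being stored as text rather than parsed as markup.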

各空 2024-07-24 14:44:18

Here is a slightly less hacky solution for this issue, and it works perfectly.

$TempAttributeName = 'gewrbamsbgadg';

// $node is the <a> element (a DOMElement) to modify

$newAttr = $dom->createAttribute($TempAttributeName);
$newAttr->value = "{{your_placeholder_or_whatever}}";
$node->setAttributeNode($newAttr);
$node->removeAttribute('href');

// Then rename the temporary attribute back to href in the serialized HTML
$finalHTMLString = $dom->saveHTML();
$finalHTMLString = str_replace($TempAttributeName, 'href', $finalHTMLString);

echo $finalHTMLString;
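
For completeness, here is a self-contained sketch of the same trick applied to every anchor in a small document. The sample markup is made up, and `setAttribute()` is used as shorthand for the createAttribute()/setAttributeNode() pair above. The apparent goal is to keep the placeholder value out of any href-specific handling during serialization, although the answer does not spell this out.

```php
<?php
// Random attribute name from the answer; it must not occur anywhere else
// in the document, or str_replace() below would rewrite that text too.
$tempAttributeName = 'gewrbamsbgadg';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<p>You can <a href="#">click here</a>.</p>'); // sample input

foreach ($dom->getElementsByTagName('a') as $node) {
    // Store the placeholder under the temporary name and drop the real href.
    $node->setAttribute($tempAttributeName, '{{your_placeholder_or_whatever}}');
    $node->removeAttribute('href');
}

// Rename the temporary attribute to href in the serialized string.
$finalHTMLString = str_replace($tempAttributeName, 'href', $dom->saveHTML());
echo $finalHTMLString;
```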

红玫瑰 2024-07-24 14:44:18

I ended up working around this in a fragile way, changing:

return $doc->saveHTML();

into:

$text       = $doc->saveHTML();
return mb_substr($text, 122, -19);

This cuts off the 122-character prefix (doctype, <html>, <body>, <p>) and the 19-character suffix that saveHTML() adds, changing this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>
You can <a href="http://www.google.com">click here</a> to visit Google.</p>
</body></html> 

into this:

You can <a href="http://www.google.com">click here</a> to visit Google.

Can anyone figure out something better?
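
One alternative that avoids hard-coded offsets is to serialize only the children of the `<body>` element, using the optional node argument of saveHTML(). The sketch below is not from the thread: the function name `replaceAnchorHrefs` is hypothetical, and the fragment is wrapped in explicit `<html><body>` tags so that the container to read back is known.

```php
<?php
// Hypothetical helper, not from the thread.
// Assumption: $text is plain ASCII; non-ASCII input would need an explicit
// charset hint so that loadHTML() decodes it correctly.
function replaceAnchorHrefs(string $text, string $href): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $text . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $href);
    }

    // Serialize each child of <body> on its own, which leaves out the
    // doctype and the html/body wrapper.
    $body = $doc->getElementsByTagName('body')->item(0);
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$out = replaceAnchorHrefs(
    'You can <a href="#">click here</a> to visit Google.',
    'http://google.com'
);
echo $out, "\n";
```

Unlike the mb_substr() version, this does not depend on the exact length of the markup that the serializer generates.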

原野 2024-07-24 14:44:18

OK, here's the solution I ended up with. I decided to go with VolkerK's suggestion.

public static function ReplaceAnchors($text, array $attributeSets)
{
    $expression = '/(<a)([\s\w\d:\/=_&\[\]\+%".?])*(>)/';

    if (empty($attributeSets) || !is_array($attributeSets)) {
        // no attributes to set. Set href="#".
        return preg_replace($expression, '$1 href="#"$3', $text);
    }

    $attributeStrs  = array();
    foreach ($attributeSets as $attributeKeyVal) {
        // loop thru attributes and set the anchor
        $attributePairs = array();
        foreach ($attributeKeyVal as $name => $value) {
            if (!is_string($value) && !is_int($value)) {
                continue; // skip
            }

            $name               = htmlspecialchars($name);
            $value              = htmlspecialchars($value);
            $attributePairs[]   = "$name=\"$value\"";
        }
        $attributeStrs[]    = implode(' ', $attributePairs);
    }

    $i      = -1;
    $pieces = preg_split($expression, $text);
    foreach ($pieces as &$piece) {
        if ($i === -1) {
            // skip the first token
            ++$i;
            continue;
        }

        // figure out which attribute string to use
        if (isset($attributeStrs[$i])) {
            // pick the parallel attribute string
            $attributeStr   = $attributeStrs[$i];
        } else {
            // pick the last attribute string if we don't have enough
            $attributeStr   = $attributeStrs[count($attributeStrs) - 1];
        }

        // build a new opening anchor for this token.
        $piece  = '<a '.$attributeStr.'>'.preg_replace($expression, '$1 href="#"$3', $piece);
        ++$i;
    }
    unset($piece); // break the reference left behind by the foreach

    return implode('', $pieces);
}

This lets the caller pass a different attribute set for each anchor; when there are more anchors than sets, the last set is reused.
