HTML Purifier: Removing an element conditionally based on its attributes

Posted on 2024-08-29 05:42:44

As per the HTML Purifier smoketest, 'malformed' URIs are occasionally discarded to leave behind an attribute-less anchor tag, e.g.

<a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a>

...as well as occasionally being stripped down to the protocol, e.g.

<a href="http://1113982867/">XSS</a> becomes <a href="http:/">XSS</a>

While that's unproblematic, per se, it's a bit ugly. Instead of trying to strip these out with regular expressions, I was hoping to use HTML Purifier's own library capabilities / injectors / plug-ins / whathaveyou.

Point of reference: Handling attributes

Conditionally removing an attribute in HTMLPurifier is easy. Here the library offers the class HTMLPurifier_AttrTransform with the method confiscateAttr().

While I don't personally use the functionality of confiscateAttr(), I do use an HTMLPurifier_AttrTransform as per this thread to add target="_blank" to all anchors.

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_Target();
// purify down here

HTMLPurifier_AttrTransform_Target is a very simple class, of course.

class HTMLPurifier_AttrTransform_Target extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // I could call $this->confiscateAttr() here to throw away an
        // undesired attribute
        $attr['target'] = '_blank';
        return $attr;
    }
}

That part works like a charm, naturally.

Handling elements

Perhaps I'm not squinting hard enough at HTMLPurifier_TagTransform, or am looking in the wrong place(s), or generally amn't understanding it, but I can't seem to figure out a way to conditionally remove elements.

Say, something to the effect of:

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addElementHandler('a');
$anchor->elem_transform_post[] = new HTMLPurifier_ElementTransform_Cull();
// add target as per 'point of reference' here
// purify down here

With the Cull class extending something that has a confiscateElement() ability, or comparable, wherein I could check for a missing href attribute or a href attribute with the content http:/.

HTMLPurifier_Filter

I understand I could create a filter, but the examples (Youtube.php and ExtractStyleBlocks.php) suggest I'd be using regular expressions in that, which I'd really rather avoid, if it is at all possible. I'm hoping for an onboard or quasi-onboard solution that makes use of HTML Purifier's excellent parsing capabilities.

Returning null in a child-class of HTMLPurifier_AttrTransform unfortunately doesn't cut it.

Anyone have any smart ideas, or am I stuck with regexes? :)

和影子一齐双人舞 2024-09-05 05:42:44

Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a hilariously simple solution:

// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');

// HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
// all anchor tags (see first post for class detail)
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();

// this is the magic! We're making 'href' a required attribute (note the
// asterisk) - now HTML Purifier removes <a></a>, as well as
// <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
// is through with it!
$htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());

It works, it works, bahahahaHAHAHAHAnhͥͤͫ̀ğͮ͑̆ͦó̓̉ͬ͋h́ͧ̆̈́̉ğ̈́͐̈a̾̈́̑ͨô̔̄̑̇g̀̄h̘̝͊̐ͩͥ̋ͤ͛g̦̣̙̙̒̀ͥ̐̔ͅo̤̣hg͓̈́͋̇̓́̆a͖̩̯̥͕͂̈̐ͮ̒o̶ͬ̽̀̍ͮ̾ͮ͢҉̩͉̘͓̙̦̩̹͍̹̠̕g̵̡͔̙͉̱̠̙̩͚͑ͥ̎̓͛̋͗̍̽͋͑̈́̚...! * manic laughter, gurgling noises, keels over with a smile on her face *
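
The RemoveLoneHttp class itself is not shown in this thread, so here is a minimal sketch of what such a transform might look like, assuming it only needs to drop an href whose value is exactly http:/. The class name comes from the snippet above; the body is an assumption. The fallback base class at the top is a hypothetical stand-in that only exists so the sketch can be run without the library installed; with HTML Purifier loaded it is skipped.

```php
<?php
// Stand-in for running this sketch in isolation. It is NOT the library's
// code; it only mirrors the one helper method used below.
if (!class_exists('HTMLPurifier_AttrTransform')) {
    abstract class HTMLPurifier_AttrTransform
    {
        abstract public function transform($attr, $config, $context);

        // Removes $key from $attr (by reference) and returns its old value.
        public function confiscateAttr(&$attr, $key)
        {
            if (!isset($attr[$key])) {
                return null;
            }
            $value = $attr[$key];
            unset($attr[$key]);
            return $value;
        }
    }
}

// Assumed implementation: drops href when only the stripped-down
// protocol "http:/" is left, and leaves every other attribute alone.
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context)
    {
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            $this->confiscateAttr($attr, 'href');
        }
        return $attr;
    }
}
```

Registered through attr_transform_post as in the snippet above, this would leave such anchors without an href, which is the state the required href* declaration is then meant to catch.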

殊姿 2024-09-05 05:42:44

The fact that you can't remove elements with a TagTransform appears to have been an implementation detail. The classic mechanism for removing nodes (a smidge higher-level than just tags) is to use an Injector though.

Anyway, the particular piece of functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty
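
For reference, a minimal sketch of switching that directive on. The include path and the sample input are assumptions; the directive name is the one mentioned above. As far as I can tell from the directive's documentation, it targets elements left with no content, so whether it also catches an anchor that still wraps text is worth checking against your own input.

```php
<?php
// Assumption: adjust this path to wherever HTML Purifier is installed.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Ask HTML Purifier to drop elements that end up empty after purification.
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);

// Hypothetical sample input: an anchor that has no content at all.
$dirty = '<p>Before <a href="javascript:alert(1)"></a> after</p>';
echo $purifier->purify($dirty);
```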

梦里人 2024-09-05 05:42:44

For perusal, this is my current solution. It works, but bypasses HTML Purifier entirely.

/**
 * Removes <a></a> and <a href="http:/"></a> tags from the purified
 * HTML.
 * @todo solve this with an injector?
 * @param string $purified The purified HTML
 * @return string The purified HTML, sans pointless anchors.
 */
private function anchorCull($purified)
{
    if (empty($purified)) return '';
    // re-parse HTML
    $domTree = new DOMDocument();
    $domTree->loadHTML($purified);
    // find all anchors (even good ones)
    $anchors = $domTree->getElementsByTagName('a');
    // collect bad anchors (destroying them in this loop breaks the DOM)
    $destroyNodes = array();
    for ($i = 0; ($i < $anchors->length); $i++) {
        $anchor = $anchors->item($i);
        $href   = $anchor->attributes->getNamedItem('href');
        // <a></a>
        if (is_null($href)) {
            $destroyNodes[] = $anchor;
        // <a href="http:/"></a>
        } else if ($href->nodeValue == 'http:/') {
            $destroyNodes[] = $anchor;
        }
    }
    // destroy the collected nodes
    foreach ($destroyNodes as $node) {
        // preserve content: childNodes is a live list, so keep moving the
        // first child until none are left (an index loop skips siblings)
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // actually destroy the now-empty node
        $node->parentNode->removeChild($node);
    }
    // strip out HTML out of DOM structure string
    $html = $domTree->saveHTML();
    $begin = strpos($html, '<body>') + strlen('<body>');
    $end   = strpos($html, '</body>');
    return substr($html, $begin, $end - $begin);
}

I'd still much rather have a good HTML Purifier solution to this, so, just as a heads-up, this answer won't end up self-accepted. But in case no better answer ends up coming around, at least it might help those with similar issues. :)
