Security issues with a regex-based HTML sanitizer

Posted on 2024-10-08 22:34:09

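I know that parsing HTML with regular expressions is a bad idea and cannot work in every case (there are plenty of threads about that on Stack Overflow). I still wanted to try sanitizing HTML with regular expressions, using a whitelist approach.

I would like to show you my code below (written in PHP 5.2). It seems to work fine, but I am still wondering whether it has security issues.

So, did I get something wrong?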

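The basic principle is to call Html_Sanitizer::sanitize():

  1. The function first replaces allowed tags that have no attributes with tokens. It then finds allowed tags that have attributes and replaces them with tokens as well.
  2. Each tag with attributes is parsed to pick out the allowed attributes (using the cleanTag function), so the tag is rebuilt in a (hopefully) safe way.
  3. htmlspecialchars is applied to make sure the remaining text is escaped.
  4. The tokens are replaced with the safe tags.

Code: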

class Html_Sanitizer
{
    const VALIDATOR_CSS_UNIT = '(([\+\-]?[0-9\.]+)(em|ex|px|in|cm|mm|pt|pc|\%))|0';
    const VALIDATOR_URL = 'http://\\S+';
    const VALIDATOR_CSS_PROPERTY = '[a-z\-]+';
    const VALIDATOR_STYLE = '[^"]*';

    protected static $_tags = 'a|b|blockquote|br|cite|d[ldt]|h[1-6]|i|img|li|ol|p|span|strong|u|ul';

    protected static $_attributes = array(
        'img' => array(
            'width' => '[0-9]+',
            'height' => '[0-9]+',
            'src' => self::VALIDATOR_URL,
            'style' => self::VALIDATOR_STYLE
            ),
        'span' => array(
            'style' => self::VALIDATOR_STYLE
            ),
        'p' => array(
            'style' => self::VALIDATOR_STYLE
            ),
        'a' =>  array(
            'href' => self::VALIDATOR_URL
            )
    );

    protected static $_styleValidators = array(
        'color' => '(\#[a-fA-F0-9]+)|([a-z ]+)',
        'background-color' => '\#[a-zA-Z0-9]+',
        'font-style' => '(normal|italic|oblique)',
        'font-size' => '[\-a-z]+',
        'margin-left' => self::VALIDATOR_CSS_UNIT,
        'margin-right' => self::VALIDATOR_CSS_UNIT,
        'text-align' => '(left|right|center|justify)',
        'text-indent' => self::VALIDATOR_CSS_UNIT,
        'text-decoration' => '(none|overline|underline|blink|line-through)',
        'width' => self::VALIDATOR_CSS_UNIT,
        'height' => self::VALIDATOR_CSS_UNIT
    );

    public static function sanitize($str)
    {
        $tokens = array();

        // tokenize opening and closing tags with no attributes
        $pattern = '#<(/)?('. self::$_tags .')>#';
        $replace = '__SAFE_TAG_$1$2__';
        $str = preg_replace($pattern, $replace, $str);

        // tokenize tags with attributes
        $pattern = '#<('. self::$_tags .')(?:\s+(?:[a-z]+)="(?:[^"\\\]*(?:\\\"[^"\\\]*)*)")*\s*(/)?>#';
        preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);
        foreach($matches as $i => $match) {
            $tokens[$i] = self::cleanTag($match[1], $match[0]);
            $str = str_replace($match[0], '__SAFE_TOKEN_'.$i.'__', $str);
        }

        $str = htmlspecialchars($str);

        foreach ($tokens as $i => $cleanTag) {
            $str = str_replace('__SAFE_TOKEN_'.$i.'__', $cleanTag, $str);
        }

        $pattern = '#__SAFE_TAG_(/?(?:'. self::$_tags .'))__#';
        $replace = '<$1>';
        $str = preg_replace($pattern, $replace, $str);

        return $str;
    }

    public static function cleanTag($tag, $str)
    {
        $cleanTag = '<' . $tag;

        if ($tag === 'a') {
            $cleanTag .= ' rel="nofollow" target="_blank"';
        }

        if (isset(self::$_attributes[$tag])) {
            foreach(self::$_attributes[$tag] as $attr => $attrPattern) {
                $pattern = '#'.$attr.'="('. $attrPattern .')"#';
                preg_match($pattern, $str, $match);
                if (isset($match[1])) {
                    if ($attr == 'style') {
                        $cleanTag .= ' style="' . self::cleanStyle($match[1]) . '"';
                    } else {
                        $cleanTag .= ' ' . $attr . '="' . $match[1] . '"';
                    }
                }
            }
        }

        if ($tag === 'img') {
            $cleanTag .= ' /';
        }

        $cleanTag .= '>';
        return $cleanTag;
    }

    public static function cleanStyle($style)
    {
        $cleanStyle = '';

        foreach(self::$_styleValidators as $stl => $stlPattern) {
            $pattern = '#[; ]?' . $stl . '\s*:\s*(' . $stlPattern . ')\s*;#i';
            preg_match($pattern, $style, $match);
            if (isset($match[1])) {
                $cleanStyle .= ($cleanStyle ? ' ' : '') . $stl . ':' . $match[1] . ';';
            }
        }

        return $cleanStyle;
    }
}

Comments (1)

淡淡的优雅 2024-10-15 22:34:09

I can use the style-related attributes to deface your site and, in some browsers, crash the browser by setting a huge width and height combination.
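
One way to narrow the width and height vector can be sketched as follows. This is an illustration under assumptions, not part of the posted class: the helper name `isBoundedLength` and the 2000-pixel ceiling are hypothetical. The idea is to cap numeric sizes instead of accepting any run of digits, which is what the `[0-9]+` and `VALIDATOR_CSS_UNIT` patterns currently allow.

```php
<?php
// Hypothetical helper (not in the original class): accept a size only if it
// is a plain pixel length at or below a ceiling chosen by the site owner.
function isBoundedLength($value, $maxPx = 2000)
{
    // 1 to 5 digits with an optional "px" suffix; the D modifier stops "$"
    // from matching before a trailing newline.
    if (!preg_match('/^([0-9]{1,5})(?:px)?$/D', $value, $m)) {
        return false;
    }
    return (int) $m[1] <= $maxPx;
}

var_dump(isBoundedLength('300px'));    // bool(true)
var_dump(isBoundedLength('999999px')); // bool(false)
```

A cap like this only addresses oversized dimensions. It does not stop other kinds of defacement through the remaining style properties.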

I can use styles to do even more defacing.

I can use the image src attribute to track your users' movements, to spy on them and to make money.
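
A possible mitigation for the tracking concern is to accept image URLs only from hosts the site trusts. The sketch below assumes a hypothetical helper `isAllowedImageUrl` and a made-up host `images.example.com`; neither appears in the posted code. It relies on PHP's `parse_url`, which splits a URL into parts but does not validate it, so treat this as a starting point.

```php
<?php
// Hypothetical helper (not in the original class): allow an image URL only
// when its scheme is http/https and its host is on an explicit allowlist.
function isAllowedImageUrl($url, array $allowedHosts)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return false;
    }
    // Strict comparison against the allowlist; hosts are compared lowercased.
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

// Example allowlist; the host name is an assumption for illustration.
$hosts = array('images.example.com');
var_dump(isAllowedImageUrl('http://images.example.com/a.png', $hosts));          // bool(true)
var_dump(isAllowedImageUrl('http://tracker.example.net/pixel.gif?u=42', $hosts)); // bool(false)
```

Serving user images only from a host the site controls means a third party no longer receives a request each time a page is viewed.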

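These are all things I can do with your class even if you coded it perfectly as intended. Odds are you didn't, and there are probably other holes that open access to even more devious things.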
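
The answer does not name a specific hole. One candidate, found by reading the posted regexes, is sketched below. The two patterns are copied from sanitize() and cleanTag() with the tag list reduced to `a`; the input string and variable names are assumptions for illustration, and the rebuilt tag omits the rel and target attributes for brevity. The tokenizer accepts backslash-escaped quotes inside an attribute value, and `\S+` in VALIDATOR_URL also matches quotes, so raw double quotes can survive into the rebuilt tag.

```php
<?php
// Patterns taken from the posted class, limited to the <a> tag.
$tagPattern  = '#<(a)(?:\s+(?:[a-z]+)="(?:[^"\\\]*(?:\\\"[^"\\\]*)*)")*\s*(/)?>#';
$hrefPattern = '#href="(http://\\S+)"#';

// Illustrative input: the href value contains two backslash-escaped quotes.
$input = '<a href="http://example.com/\"title=injected\"">';

// Step 1: the tokenizer treats the whole string as one allowed tag.
$accepted = preg_match($tagPattern, $input, $tag);

// Step 2: cleanTag() extracts href; \S+ swallows the embedded quotes.
preg_match($hrefPattern, $tag[0], $href);

// Step 3: the value is concatenated back without any escaping.
$rebuilt = '<a href="' . $href[1] . '">';
echo $rebuilt, "\n";
// prints: <a href="http://example.com/\"title=injected\"">
```

HTML has no backslash escaping, so the first quote after the backslash ends the href value and the text that follows is parsed as further attribute content. A harmless title is used here; the same position could hold an event-handler attribute. Passing each captured value through htmlspecialchars before concatenating it would turn the stray quotes into `&quot;`.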
