Allowing users to submit HTML in PHP

Posted 2024-08-03 15:40:14

I want to allow a lot of user-submitted HTML in user profiles. I currently try to filter out what I don't want, but I now want to switch to a whitelist approach.

Here is my current non-whitelist approach:

function FilterHTML($string) {
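    // get_magic_quotes_gpc() was removed in PHP 8, so check that it exists first
    if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) {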
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
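    // convert decimal (the /e modifier was removed in PHP 7, so use a callback)
    $string = preg_replace_callback('/&#(\d+)/m', function ($m) { return chr((int) $m[1]); }, $string); // decimal notation
    // convert hex
    $string = preg_replace_callback('/&#x([a-f0-9]+)/mi', function ($m) { return chr(hexdec($m[1])); }, $string); // hex notation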
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
        $oldstring = $string;
        //bgsound|
        $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

The above works pretty well; I have never had any problems in two years of using it. But for a whitelist approach, is there anything similar to Stack Overflow's C# method, but in PHP?
http://refactormycode.com/codes/333-sanitize-html

Comments (7)

反话 2024-08-10 15:40:14

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

維他命╮ 2024-08-10 15:40:14

It may be safer to parse the markup properly with DOMDocument, remove disallowed tags with removeChild(), and then serialize the result.
Filtering with regular expressions is not always safe, especially once the patterns become this complex. Attackers can find ways to trick your filters; forums and social networks know that very well.

For instance, browsers ignore spaces after the <. Your regex filters <script, but if I use < script... big FAIL!
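
To make the DOMDocument suggestion concrete, here is a minimal sketch of a whitelist pass. The tag list, the attribute list, the allowed link schemes, and the function names cleanChildren and sanitizeHtml are all assumptions chosen for illustration, not part of the answer above, and this is not a substitute for an audited library.

```php
<?php
// Hypothetical whitelist: tag name => attribute names allowed on that tag.
$allowedTags = [
    'b' => [], 'i' => [], 'u' => [], 'strong' => [], 'em' => [],
    'a' => ['href'],
];

// Recursively clean the children of $node in place.
function cleanChildren(DOMNode $node, array $allowed): void
{
    // Iterate over a copy: childNodes is live and changes as nodes are removed.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept
        }
        if (!($child instanceof DOMElement)) {
            $node->removeChild($child); // comments, processing instructions
            continue;
        }
        $tag = strtolower($child->tagName);
        if ($tag === 'script' || $tag === 'style') {
            $node->removeChild($child); // drop the element and its content
            continue;
        }
        cleanChildren($child, $allowed); // clean descendants first
        if (!array_key_exists($tag, $allowed)) {
            // Unwrap: move the cleaned children up, then remove the tag itself.
            while ($child->firstChild !== null) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
            continue;
        }
        // Whitelisted tag: strip every attribute that is not allowed on it.
        foreach (iterator_to_array($child->attributes) as $attr) {
            if (!in_array(strtolower($attr->name), $allowed[$tag], true)) {
                $child->removeAttributeNode($attr);
            }
        }
        // Keep href only for http(s) and mailto links (an assumed policy).
        if ($child->hasAttribute('href')
            && !preg_match('#^\s*(https?://|mailto:)#i', $child->getAttribute('href'))) {
            $child->removeAttribute('href');
        }
    }
}

function sanitizeHtml(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The XML prefix makes libxml read the input as UTF-8; the div is a wrapper.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    cleanChildren($root, $allowed);

    // Serialize only the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = '<p>Hi <b onclick="steal()">there</b><script>alert(1)</script> '
       . '<a href="javascript:alert(2)">link</a></p>';
echo sanitizeHtml($dirty, $allowedTags), "\n";
```

Because the markup is parsed into a tree before anything is removed, tricks that rely on odd spacing or nesting have to get past the parser rather than a pattern. Attributes are checked per tag, which is the part a tag-only filter misses.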

小傻瓜 2024-08-10 15:40:14

HTML Purifier is the best HTML parser/cleaner out there.

缱倦旧时光 2024-08-10 15:40:14

For those of you suggesting simply using strip_tags, be aware: strip_tags will NOT strip out tag attributes, and broken tags will also trip it up.

From the manual page:

Warning: Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.

Warning: This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

You CANNOT rely on just this one solution.
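
A short sketch of the attribute problem the manual warns about. The input string is a made-up example; only strip_tags() itself comes from PHP.

```php
<?php
// Hypothetical profile snippet: an allowed tag carrying an event handler,
// plus a script element that is not on the allowed list.
$input = '<b onmouseover="alert(1)">hover me</b><script>alert(2)</script>';

// Only <b> is allowed; every other tag is stripped.
$output = strip_tags($input, '<b>');

echo $output, "\n";
```

The script tags are removed, but the onmouseover handler on the allowed b tag is left in place, and the text that was inside the script element stays in the output as plain text.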

对风讲故事 2024-08-10 15:40:14

You can just use the strip_tags() function.

Since the function is defined as

string strip_tags  ( string $str  [, string $allowable_tags  ] )

You can do this:

$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');

But note that strip_tags cannot filter out attributes, e.g.

<a href="javascript:alert('haha caught cha!');">link</a>

早乙女 2024-08-10 15:40:14

Try the getCleanHTML function below. It extracts the text content of every element, except that elements whose tag name is in the whitelist are kept as HTML. The code is clean and easy to understand and debug.

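<?php

// Tags that are copied to the output as HTML.
$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

// Serialize a single node (and its subtree) to an HTML string.
function getHTMLCode($Node) {
    $Document = new DOMDocument();
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}

// Walk the tree: whitelisted elements are kept as HTML, everything else is
// reduced to its text content.
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;

    // Text, comment and other non-element nodes have no tag name;
    // append their text content.
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;

    // Note: a whitelisted element is copied with all of its attributes and
    // descendants, so attributes such as onclick still need separate filtering.
    if (in_array($Node->tagName, $TagWhiteList))
        return $Text.getHTMLCode($Node);

    // Recurse into every child; the loop is skipped for childless elements,
    // which the original while loop did not handle.
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getCleanHTML($Child, $Text);

    return $Text;
}

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";

?>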

Hope this helps.

唯憾梦倾城 2024-08-10 15:40:14

It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.

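function sanitize($html) {
  $whitelist = array(
    'b', 'i', 'u', 'strong', 'em', 'a'
  );

  // "^" only negates inside a character class, so a negative lookahead is
  // used to match any opening or closing tag whose name is not whitelisted.
  // Only the tags are removed; their text content and the attributes of
  // whitelisted tags are left in place.
  $allowed = implode("|", $whitelist);
  return preg_replace("/<\/?(?!(?:".$allowed.")\b)[a-z][^>]*>/i", "", $html);
}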

I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.

Jamie
