How can I strip all JavaScript from an HTML document using PHP?

Posted on 2024-10-05 08:13:26

In my email program I use Tidy to clean up the HTML before sending out the emails. A recurring problem is that when I send a mail whose HTML is fetched from a URL on the web, the document may contain JavaScript.

I want to clean up the HTML document further by stripping out all JavaScript, whether embedded, referenced, or in any other form, so that the mail consists only of HTML.

I want to use PHP's preg_replace() to strip all JavaScript from a mail, and I need some help finding the best regex, because regular expressions are admittedly not my strong point.
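For comparison with a regex-only approach, here is a minimal sketch that parses the document with PHP's DOM extension and removes script elements, on* event-handler attributes, and javascript: URLs. The function name stripJavascript is hypothetical and not part of the original question. The sketch assumes the DOM extension is enabled and that the markup declares its own character set. It is not exhaustive (for example, it does not look inside CSS).

```php
<?php
// Hypothetical helper (an illustration, not from the original post).
function stripJavascript(string $html): string
{
    $doc = new DOMDocument();
    // Markup fetched from the web is often malformed; collect libxml
    // warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // 1. Remove embedded and referenced scripts.
    foreach (iterator_to_array($xpath->query('//script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // 2. Remove inline event handlers and javascript: URLs.
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        // Drop whitespace and control characters before checking the scheme.
        $value = preg_replace('/[\x00-\x20]+/', '', $attr->value);
        $isHandler = stripos($attr->nodeName, 'on') === 0;
        $isJsUrl = stripos((string) $value, 'javascript:') === 0;
        if ($isHandler || $isJsUrl) {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    return $doc->saveHTML();
}
```

A call such as stripJavascript($mailHtml), where $mailHtml is a placeholder for the fetched document, returns the serialized document, including the doctype and the html/body wrappers that the parser adds.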

Comments (5)

梦里兽 2024-10-12 08:13:40

I used this one:

// Remove script, style, head, link and object elements.
function cleanElements($html) {
    $search = array(
        "'<script[^>]*?>.*?</script>'si",  // remove JS
        "'<style[^>]*?>.*?</style>'si",    // remove CSS
        "'<head[^>]*?>.*?</head>'si",      // remove head
        "'<link[^>]*?>'si",                // remove link (a void element, so there is no closing tag)
        "'<object[^>]*?>.*?</object>'si",  // remove object
    );
    // A single empty string replaces every match of every pattern.
    return preg_replace($search, '', $html);
}

http://allenprogram.blogspot.pt/2012/04/php-remove-js-css-head-obj-elements.html

This removes the script, style, head, link and object elements while keeping html and body, so after calling it I run strip_tags on the result.
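One detail worth checking with patterns of this kind: link is a void element and is normally written without a closing tag, so a pattern that requires `</link>` finds nothing in typical markup. The short sketch below illustrates the difference. The variable names and the `~` delimiter are choices made for this example only.

```php
<?php
// Typical markup: <link> has no closing tag.
$html = '<link rel="stylesheet" href="a.css"><p>text</p>';

// A pattern that expects a closing </link> does not match here.
$paired = preg_match('~<link[^>]*?>.*?</link>~si', $html);   // 0

// A pattern that matches the opening tag alone removes the element.
$void = preg_replace('~<link\b[^>]*>~i', '', $html);         // '<p>text</p>'
```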

漆黑的白昼 2024-10-12 08:13:39

This comes with no guarantee (see below), but I tried to write a lightweight solution of my own, because HTML Purifier (http://htmlpurifier.org) is rather large for my small goal.
My goal is to prevent XSS and nothing more, so XSS attempts will come out of this code as a lot of messy markup, but I think the result will be safe:

<?php
// Patterns this function neutralizes:
//   href="javascript:
//   style="...expression
//   style="...behavior
//   <script
//   on*="
$str = '
    asd 
    <a STyLE="asd; expression" hRef=" javascript:" onx="asd">asd</a>
    asd
    <code><a href="javascript:">asd</a></code>
    <scr<script></script>ipt ... >asd</script>
    <a style="hey:good boy;" href="javascript:">asd</a>';

function stripteaser($str, $StripHTMLTags = true, $AllowableTags = NULL) {
    // Protect the contents of <code>...</code>: entity-encode them and
    // swap in unique placeholders that are restored at the end.
    $str = explode('<code>', $str);
    $codes = array();
    if (count($str) > 1) {
        foreach ($str as $idx => $val) {
            $val = explode('</code>', $val);
            if (count($val) > 1) {
                $uid = md5(uniqid(mt_rand(), true));
                $codes[$uid] = htmlentities(array_shift($val), ENT_QUOTES, 'UTF-8');
                $str[$idx] = "##$uid##" . implode('', $val);
            }
        }
    }
    $str = implode('', $str);
    // Entity-encode every "<script" so that no script tag survives.
    while (stripos($str, '<script') !== false) {
        $str = str_ireplace('<script', '&lt;script', $str);
    }
    // Entity-encode each match of $regexp until no match is left.
    $rptjob = function (&$str, $regexp) {
        while (preg_match($regexp, $str, $matches)) {
            $str = str_ireplace($matches[0], htmlentities($matches[0], ENT_QUOTES, 'UTF-8'), $str);
        }
    };
    $rptjob($str, '/href[\s\n\t]*=[\s\n\t]*[\"\'][\s\n\t]*(javascript:|data:)/i'); // href = "javascript:
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\"][^\"]*expression/i'); // style = "...expression
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\'][^\']*expression/i'); // style = '...expression
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\"][^\"]*behavior/i'); // style = "...behavior
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\'][^\']*behavior/i'); // style = '...behavior
    $rptjob($str, '/on\w+[\s\n\t]*=[\s\n\t]*[\"\']/i'); // onasd = "
    if ($StripHTMLTags) {
        $str = strip_tags($str, $AllowableTags);
    }
    // Put the encoded <code> contents back in place of the placeholders.
    foreach ($codes as $idx => $code) {
        $str = str_replace("##$idx##", $code, $str);
    }
    return $str;
}

echo stripteaser($str);
exit;
?>

:D
This is quick, dirty code, and it is not a great job (the many while loops cost some CPU time), but for my small goal it is better than pulling in another huge component like HTML Purifier.

The result will be:

asd 
<a STyLE="asd; expression" hRef=" javascript:" onx="asd">asd</a>
asd
<a href="javascript:">asd</a>
<scri<script></script>pt ... >asd</script>
<a style="hey:good boy;" href="javascript:">asd</a>

I have no experience with CSS expressions, but I know that behavior is used in IE for JS/VML tricks such as curved corners, so it can be dangerous.
And finally: there is no guarantee whatsoever.

I hope it is useful to someone
;)
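The nested `<scr<script>...` case in the sample input is the reason the answer encodes `<script` rather than deleting it. A small sketch of the difference, using a made-up input string and variable names chosen for this illustration:

```php
<?php
// Made-up input: a script tag split around another script tag.
$input = '<scr<script>ipt>alert(1)</scr</script>ipt>';

// Deleting the tags once lets the outer fragments join into a new tag.
$once = str_ireplace(array('<script>', '</script>'), '', $input);
// $once is now '<script>alert(1)</script>'

// Entity-encoding "<script" leaves nothing that can recombine.
$encoded = str_ireplace('<script', '&lt;script', $input);
$hasScriptTag = stripos($encoded, '<script') !== false;   // false
```

Deletion would have to be repeated until the string stops changing, whereas the encoded form cannot reassemble into an opening script tag.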

素手挽清风 2024-10-12 08:13:36

You can use strip_tags, passing the tags you want to allow (a whitelist) as the second parameter, but this does not remove inline JS, which may be present in onclick attributes and the like.

echo strip_tags($html, '<p><a><small>');
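To make the limitation concrete, the following sketch runs strip_tags on a made-up sample string. Allowed tags keep all of their attributes, and the text inside a removed script element stays behind as plain text.

```php
<?php
// Made-up sample input for illustration.
$html = '<p onclick="alert(1)">Hi</p><script>alert(2)</script><b>bold</b>';

$out = strip_tags($html, '<p><a><small>');

// The <script> and <b> tags are gone, but:
//  - the onclick attribute on the allowed <p> tag is still there
//  - the script body "alert(2)" remains as text
$keepsHandler = strpos($out, 'onclick=') !== false;   // true
$keepsScriptText = strpos($out, 'alert(2)') !== false; // true
```

So strip_tags on its own is better suited to reducing the set of tags than to removing JavaScript.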

又爬满兰若 2024-10-12 08:13:34

echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $var); 

As shown here.
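A short usage sketch of this pattern on a made-up input follows. The `i` flag makes the match case-insensitive and the `s` flag lets `.` span newlines. preg_replace returns null when the regex engine fails (for example, when the backtrack limit is hit on a very large document), so the sketch checks for that instead of sending an empty mail. The exception handling is an assumption of this example, not part of the answer.

```php
<?php
// Made-up sample input for illustration.
$var = "<p>Hi</p><SCRIPT type=\"text/javascript\">\nalert(1)\n</SCRIPT><p onclick=\"x()\">Bye</p>";

$clean = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', '', $var);

if ($clean === null) {
    // The regex engine reported an error; do not continue with bad data.
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

// $clean is '<p>Hi</p><p onclick="x()">Bye</p>': the script block is removed,
// while inline handlers such as onclick are left untouched.
```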
