What is the most efficient way to whitelist UTF-8 characters in PHP?

Posted 2024-10-18 15:26:54 · 348 words · 11 views · 0 comments

My goal is to protect my web site from attacks by creating a strict whitelist of allowed characters for any and all POST data received from the client side.

This is a piece of cake when staying within ASCII characters. Something like:

if(preg_match('/[^a-zA-Z0-9]/', $stringToTest))
{
   // Battle stations!!
}

However, I need to be able to allow any and all UTF-8 characters, especially Asian character sets like Japanese, Chinese, and Korean. But I don't want to exclude anybody with wacky characters, like Arabic or Russian, or whatever. One world, one love! ;)

How can I allow people to input the characters of their native language while excluding the nasties used in evil scripts, like *, ?, angle brackets, and so on?

剩余の解释 2024-10-25 15:26:54

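\w will give you word characters (letters, digits, and underscores), which is probably what you're after, plus \s for whitespace. Add the u modifier so that the pattern and subject are treated as UTF-8 and \w also matches non-ASCII letters.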

e.g.

if(preg_match('/[^\w\s]/u', $stringToTest))
{
   // Battle stations!!
}
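
As a rough sketch of the same idea spelled out with Unicode properties instead of \w, the check below flags any character that is not a letter, combining mark, number, underscore, or whitespace. The function name has_disallowed_chars is hypothetical and not part of the answer, and the allowed set is an assumption you would adjust to your own needs.

```php
<?php
// Hypothetical helper (not from the original answer).
// Returns true when $input contains a character outside the allowed set,
// or when $input is not valid UTF-8.
function has_disallowed_chars(string $input): bool
{
    // \p{L} letters, \p{M} combining marks, \p{N} numbers, plus "_" and whitespace.
    // With the u modifier, preg_match() returns false on invalid UTF-8,
    // so anything other than 0 ("no disallowed character found") is treated as a rejection.
    $result = preg_match('/[^\p{L}\p{M}\p{N}_\s]/u', $input);
    return $result !== 0;
}

if (has_disallowed_chars('<script>')) {
    // Battle stations!!
}
```

Treating invalid UTF-8 as a rejection is a design choice here; it avoids silently accepting byte sequences that the regex engine could not inspect.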

regular-expressions.info is an excellent reference for this stuff - here and here are a couple of relevant pages :)

edit: some more clarification needed, sorry!

here's what I usually use for CJK:

function get_CJK_ranges() {

    return array(
                "[\x{2E80}-\x{2EFF}]",      # CJK Radicals Supplement
                "[\x{2F00}-\x{2FDF}]",      # Kangxi Radicals
                "[\x{2FF0}-\x{2FFF}]",      # Ideographic Description Characters
                "[\x{3000}-\x{303F}]",      # CJK Symbols and Punctuation
                "[\x{3040}-\x{309F}]",      # Hiragana
                "[\x{30A0}-\x{30FF}]",      # Katakana
                "[\x{3100}-\x{312F}]",      # Bopomofo
                "[\x{3130}-\x{318F}]",      # Hangul Compatibility Jamo
                "[\x{3190}-\x{319F}]",      # Kanbun
                "[\x{31A0}-\x{31BF}]",      # Bopomofo Extended
                "[\x{31F0}-\x{31FF}]",      # Katakana Phonetic Extensions
                "[\x{3200}-\x{32FF}]",      # Enclosed CJK Letters and Months
                "[\x{3300}-\x{33FF}]",      # CJK Compatibility
                "[\x{3400}-\x{4DBF}]",      # CJK Unified Ideographs Extension A
                "[\x{4DC0}-\x{4DFF}]",      # Yijing Hexagram Symbols
                "[\x{4E00}-\x{9FFF}]",      # CJK Unified Ideographs
                "[\x{A000}-\x{A48F}]",      # Yi Syllables
                "[\x{A490}-\x{A4CF}]",      # Yi Radicals
                "[\x{AC00}-\x{D7AF}]",      # Hangul Syllables
                "[\x{F900}-\x{FAFF}]",      # CJK Compatibility Ideographs
                "[\x{FE30}-\x{FE4F}]",      # CJK Compatibility Forms
                "[\x{1D300}-\x{1D35F}]",    # Tai Xuan Jing Symbols
                "[\x{20000}-\x{2A6DF}]",    # CJK Unified Ideographs Extension B
                "[\x{2F800}-\x{2FA1F}]"     # CJK Compatibility Ideographs Supplement
    );

}

function contains_CJK($string) {
    $regex = '/'.implode('|',get_CJK_ranges()).'/u';
    return preg_match($regex,$string);
}
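
A shorter variant, offered only as a sketch, is to lean on PCRE's Unicode script properties rather than explicit code point ranges. The function name contains_CJK_script is hypothetical. Note that it is not a drop-in replacement: it covers the Han, Hiragana, Katakana, and Hangul scripts, but not every block in the list above (for example Yi, Bopomofo, or CJK Symbols and Punctuation).

```php
<?php
// Hypothetical alternative to contains_CJK(), using script properties.
function contains_CJK_script(string $string): bool
{
    // The u modifier makes PCRE read the pattern and subject as UTF-8.
    return preg_match('/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/u', $string) === 1;
}

var_dump(contains_CJK_script('漢字'));   // Han ideographs
var_dump(contains_CJK_script('hello'));  // plain ASCII
```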

To catch everything that could be a problem for escaping and other black-hat stuff, use:

/[^\p{P}]/u (\p{P} is the short name of the Unicode Punctuation property)

or

/[^\041-\176]/ ( == /[^!-~]/ )
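
One caveat worth illustrating: in Unicode, angle brackets are classified as math symbols (\p{S}), not punctuation (\p{P}), so a punctuation-only class will not see them. The sketch below uses a non-negated class so that it lists the characters found; the function name find_punct_or_symbols is hypothetical and not from the answer.

```php
<?php
// Hypothetical helper: returns every punctuation or symbol character in $input.
function find_punct_or_symbols(string $input): array
{
    // \p{P} = punctuation (e.g. * and ?), \p{S} = symbols (e.g. < and >).
    preg_match_all('/[\p{P}\p{S}]/u', $input, $matches);
    return $matches[0];
}

print_r(find_punct_or_symbols('a<b>*c?'));
```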

another good link

天邊彩虹 2024-10-25 15:26:54

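For some things you could Base64-encode, but I had to drop some features that weren't feasible, because keeping all the characters seemed more important, and it certainly isn't worth spending more time on right now.

...

Having said that, I came across this. If you want a general-purpose function, the problem seems to become one of efficiency, since there are so many characters, but that's not a big issue (Chinese, Russian, and Greek would probably have separate web pages, etc.).

http://www.php.net/manual/en/regexp.reference.unicode.php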

蓝天白云 2024-10-25 15:26:54

Try inverting the test - use a blacklist instead of a whitelist. e.g.

if(preg_match('/[\*\?<>]/', $stringToTest))
{
    // Battle stations!!
}

Regex might not be quite right, but you get the idea.

赏烟花じ飞满天 2024-10-25 15:26:54

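I doubt you can protect anything this way.
You will only make things complicated for fair users, but you won't stop a malicious one.

I'd quit a site that doesn't allow me to enter a question mark, a quote, or an email address.
A simple dot is surely one of the "nasties used in evil scripts". But any message would look ugly without it.

And SQL injection can be done using only alphabetic characters.

I see no point in this kind of "protection".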
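
To make the point concrete, here is a small sketch with a hypothetical letters-and-whitespace whitelist (the function name and sample strings are illustrative assumptions, not from the answer). It rejects an ordinary question while accepting a string made only of letters and spaces that reads as SQL keywords. Whether such a string is actually exploitable depends on how the query is built, which the whitelist cannot know.

```php
<?php
// Hypothetical strict whitelist: letters and whitespace only.
function passes_letter_whitelist(string $input): bool
{
    return preg_match('/^[\p{L}\s]*$/u', $input) === 1;
}

$honest     = 'How are you?'; // an ordinary message with a question mark
$suspicious = 'x OR TRUE';    // letters and spaces only

var_dump(passes_letter_whitelist($honest));     // the fair user is blocked
var_dump(passes_letter_whitelist($suspicious)); // the keyword string gets through
```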
