What is the most efficient way to whitelist UTF-8 characters in PHP?
My goal is to protect my web site from attacks by creating a strict whitelist of allowed characters for any and all POST data received from the client side.
This is a piece of cake when staying within ASCII characters. Something like:
// Matches any character outside a-z, A-Z and 0-9
if (preg_match('/[^a-zA-Z0-9]/', $stringToTest))
{
    // Battle stations!!
}
However, I need to be able to allow any and all UTF-8 characters, especially Asian character sets like Japanese, Chinese, and Korean. But I don't want to exclude anybody with wacky characters, like Arabic or Russian, or whatever. One world, one love! ;)
How can I allow people to input the characters of their native language while excluding the nasties used in evil scripts, like *, ?, angle brackets, and so on?
Comments (4)
\w will give you word characters (letters, digits, and underscores), which is probably what you're after, and \s matches whitespace.
e.g.
regular-expressions.info is an excellent reference for this stuff; here and here are a couple of relevant pages :)
Edit: some more clarification needed, sorry!
Here's what I usually use for CJK:
To get everything that could be a problem for escaping and other black-hat stuff, use:
/[^\p{Punctuation}]/ ( == /[^\p{P}]/ )
or
/[^\041-\176]/ ( == /[^!-~]/ )
Another good link
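As a rough sketch of how the \w and \s idea extends to UTF-8 input, the snippet below uses PCRE Unicode properties together with the /u modifier. The function name containsDisallowed and the exact set of allowed classes (letters, combining marks, numbers, whitespace) are assumptions chosen for illustration, not something taken from the answer.

```php
<?php
// Hypothetical helper: returns true when $input contains anything outside
// Unicode letters (\p{L}), combining marks (\p{M}), numbers (\p{N}) or whitespace.
function containsDisallowed(string $input): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8; that case is treated
    // as disallowed here as well.
    $result = preg_match('/[^\p{L}\p{M}\p{N}\s]/u', $input);
    return $result !== 0;
}

var_dump(containsDisallowed('日本語 한국어 Русский')); // bool(false)
var_dump(containsDisallowed('<script>'));              // bool(true)
```

Note that this version also rejects ordinary punctuation such as full stops and question marks, so the allowed classes would likely need widening for free-text fields.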
For some things you can Base64-encode the data, but I've had to remove a tiny bit of functionality where that isn't doable, since keeping all characters seems more important and it's certainly not worth any more time right now.
...
Having said that, I came across the page below. If you want a generic function, the issue then becomes efficiency because there are so many characters, but that isn't a huge problem (Chinese, Russian, and Greek may have separate web pages, etc.).
http://www.php.net/manual/en/regexp.reference.unicode.php
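The Unicode reference linked above lists script properties such as \p{Han} and \p{Cyrillic}. A minimal sketch of per-language patterns built from them follows; the $scriptPatterns map and the matchesScript helper are hypothetical names, and the choice of scripts is only an illustration of the "separate web pages" idea.

```php
<?php
// Hypothetical map from a language label to a pattern that accepts only
// that script plus numbers and whitespace. Script names are PCRE properties.
$scriptPatterns = [
    'chinese'  => '/\A[\p{Han}\p{N}\s]+\z/u',
    'japanese' => '/\A[\p{Han}\p{Hiragana}\p{Katakana}\p{N}\s]+\z/u',
    'korean'   => '/\A[\p{Hangul}\p{N}\s]+\z/u',
    'russian'  => '/\A[\p{Cyrillic}\p{N}\s]+\z/u',
    'greek'    => '/\A[\p{Greek}\p{N}\s]+\z/u',
];

// Returns true only when the language is known and the whole input
// consists of characters allowed for that language.
function matchesScript(array $patterns, string $language, string $input): bool
{
    return isset($patterns[$language])
        && preg_match($patterns[$language], $input) === 1;
}

var_dump(matchesScript($scriptPatterns, 'russian', 'Привет мир')); // bool(true)
var_dump(matchesScript($scriptPatterns, 'greek', 'Привет мир'));   // bool(false)
```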
Try inverting the test - use a blacklist instead of a whitelist. e.g.
Regex might not be quite right, but you get the idea.
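A minimal sketch of the blacklist idea, assuming the characters named in the question (*, ?, and angle brackets) are the ones to reject. The function name containsBlacklisted and the character list are hypothetical and would need adapting to the real threat model.

```php
<?php
// Hypothetical blacklist check: returns true when $input contains any of
// the characters named in the question. Everything else, including CJK,
// Arabic or Cyrillic text, is accepted.
function containsBlacklisted(string $input): bool
{
    // Inside a character class, * and ? are literal characters.
    // Invalid UTF-8 makes preg_match() return false, which is also rejected.
    return preg_match('/[*?<>]/u', $input) !== 0;
}

var_dump(containsBlacklisted('こんにちは'));  // bool(false)
var_dump(containsBlacklisted('a<b'));         // bool(true)
```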
I doubt you can protect anything this way.
You will just complicate matters for legitimate users without stopping malicious ones.
I would just quit a site that won't let me enter a question mark, a quote, or an e-mail address.
A simple dot is surely among the "nasties used in evil scripts", but any message without it would look ugly.
Meanwhile, SQL injection can be done using alphabetic characters only.
I see no sense in such "protection".
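To illustrate the point about injection without special characters, here is a small sketch. The passesWhitelist helper, the users table, and the query string are all hypothetical; no database is involved, the snippet only shows that a value made of letters, digits, and spaces passes a strict whitelist and still changes the meaning of a concatenated query.

```php
<?php
// Hypothetical whitelist allowing only ASCII letters, digits and spaces.
function passesWhitelist(string $input): bool
{
    return preg_match('/[^a-zA-Z0-9 ]/', $input) === 0;
}

$id = '1 OR 1'; // contains only digits, letters and spaces

// Deliberately unsafe: the value is concatenated straight into the query.
$query = 'SELECT * FROM users WHERE id = ' . $id;

var_dump(passesWhitelist($id)); // bool(true)
echo $query, "\n";              // SELECT * FROM users WHERE id = 1 OR 1
```

In dialects such as MySQL that treat 1 as true, the resulting WHERE clause matches every row even though the input passed the character filter.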