Is this regular expression multibyte safe?
I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works ok. But I have concerns about multibyte characters. Should I specifically handle them to prevent undetermined errors, or should this regex reject mb filenames ok?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, it appears there is no way to create a "multibyte-character string" as such; rather, one chooses to treat a native string as a multibyte-character string by using the special "mbstring" functions. In other words, a PHP string does not know its own character encoding, and you have to keep track of it manually.
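To illustrate the point that a PHP string is just bytes, here is a minimal sketch. It assumes the mbstring extension is loaded and PHP 7 or later (for the \u{...} escape); the filename is a made-up example, not one from the question.

```php
<?php
// Hypothetical filename "straße.png", written with an escape so the
// result does not depend on how this source file is saved.
$name = "stra\u{00DF}e.png";

// strlen() counts bytes; it has no idea the string is UTF-8.
$byteCount = strlen($name);

// mb_strlen() counts characters, but only because we tell it the encoding.
$charCount = mb_strlen($name, 'UTF-8');

// The two bytes UTF-8 uses for ß, shown as hex.
$eszettHex = bin2hex("\u{00DF}");

var_dump($byteCount, $charCount, $eszettHex);
```

The byte count and the character count differ by one here, because ß occupies two bytes in UTF-8 while every other character in the name occupies one.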
You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters as "high" bytes (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might encode part of a multibyte character as a byte that is "special" to PHP, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with multibyte characters across a range of character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 by using the u modifier (PCRE_UTF8).
But of course, as described above, PHP strings don't know their own encoding, so for the mb_ereg family you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that that function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
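Applied to the filename check from the question, the difference between byte mode and the u modifier might look like the sketch below. The filenames are hypothetical, the ctype locale is pinned to "C" as an assumption to keep byte-mode \w ASCII-only, and the explicit ASCII character class is a suggested variation rather than something the original poster used.

```php
<?php
// Pin the ctype locale so that \w without the u modifier is ASCII-only;
// under other locales, byte-mode \w may match additional single bytes.
setlocale(LC_CTYPE, 'C');

$ascii   = 'photo_01.png';
$utf8    = "stra\u{00DF}e.png"; // "straße.png" as UTF-8 (escape needs PHP 7+)
$invalid = "\xC3.png";          // truncated UTF-8 sequence

// Byte mode (the regex from the question): the two bytes of ß are not
// word characters, so the multibyte name is rejected.
$byteAscii = preg_match('!^[\w.-]*$!', $ascii);
$byteUtf8  = preg_match('!^[\w.-]*$!', $utf8);

// UTF-8 mode with an explicit ASCII class: multibyte names are still
// rejected, and malformed UTF-8 makes preg_match() return false.
$strictUtf8    = preg_match('!^[A-Za-z0-9_.-]*$!u', $utf8);
$strictInvalid = preg_match('!^[A-Za-z0-9_.-]*$!u', $invalid);
$badUtf8       = preg_last_error() === PREG_BAD_UTF8_ERROR;

// UTF-8 mode with \w: may accept Unicode letters such as ß, depending on
// whether the PCRE build gives \w Unicode semantics under the u modifier.
$wideUtf8 = preg_match('!^[\w.-]*$!u', $utf8);

var_dump($byteAscii, $byteUtf8, $strictUtf8, $strictInvalid, $badUtf8, $wideUtf8);
```

If the goal is to reject multibyte filenames, spelling out the ASCII ranges avoids depending on how \w is interpreted; checking for a false return separately distinguishes "did not match" from "input was not valid UTF-8".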