Is this regular expression multibyte-safe?

Posted 2024-10-24 07:52:38

I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:

preg_match('!^[\w.-]*$!',$filename) 

This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent unexpected errors, or will this regex reliably reject multibyte filenames?

我不会写诗 2024-10-31 07:52:38

PHP does not have "native" support for multibyte characters; you need the "mbstring" extension (which may or may not be available). Furthermore, there appears to be no way to create a "multibyte-character string" as such. Rather, one chooses to treat a native string as a multibyte-character string by using the special "mbstring" functions. In other words, a PHP string does not know its own character encoding: you have to keep track of it manually.

You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes every byte of a multibyte character as a "high" byte, with the top bit set (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other byte. You would not be able to use an encoding that might encode part of a multibyte character as a "special" PHP byte, such as 0x22, the "double-quote" symbol.
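
To make the byte-level point concrete, here is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escape. In UTF-8, ß (U+00DF) is the two bytes 0xc3 0x9f. The function name is_ascii_filename is hypothetical. The explicit character class, the + quantifier and the D modifier are substitutions for the question's \w, * and plain $, chosen so that the result does not depend on the locale, empty names are refused, and a trailing newline is not accepted.

```php
<?php
// ß (U+00DF) written with a Unicode escape, so this file's own encoding does not matter.
$sharpS = "\u{00DF}";

echo bin2hex($sharpS), "\n"; // c39f: two bytes, both >= 0x80
echo strlen($sharpS), "\n";  // 2: strlen() counts bytes, not characters

// Hypothetical helper: accept only ASCII letters, digits, underscore, hyphen and dot.
function is_ascii_filename(string $filename): bool
{
    // No "u" modifier: the subject is matched byte by byte, so any byte
    // >= 0x80 (every byte of a multibyte UTF-8 character) fails the class.
    // "D" makes $ match only at the very end, not before a final newline.
    return preg_match('/^[A-Za-z0-9_.-]+$/D', $filename) === 1;
}

var_dump(is_ascii_filename('photo_01.jpg'));      // bool(true)
var_dump(is_ascii_filename("stra\u{00DF}e.jpg")); // bool(false)
var_dump(is_ascii_filename("photo.jpg\n"));       // bool(false)
```

If the goal is simply to refuse multibyte filenames, a byte-wise ASCII whitelist like this one is enough, and no mbstring support is needed.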

The only regular expression functions in PHP that know how to deal with specific multibyte characters across a range of character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.

PCRE-based regular expression functions like preg_match support UTF-8 through the u modifier (PCRE_UTF8).
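
If the aim is instead to accept non-ASCII letters, the u modifier could be used roughly as follows. This is a sketch under assumptions, not a definitive implementation: it assumes PHP 7.1 or later and a PCRE build with Unicode support, the helper name is_unicode_filename is hypothetical, and the + quantifier and D modifier are additions to the question's pattern. With u, preg_match() also refuses subjects that are not valid UTF-8, which the sketch reports as null.

```php
<?php
// Hypothetical helper: true/false for a valid UTF-8 name, null if the
// subject is not valid UTF-8 (or preg_match() fails for another reason).
function is_unicode_filename(string $filename): ?bool
{
    // "u": pattern and subject are treated as UTF-8, and \w covers Unicode
    // letters and digits. "D": $ matches only at the very end of the subject.
    $result = preg_match('/^[\w.-]+$/Du', $filename);
    if ($result === false) {
        return null;
    }
    return $result === 1;
}

var_dump(is_unicode_filename('photo_01.jpg'));       // bool(true)
var_dump(is_unicode_filename("stra\u{00DF}e.jpg"));  // bool(true)
var_dump(is_unicode_filename('two words.jpg'));      // bool(false)
var_dump(is_unicode_filename("bad\xC3.jpg"));        // NULL (0xC3 is not followed by a continuation byte)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Returning null separately from false lets the caller tell "name contains a disallowed character" apart from "name is not valid UTF-8 at all".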

But of course, as described above, PHP strings don't know their own encoding, so before calling the mb_ereg functions you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
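
For the mbstring route, a hedged sketch might look like the following. It assumes PHP 7.1 or later. The helper name is_mb_filename is hypothetical, and it returns null when the mbstring regex functions are not compiled in. The \A and \z anchors are used instead of ^ and $ on the assumption that the default mbstring regex syntax treats ^ and $ as line anchors. Whether \w matches non-ASCII letters here depends on the mbstring regex options in effect, so the example only checks ASCII input.

```php
<?php
// Hypothetical helper; returns null when the mbstring regex functions are missing.
function is_mb_filename(string $filename): ?bool
{
    if (!function_exists('mb_ereg') || !function_exists('mb_regex_encoding')) {
        return null; // mbstring (or its regex support) is not available
    }
    // Declare the encoding of the strings being matched.
    // Note: this changes a process-wide setting.
    mb_regex_encoding('UTF-8');
    // \A and \z anchor to the whole string. Unlike preg_match(),
    // mb_ereg() patterns are written without delimiters.
    return mb_ereg('\A[\w.-]+\z', $filename) !== false;
}

var_dump(is_mb_filename('photo_01.jpg'));  // true when mbstring regex support is present
var_dump(is_mb_filename('two words.jpg')); // false when mbstring regex support is present
```

Because mb_regex_encoding() sets global state, code that mixes encodings would need to set it before each group of mb_ereg calls.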
