Matching Unicode letter characters in PCRE/PHP
I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with the plain ASCII alphabet, but it seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than it actually does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF-8 encoding on the form page.
I think the problem is much simpler than that: you forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode. Your regex should be:
$namePattern = "/^([\\p{L}'\\- ])+$/u";
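To see what the modifier changes, here is a minimal sketch that runs the same character class with and without u. The helper name matchesName() is made up for illustration, and the script assumes the file is saved as UTF-8 and that PCRE was built with Unicode support.

```php
<?php
// Sketch only: matchesName() is a hypothetical helper, not from the original post.
// It applies the question's character class, optionally in UTF-8 mode.
function matchesName(string $name, bool $utf8Mode): bool
{
    // Single quotes keep the backslashes literal, so \p{L} reaches PCRE unchanged.
    $pattern = '/^[\p{L}\'\- ]+$/' . ($utf8Mode ? 'u' : '');

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $name) === 1;
}

// Assumes this source file is saved as UTF-8.
foreach (["O'Brien", 'Ă', '张'] as $name) {
    printf(
        "%s: without u = %s, with u = %s\n",
        $name,
        var_export(matchesName($name, false), true),
        var_export(matchesName($name, true), true)
    );
}
```

Without u, PCRE reads the subject byte by byte, so a multi-byte character such as Ă is seen as several separate bytes rather than one letter. That is why the ASCII name passes either way while the others only pass in UTF-8 mode.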
If anyone else lands here and can't get this to work, please note that /u does not produce consistent results with Unicode scripts across different PHP versions. See this example: https://3v4l.org/4hB9e
Related: Inconsistent regex result for Thai characters across different PHP versions
If you want to replace a Unicode old pattern with a new pattern, the key is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5, as mentioned at php.net | Pattern Modifiers.
Thanks to AgreeOrNot, who gave me that key in "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but when I tried it on the remote server it didn't work. Then I found on php.net that the u modifier dates from PHP 4.3.5, so I upgraded PHP and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
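As a rough sketch of the same idea wrapped in a function: replaceWholePhrase() is a hypothetical name, and the sample sentence is made up for illustration. It assumes a PHP build where the u modifier also makes \b Unicode-aware, which is the case on current versions but, as noted in another answer, not guaranteed to behave identically everywhere.

```php
<?php
// Sketch only: replaceWholePhrase() is a hypothetical helper for illustration.
function replaceWholePhrase(string $phrase, string $replacement, string $text): string
{
    // preg_quote() escapes regex metacharacters in the phrase; the u modifier
    // makes PCRE treat both the pattern and the subject as UTF-8.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/u';

    $result = preg_replace($pattern, $replacement, $text);

    // preg_replace() returns null on error (for example, invalid UTF-8 input);
    // in that case this sketch simply hands back the original text.
    return $result ?? $text;
}

// Assumes this source file is saved as UTF-8; the sentence is sample data.
echo replaceWholePhrase('مرحبا بك', 'NEW', 'قال مرحبا بك اليوم'), "\n";
```

Falling back to the original text on error is just one possible choice; checking preg_last_error() instead would make failures visible.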
First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these patterns: you need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, find out its Unicode code point, and then you can use http://www.fileformat.info/info/unicode/ to figure out which category it is in. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when debugging with UTF-8 properties (don't forget to convert to hex before trying to look anything up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu, so L should match it, and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm; it is also a letter and indeed matches for me. Do you have the Unicode character tables compiled in?
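Both suggestions can be sketched roughly as follows. The function names are invented for illustration, and codePointsHex() is a hand-rolled stand-in for the utf8ToUnicode() plus dechex step, not the library's own code. It assumes PCRE with Unicode support and a source file saved as UTF-8.

```php
<?php
// Sketch only: both function names are hypothetical.

// Letters plus combining marks, apostrophe, hyphen, and space.
// The single-quoted pattern needs one backslash per escape.
function isPermissiveName(string $name): bool
{
    return preg_match('/^[\p{L}\p{M}\'\- ]+$/u', $name) === 1;
}

// Returns the code points of a UTF-8 string as lowercase hex strings,
// ready to look up on a site such as fileformat.info.
function codePointsHex(string $text): array
{
    // Split into characters; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return [];
    }

    $hex = [];
    foreach ($chars as $char) {
        $length = strlen($char);
        // A lead byte of an n-byte sequence carries its payload in the low bits.
        $codePoint = ($length === 1)
            ? ord($char)
            : (ord($char[0]) & (0xFF >> ($length + 1)));
        // Each continuation byte contributes six more bits.
        for ($i = 1; $i < $length; $i++) {
            $codePoint = ($codePoint << 6) | (ord($char[$i]) & 0x3F);
        }
        $hex[] = dechex($codePoint);
    }
    return $hex;
}

print_r(codePointsHex('Ă张'));
var_dump(isPermissiveName("Ame\u{0301}lie")); // "e" followed by a combining acute accent
```

The \p{M} part matters for decomposed input, where an accented letter arrives as a base letter followed by a separate combining mark that \p{L} alone does not cover.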