PHP: case-insensitive preg_replace of Cyrillic strings in UTF-8
I have a PHP 5.3 script displaying users of my web site and would like to replace a certain Russian city (stored in UTF8 in PostgreSQL 8.4.7 database + CentOS 5.5/64 bits Linux) by its older name (it is an insider joke):
preg_replace('/Волгоград/iu', 'Сталинград', $city);
Unfortunately this only works for exact matches: Волгоград.
This does not work for other cases, like ВОЛГОГРАД or волгоград.
If I modify my source code to
preg_replace('/[Вв]олгоград/iu', 'Сталинград', $city);
then it will catch the 2nd case above.
Does anybody know what is going on and how to fix it (assuming I don't want to write [Xx] for every letter)?
Thank you!
Alex
UPDATE:
# rpm -qa|grep php
php53-bcmath-5.3.3-1.el5
php53-gd-5.3.3-1.el5
php53-common-5.3.3-1.el5
php53-pdo-5.3.3-1.el5
php53-mbstring-5.3.3-1.el5
php53-xml-5.3.3-1.el5
php53-5.3.3-1.el5
php53-cli-5.3.3-1.el5
php53-pgsql-5.3.3-1.el5
# rpm -qa|grep pcre
pcre-6.6-2.el5_1.7
Comments (9)
I cannot reproduce your issue with PHP 5.3.3 (PHP 5.3.3-1ubuntu9.3 with Suhosin-Patch (cli)).
Which PCRE version is your PHP using? Check the pcre section of your
phpinfo()
output.
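A quick way to collect that information from the command line is sketched below. As far as I know, PCRE in UTF-8 mode only folds case for characters above 127 when it was built with Unicode property support, so the sketch also probes for that. The variable names are made up for illustration, and the results depend entirely on the local PCRE build.

```php
<?php
// PCRE_VERSION is a constant provided by PHP's pcre extension.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// Does /iu fold case for Cyrillic letters on this build?
$caseFoldWorks = preg_match('/волгоград/iu', 'ВОЛГОГРАД') === 1;

// \p{L} only compiles when PCRE has Unicode property support.
// The @ suppresses the compile warning on builds that lack it.
$ucpAvailable = @preg_match('/^\p{L}+$/u', 'Волгоград') === 1;

echo 'Case-insensitive Cyrillic match: ', $caseFoldWorks ? 'yes' : 'no', "\n";
echo 'Unicode property support: ', $ucpAvailable ? 'yes' : 'no', "\n";
```

If both lines print "no", the PCRE library is the likely culprit and not the pattern.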
You can skip the regex, it worked for me in PHP 5.2.11 :)
Output
This intrigued me, so I asked a question.
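One way to do this without any regex, assuming the mbstring extension is available, is to locate each match with mb_stripos and splice the replacement in. This is only a sketch and not necessarily what was posted in this answer; replace_ci_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: case-insensitive replace on UTF-8 strings using
// mbstring only. All positions and lengths are counted in characters.
function replace_ci_utf8($search, $replace, $subject)
{
    $searchLen = mb_strlen($search, 'UTF-8');
    if ($searchLen === 0) {
        return $subject; // nothing to search for
    }
    $subjectLen = mb_strlen($subject, 'UTF-8');
    $result = '';
    $offset = 0;
    while (($pos = mb_stripos($subject, $search, $offset, 'UTF-8')) !== false) {
        // Copy the text before the match, then the replacement.
        $result .= mb_substr($subject, $offset, $pos - $offset, 'UTF-8') . $replace;
        $offset = $pos + $searchLen;
    }
    // Append whatever follows the last match.
    return $result . mb_substr($subject, $offset, $subjectLen - $offset, 'UTF-8');
}

echo replace_ci_utf8('Волгоград', 'Сталинград', 'Город ВОЛГОГРАД, Россия'), "\n";
```

When $city holds nothing but the city name, comparing mb_strtolower($city, 'UTF-8') against 'волгоград' would be enough.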
This one solved the problem:
I copy+pasted your capital В. It is indeed the Cyrillic letter U+0412 (UTF-8 bytes D0 92), not the normal Latin B. But since they look so much alike (В and B), I believe the Russian letter is collated onto the Latin B at U+0042.
So either PHP is preformatting it, or maybe PCRE is somewhat inexact there too. Test your
print PCRE_VERSION;
and have a look at the changelog. Anyway, to evade the problem I would suggest you use only the lowercase letters. They are more likely to be distinct from the Latin alphabet.
P.S.: Evil inside joke!
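To see which letter a string really contains, dumping its bytes helps. A minimal sketch, with variable names chosen only for illustration:

```php
<?php
// Cyrillic capital Ve (U+0412) written as explicit UTF-8 bytes,
// so the result does not depend on the encoding of this source file.
$cyrillicVe = "\xD0\x92";
$latinB = 'B';

echo bin2hex($cyrillicVe), "\n"; // d092
echo bin2hex($latinB), "\n";     // 42

// The two look-alike letters are different strings.
var_dump($cyrillicVe === $latinB); // bool(false)
```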
Perhaps try: mb_eregi_replace
http://www.php.net/manual/en/function.mb-eregi-replace.php
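A minimal sketch of that suggestion, assuming mbstring was built with its regex functions and that the data is UTF-8; rename_city is a hypothetical wrapper used only for illustration:

```php
<?php
// Tell the mb_ereg* functions that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical wrapper around mb_eregi_replace (the "i" variant
// matches case-insensitively). Note that it takes no delimiters.
function rename_city($city)
{
    return mb_eregi_replace('Волгоград', 'Сталинград', $city);
}

foreach (array('Волгоград', 'ВОЛГОГРАД', 'волгоград', 'Москва') as $city) {
    echo rename_city($city), "\n";
}
```

Because this goes through mbstring's own regex engine and not PCRE, it should not be affected by an old PCRE build.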
Works like a charm on my box...
Output:
Are you sure that input data ($city) is in UTF8?
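A simple way to check that, assuming mbstring is loaded, is mb_check_encoding. The helper name is_valid_utf8 and the sample byte strings below are made up for illustration:

```php
<?php
// Hypothetical helper: true when the string is well-formed UTF-8.
function is_valid_utf8($s)
{
    return mb_check_encoding($s, 'UTF-8');
}

// "Волгоград" as UTF-8 bytes and as Windows-1251 bytes.
$utf8City = "\xD0\x92\xD0\xBE\xD0\xBB\xD0\xB3\xD0\xBE\xD0\xB3\xD1\x80\xD0\xB0\xD0\xB4";
$cp1251City = "\xC2\xEE\xEB\xE3\xEE\xE3\xF0\xE0\xE4";

var_dump(is_valid_utf8($utf8City));   // bool(true)
var_dump(is_valid_utf8($cp1251City)); // bool(false)
```

Running this check on the value straight from the database shows whether the /u modifier can work on it at all.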
Just guessing, but explicitly encoding the string to unicode may help:
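One possible reading of this suggestion is sketched below: convert the input to UTF-8 only when it is not already valid UTF-8. The source encoding Windows-1251 is purely an assumption here, and to_utf8 is a hypothetical helper, not the code from this answer.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from
// an ASSUMED source encoding (Windows-1251 is a guess for Russian data).
function to_utf8($s, $assumedSource = 'Windows-1251')
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', $assumedSource);
}

// "Волгоград" in Windows-1251 bytes, as it might arrive from a legacy source.
$city = to_utf8("\xC2\xEE\xEB\xE3\xEE\xE3\xF0\xE0\xE4");
echo preg_replace('/Волгоград/iu', 'Сталинград', $city), "\n";
```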
Actually, with PHP 5.2.x on Windows the answer marked as the solution did not work for me.
I had to convert to Windows-1251 to make it work.
Here is the example:
The example above will successfully (case-insensitively) replace 'гъз' with YYYYYY and give you back the UTF-8 version.
Regards!
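A rough sketch of that round trip, assuming the iconv extension is available: convert to Windows-1251, run a byte-oriented preg_replace without the u modifier, then convert back. The function name and the locale names are assumptions. Case folding for the single-byte Cyrillic letters depends on a matching LC_CTYPE locale being installed, so results will vary between systems.

```php
<?php
// Hypothetical helper illustrating the Windows-1251 round trip.
function replace_ci_via_cp1251($search, $replace, $subject)
{
    // Remember the current locale, then try a Cyrillic single-byte one.
    // Both locale names are guesses (Linux style, then Windows style).
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'ru_RU.CP1251', 'Russian_Russia.1251');

    $search1251  = iconv('UTF-8', 'Windows-1251', $search);
    $replace1251 = iconv('UTF-8', 'Windows-1251', $replace);
    $subject1251 = iconv('UTF-8', 'Windows-1251', $subject);

    // No /u here: the strings are single-byte now, and /i relies on
    // character tables built from the active locale.
    $pattern = '/' . preg_quote($search1251, '/') . '/i';
    $result1251 = preg_replace($pattern, $replace1251, $subject1251);

    if (is_string($previous)) {
        setlocale(LC_CTYPE, $previous); // restore the original locale
    }
    return iconv('Windows-1251', 'UTF-8', $result1251);
}

echo replace_ci_via_cp1251('гъз', 'YYYYYY', 'abc гъз'), "\n";
```

Characters that do not exist in Windows-1251 would be lost or rejected by iconv, so this only suits text known to be Cyrillic plus ASCII.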
For those who support a huge legacy code base, struggle with charset and encoding issues, and have no option to convert the code's charset, here's an answer:
While you could rewrite your regexps to use some UTF tricks or convert your code to UTF, that is not an option when you work with a huge codebase, database, etc.