PHP: case-insensitive preg_replace for UTF-8 strings containing Cyrillic letters

Posted on 2024-10-27 04:36:26

I have a PHP 5.3 script displaying users of my web site and would like to replace a certain Russian city (stored in UTF8 in PostgreSQL 8.4.7 database + CentOS 5.5/64 bits Linux) by its older name (it is an insider joke):

preg_replace('/Волгоград/iu', 'Сталинград', $city);

Unfortunately this only works for exact matches: Волгоград.

This does not work for other cases, like ВОЛГОГРАД or волгоград.

If I modify my source code to

preg_replace('/[Вв]олгоград/iu', 'Сталинград', $city);

then it will catch the 2nd case above.

Does anybody know what is going on and how to fix it (assuming I don't want to write [Xx] for every letter)?

Thank you!
Alex

UPDATE:

# rpm -qa|grep php
php53-bcmath-5.3.3-1.el5
php53-gd-5.3.3-1.el5
php53-common-5.3.3-1.el5
php53-pdo-5.3.3-1.el5
php53-mbstring-5.3.3-1.el5
php53-xml-5.3.3-1.el5
php53-5.3.3-1.el5
php53-cli-5.3.3-1.el5
php53-pgsql-5.3.3-1.el5

# rpm -qa|grep pcre
pcre-6.6-2.el5_1.7

Comments (9)

煞人兵器 2024-11-03 04:36:26

I cannot reproduce your issue with a PHP 5.3.3 (PHP 5.3.3-1ubuntu9.3 with Suhosin-Patch (cli)):

$str1 = 'Волгоград';
$str2 = 'ВОЛГОГРАД';
$str3 = 'волгоград';

var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str1));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str2));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str3));

outputs

string(20) "Сталинград"
string(20) "Сталинград"
string(20) "Сталинград"

Which PCRE version is your PHP using? Check the pcre section of your phpinfo() output. This is what my system reports:

...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19
...
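
The same check can be done from a script instead of reading phpinfo(). The sketch below prints the PCRE library version PHP was built against and tests whether the `/iu` combination actually folds Cyrillic case on this installation. The function name `pcre_folds_cyrillic` is made up for illustration, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// PCRE_VERSION is a built-in constant holding the PCRE library version string.
echo 'PCRE library: ', PCRE_VERSION, "\n";

// Hypothetical helper: true if /iu matches a lower-case Cyrillic pattern
// against its upper-case counterpart on this PHP/PCRE build.
function pcre_folds_cyrillic(): bool
{
    return preg_match('/^в$/iu', 'В') === 1;
}

var_dump(pcre_folds_cyrillic());
```

If the check prints `bool(false)`, the PCRE build itself is the likely culprit rather than the pattern or the data.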
自我难过 2024-11-03 04:36:26

You can skip the regex, it worked for me in PHP 5.2.11 :)

$city = 'Unfortunately this only works for exact matches: Волгоград.

This does not work for other cases, like ВОЛГОГРАД or волгоград.';

echo str_ireplace('Волгоград', '[found]', $city);

Output

"Unfortunately this only works for exact matches: [found].

This does not work for other cases, like [found] or [found]."

This intrigued me, so I asked a question.
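
A caveat for readers on newer PHP versions: str_ireplace() does its case folding byte by byte, and since PHP 8.2 it folds ASCII letters only, so the result above may not carry over. Below is a rough multibyte-aware alternative built on mb_stripos(). The helper `mb_str_ireplace_sketch` is not a PHP built-in, only an illustration, and it assumes PHP 7+ with the mbstring extension and UTF-8 input.

```php
<?php
// Hypothetical helper (not a built-in): case-insensitive replace for UTF-8 text.
function mb_str_ireplace_sketch(string $search, string $replace, string $subject): string
{
    $searchLen = mb_strlen($search, 'UTF-8');
    if ($searchLen === 0) {
        return $subject; // nothing to search for
    }

    $result = '';
    $offset = 0; // character offset, not byte offset

    // Find each case-insensitive occurrence and copy the text between matches.
    while (($pos = mb_stripos($subject, $search, $offset, 'UTF-8')) !== false) {
        $result .= mb_substr($subject, $offset, $pos - $offset, 'UTF-8') . $replace;
        $offset = $pos + $searchLen;
    }

    // Append whatever follows the last match.
    return $result . mb_substr($subject, $offset, null, 'UTF-8');
}

echo mb_str_ireplace_sketch('Волгоград', '[found]', 'ВОЛГОГРАД or волгоград'), "\n";
// prints: [found] or [found]
```

The sketch assumes that case folding does not change the character count of the matched text, which holds for Cyrillic letters.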

你丑哭了我 2024-11-03 04:36:26

This one solved the problem:

setlocale(LC_ALL, 'ru_RU.CP1251', 'rus_RUS.CP1251', 'Russian_Russia.1251');
拥抱我好吗 2024-11-03 04:36:26

I copy+pasted your capital В. It is indeed the Cyrillic letter U+0412 (UTF-8 bytes D0 92), not the normal Latin B. But since they look so much alike (В vs. B), I believe the Russian letter is being collated onto the Latin B, U+0042.

So either PHP is preformatting it, or maybe PCRE is somewhat inexact there too. Check the output of print PCRE_VERSION; and have a look at the changelog.

Anyway, to evade the problem I would suggest you only use the lowercase letters. They are more likely to be distinct from the Latin alphabet.

preg_replace('/волгоград/iu', 'Сталинград', $city);

P.S.: Evil inside joke!

念﹏祤嫣 2024-11-03 04:36:26

Perhaps try: mb_eregi_replace
http://www.php.net/manual/en/function.mb-eregi-replace.php

mb_eregi_replace — Replace regular expression with multibyte support ignoring case
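
A minimal sketch of how this suggestion might be applied to the city name from the question. It assumes the mbstring extension was built with its regex functions enabled and that the data is UTF-8; the wrapper name `replace_city_mb` is made up for illustration.

```php
<?php
// Tell the mb_ereg* family that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical wrapper around mb_eregi_replace() for the city from the question.
function replace_city_mb(string $city): string
{
    // mb_eregi_replace() takes the pattern without delimiters or modifiers.
    return mb_eregi_replace('Волгоград', 'Сталинград', $city);
}

foreach (['Волгоград', 'ВОЛГОГРАД', 'волгоград'] as $city) {
    echo replace_city_mb($city), "\n";
}
```

Note that the mb_ereg functions use a different regex engine from the preg functions, so more complex patterns may need adjusting.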

为你拒绝所有暧昧 2024-11-03 04:36:26

Works like a charm on my box...

<?php
    $city = 'Волгоград';
    var_dump(preg_match('/волгоград/ui', $city));
    var_dump(preg_match('/ВОЛГОГРАД/ui', $city));
    var_dump(preg_replace('/волгоград/ui', 'Сталинград', $city));
    var_dump(preg_replace('/ВОЛГОГРАД/ui', 'Сталинград', $city));

Output:

int 1
int 1
string 'Сталинград' (length=20)
string 'Сталинград' (length=20)

Are you sure that input data ($city) is in UTF8?
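
One way to answer that question from code is sketched here, assuming the mbstring extension is available. The helper name `looks_like_utf8` is made up for illustration, and the second sample string is the same city name written as raw Windows-1251 bytes.

```php
<?php
// Hypothetical helper: reports whether $s is valid UTF-8 using two independent checks.
function looks_like_utf8(string $s): bool
{
    // mb_check_encoding() validates the byte sequence against the named encoding.
    $mbOk = mb_check_encoding($s, 'UTF-8');

    // With the /u modifier, preg_match() returns false for a malformed UTF-8 subject.
    $pcreOk = preg_match('//u', $s) !== false;

    return $mbOk && $pcreOk;
}

$utf8   = 'Волгоград';                            // UTF-8, as in this file
$cp1251 = "\xC2\xEE\xEB\xE3\xEE\xE3\xF0\xE0\xE4"; // the same word in Windows-1251 bytes

var_dump(looks_like_utf8($utf8));   // bool(true)
var_dump(looks_like_utf8($cp1251)); // bool(false)
```

Running the check on a value fetched from the database, rather than on a literal in the script, shows what encoding the connection actually delivers.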

幽蝶幻影 2024-11-03 04:36:26

Just guessing, but explicitly encoding the string to unicode may help:

preg_replace('/Волгоград/iu', utf8_encode('Сталинград'), $city);
最舍不得你 2024-11-03 04:36:26

Actually, with PHP 5.2.x on Windows, the answer marked as the solution did not work for me.

I had to convert everything to Windows-1251 to make it work.

Here is the example:

$new_content = preg_replace(iconv('UTF-8', 'Windows-1251', "/\bгъз\b/i"), iconv('UTF-8', 'Windows-1251', "YYYYYY"), iconv('UTF-8', 'Windows-1251', "ти си gyz gyz гъз ГЪЗ gyzgyz гЪз gyz"));
$new_content = iconv('Windows-1251', 'UTF-8', $new_content);

The example above successfully replaces 'гъз' (case-insensitively) with YYYYYY and gives you back the UTF-8 version.

Regards!

素食主义者 2024-11-03 04:36:26

For those who support a huge legacy code base, struggle with charset and encoding issues, and have no option to convert the code's charset, here's an answer:

//for 
setlocale(LC_ALL, 'ru_RU.cp1251');  
//(or any other locale) to take effect, 
//you MUST generate system locale, i.e.

sudo su
#view supported locales
#less /usr/share/i18n/SUPPORTED
echo "ru_RU.cp1251 CP1251" >> /var/lib/locales/supported.d/local
dpkg-reconfigure locales
exit

#and (for ubuntu/debian)

apt-get install php5-intl

While you could rewrite your regexps to use some UTF tricks or convert your code to UTF-8, that is not an option when you work with a huge codebase, database, etc.
