preg_replace, character escapes and accented characters: /u works on one server but not on another

Posted on 2024-12-05 15:06:37

I have the following code:

 preg_replace('/[^\w-]/u','.','Bréánná MÓÚLÍN');

Which on server A (PHP 5.3.5) returns:
"Bréánná.Móúlín" (as it should)

However, on server B (PHP 5.2.11) it returns:
"Br..n..M..l.n" (not what I want at all)

Am I right in thinking that this is down to whether or not PCRE_UCP was set when the whole thing was compiled?

Is there any way of overriding this if this is the case?

Failing that, is there any way of easily replacing such characters with a 'standard' equivalent? (Like utf8_decode but more expansive)

最近可好 2024-12-12 15:06:37

I am not sure whether PCRE_UCP defined during compilation affects preg_replace(), but a work-around to your problem is to use the multibyte string function mb_ereg_replace():

<?php
// Tell mbstring that both the subject strings and the regex patterns are UTF-8.
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

echo mb_ereg_replace('[^0-9A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�̀-ͯ‿-⁀\\-]','.','Bréánná MÓÚLÍN');

PHP 5.2 results: http://codepad.viper-7.com/UnZeyf
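
If listing ranges by hand is too unwieldy, another possible workaround is to keep preg_replace() but name the Unicode categories explicitly instead of relying on \w. The sketch below assumes the PCRE library on the older server was built with UTF-8 and Unicode property support (this is not confirmed anywhere in the thread), and the function name is a hypothetical helper introduced only for illustration. Because \p{L} and \p{N} are matched by property, the result should not depend on whether \w is Unicode-aware.

```php
<?php
// Hypothetical helper; not part of the original question or answer.
// Assumes PCRE was compiled with UTF-8 and Unicode property support.
function replace_non_word_chars($text, $replacement = '.')
{
    // \p{L} = any letter, \p{N} = any number; "_" and "-" are kept literally.
    // preg_replace() returns NULL if $text is not valid UTF-8.
    return preg_replace('/[^\p{L}\p{N}_-]/u', $replacement, $text);
}

echo replace_non_word_chars('Bréánná MÓÚLÍN'), "\n"; // Bréánná.MÓÚLÍN
```

The source file has to be saved as UTF-8 for the literal test string to be interpreted correctly. Note that this character class is close to, but not exactly the same as, what \w matches when Unicode properties are enabled.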

EDIT: I originally thought that the multibyte ereg functions supported Unicode character type escapes, but this turns out not to be true. Instead, you need to determine the ranges of characters that you consider "letters". I used the character ranges from the XML Standard's definition of NameChar with the following Java program to generate the RegExp string (as apparently the multibyte ereg functions do not support Unicode character escape sequences, either):

import java.io.*;

public class SO7456963 {
    public static void main(String[] args) throws IOException {
        // Write as UTF-8 so the literal characters can be pasted into the PHP source.
        try (Writer w = new OutputStreamWriter(new FileOutputStream("SO7456963.txt"), "UTF-8")) {
            // Ranges taken from the XML NameChar definition; the compiler expands the Unicode escapes.
            w.write("[^0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u0300-\u036F\u203F-\u2040\\\\-]");
        }
    }
}
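
The last part of the question, replacing accented characters with a "standard" equivalent, is not addressed above. One minimal sketch is an explicit lookup table applied with strtr(). The function name and the table are hypothetical and deliberately tiny: they cover only the acute-accented vowels from the example and assume precomposed (NFC) input, so a real table would need to be much larger.

```php
<?php
// Hypothetical helper with a deliberately small mapping (acute-accented vowels only).
function fold_accents($text)
{
    static $map = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
    );
    // strtr() with an array replaces whole byte sequences,
    // so multi-byte UTF-8 keys are matched as complete characters.
    return strtr($text, $map);
}

$ascii = fold_accents('Bréánná MÓÚLÍN');           // "Breanna MOULIN"
echo preg_replace('/[^\w-]/', '.', $ascii), "\n";  // Breanna.MOULIN
```

Once the text is plain ASCII, the original pattern works without the /u modifier, so the behaviour no longer depends on how PCRE was built. iconv() with the //TRANSLIT suffix is another option, but its output depends on the platform's iconv implementation and locale, so it may not give the same result on both servers.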
