Why does PHP's preg_split split the Hebrew letter "נ" when splitting UTF-8 text on "\s"?

Posted 2024-10-03 03:20:44 · 543 words · 3 views · 0 comments

This doesn't work; it turns the string into gibberish:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));

Array ( [0] => � [1] => )

But this works:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));

נ

The problem is only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?


Comments (1)

Smile简单爱 2024-10-10 03:20:44

When working with UTF-8 data, always use the u modifier in your patterns:

/\s/u

Otherwise, neither the pattern nor the subject string is interpreted as UTF-8.
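A minimal sketch of the suggested fix, assuming PHP 7 or later (for the "\u{...}" string escape) and a PCRE build with UTF-8 support. The variable names are only illustrative:

```php
<?php
// U+05E0 (the letter nun), written as an escape so the
// encoding of the source file does not matter.
$nun = "\u{05E0}";

// With the u modifier, pattern and subject are treated as UTF-8,
// so the byte 0xA0 inside the letter is never examined on its own.
$single = preg_split('/\s/u', $nun);

// Real whitespace still splits the string as expected.
$words = preg_split('/\s/u', $nun . ' ' . $nun);

print_r($single);
print_r($words);
```

The first call should return the letter intact as the only element, and the second should return two elements.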

In this case, the character נ (U+05E0) is encoded in UTF-8 as the two bytes 0xD7 0xA0, and \s represents any whitespace character (according to PCRE):

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
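The byte-level encoding can be checked directly. The sketch below also walks the Hebrew letters U+05D0 to U+05EA to see which of them contain the byte 0xA0 in their UTF-8 form, which would explain why only "נ" is affected. It assumes PHP 7.2 or later with the mbstring extension (for mb_chr); the variable names are illustrative.

```php
<?php
// UTF-8 bytes of the letter nun (U+05E0) and of the non-breaking space (U+00A0).
$nunHex  = bin2hex("\u{05E0}");
$nbspHex = bin2hex("\u{00A0}");
echo $nunHex, "\n";   // d7a0
echo $nbspHex, "\n";  // c2a0

// Collect every Hebrew letter whose UTF-8 encoding contains the byte 0xA0.
$hits = [];
for ($cp = 0x05D0; $cp <= 0x05EA; $cp++) {
    $ch = mb_chr($cp, 'UTF-8');
    if (strpos($ch, "\xA0") !== false) {
        $hits[] = sprintf('U+%04X', $cp);
    }
}
print_r($hits);
```

Only U+05E0 is reported, because the letters in this range encode as 0xD7 followed by a byte from 0x90 to 0xAA, and 0xA0 appears as that second byte only for "נ".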

When UTF-8 support was added, a special option called PCRE_UCP was added as well, so that \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters according to their Unicode properties:

By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. […] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

  • \d any character that \p{Nd} matches (decimal digit)
  • \s any character that \p{Z} matches, plus HT, LF, FF, CR
  • \w any character that \p{L} or \p{N} matches, plus underscore

The non-breaking space U+00A0 has the separator property (\p{Z}).

So although your pattern is not in UTF-8 mode, it seems that \s does match the byte 0xA0 in the UTF-8 sequence 0xD7 0xA0, splitting the string at that position and returning an array equivalent to array("\xD7", "").
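The resulting array can be reproduced deterministically by splitting on the byte 0xA0 explicitly. This is only a simulation of the suspected behaviour, not the original \s pattern, since whether \s matches 0xA0 without the u modifier may depend on the PHP/PCRE build and environment:

```php
<?php
// The two UTF-8 bytes of the letter nun, written as raw bytes.
$bytes = "\xD7\xA0";

// Without the u modifier the subject is processed byte by byte,
// so the pattern \xA0 matches the second byte of the letter.
$parts = preg_split('/\xA0/', $bytes);

// Show the pieces as hex, since a lone 0xD7 is not valid UTF-8.
print_r(array_map('bin2hex', $parts));
```

The first element is the orphaned lead byte 0xD7, which is what shows up as "�" in the question's output, and the second is an empty string.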

That looks like a bug: the pattern is not in UTF-8 mode, yet \s matches 0xA0, a byte value greater than 0x80 (and in UTF-8, U+00A0 itself would be encoded as 0xC2 0xA0, not as a lone 0xA0). Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related.
