Why does PHP's preg_split split the Hebrew letter "נ" in UTF-8 when splitting on "\s"?

Posted 2024-10-03 03:20:44

This doesn't work; it turns the letter into gibberish:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));

Array ( [0] => � [1] => )

But this works:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));

נ

The problem is only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?

Smile简单爱 2024-10-10 03:20:44

When working with UTF-8 data, always use the u modifier in your patterns:

/\s/u

Otherwise neither the pattern nor the subject string is interpreted as UTF-8.
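A minimal sketch of what the modifier changes, assuming PHP 7.0 or later (for the "\u{...}" escape) and a PCRE build with Unicode support. The variable names are only for illustration:

```php
<?php
// "\u{05E0}" is the letter נ; in UTF-8 it is the two bytes D7 A0.
$bar = "\u{05E0}";

// With the u modifier, pattern and subject are treated as UTF-8,
// so \s is tested against whole code points, never single bytes.
$parts = preg_split('/\s/u', $bar);

echo count($parts), "\n";      // 1
echo bin2hex($parts[0]), "\n"; // d7a0
```

The string contains no whitespace, so the split returns it as a single, intact element.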

In this case, the character נ (U+05E0) is encoded in UTF-8 as the two bytes 0xD7 0xA0, and \s matches any whitespace character. According to the PCRE documentation:

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).

PCRE also has a special option called PCRE_UCP that makes \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters, based on their Unicode properties:

By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. […] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore

The non-breaking space U+00A0 has the separator property (\p{Z}).
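As a rough check of that property, the following sketch tests U+00A0 and U+05E0 against \p{Z} in UTF-8 mode. It assumes PHP 7.0 or later and a PCRE build with Unicode property support:

```php
<?php
$nbsp = "\u{00A0}"; // non-breaking space
$nun  = "\u{05E0}"; // Hebrew letter נ

echo bin2hex($nbsp), "\n";                // c2a0
echo preg_match('/\p{Z}/u', $nbsp), "\n"; // 1: U+00A0 is a separator
echo preg_match('/\p{Z}/u', $nun), "\n";  // 0: נ is a letter, not a separator
```

Note that a real non-breaking space is the two bytes C2 A0 in UTF-8, whereas נ merely contains the byte A0 as its second byte.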

So although your pattern is not in UTF-8 mode, it seems that \s matches the byte 0xA0 in the UTF-8 sequence 0xD7 0xA0, splitting the string at that position and returning an array equivalent to array("\xD7", "").
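The byte-level split can be reproduced deterministically by matching the byte 0xA0 explicitly instead of relying on \s, whose treatment of bytes above 0x7F may depend on the PCRE build and locale. This is only an illustration of the mechanism, assuming PHP 7.0 or later:

```php
<?php
$bar = "\u{05E0}"; // נ, UTF-8 bytes D7 A0

// No u modifier: the subject is scanned byte by byte,
// so \xA0 matches the second byte of the letter.
$parts = preg_split('/\xA0/', $bar);

echo count($parts), "\n";      // 2
echo bin2hex($parts[0]), "\n"; // d7
var_dump($parts[1] === '');    // bool(true)
```

The lone 0xD7 byte is not valid UTF-8 on its own, which is why it prints as the replacement character �.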

That's obviously a bug: the pattern is not in UTF-8 mode, yet 0xA0 is greater than 0x80, and in UTF-8 the character U+00A0 would be encoded as 0xC2 0xA0 anyway. Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related to this.
