Why does PHP's preg_split split the Hebrew letter "נ" in UTF-8 when splitting on "\s"?
This doesn't work; it turns the letter into gibberish:
$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));
Array ( [0] => � [1] => )
But this works:
$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));
נ
The problem occurs only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?
Comments (1)
When working with UTF-8 data, always use the u modifier in your patterns, because otherwise the pattern is not interpreted as UTF-8.
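As a minimal sketch of that advice, the snippet below splits a string on whitespace with the u modifier. The sample input and variable names are made up for illustration, and the "\u{05E0}" escape assumes PHP 7 or later.

```php
<?php
// "\u{05E0}" is PHP 7+ syntax for the UTF-8 encoding of נ (U+05E0).
$nun = "\u{05E0}";

// Hypothetical sample input: three copies of נ separated by a space and a tab.
$text = $nun . ' ' . $nun . "\t" . $nun;

// With the u modifier, pattern and subject are treated as UTF-8, so \s is
// tested against whole code points instead of individual bytes.
$parts = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

if ($parts === false) {
    // preg_split() returns false on failure, e.g. when the subject is not valid UTF-8.
    throw new RuntimeException('preg_split failed, error code ' . preg_last_error());
}

print_r($parts); // three elements, each the intact two-byte letter
```

Checking for false is worth keeping when the u modifier is used, since a subject that is not valid UTF-8 makes the call fail rather than return a partial result.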
In this case, the character נ (U+05E0) is encoded as the two bytes 0xD7 0xA0 in UTF-8, and \s represents any whitespace character as defined by PCRE. When UTF-8 support was added, a special option called PCRE_UCP was also added so that \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters according to their Unicode properties. The no-break space U+00A0 has the separator property (\p{Z}).
So although your pattern is not in UTF-8 mode, it seems that \s does match the 0xA0 byte in the UTF-8 sequence 0xD7 0xA0, splitting the string at that position and returning an array equivalent to array("\xD7", ""). Among the Hebrew letters (U+05D0 to U+05EA), נ is the only one whose UTF-8 encoding ends in 0xA0, which is why the other letters are unaffected.
That's obviously a bug, as the pattern is not in UTF-8 mode but 0xA0 is greater than 0x80 (additionally, U+00A0 would be encoded as 0xC2 0xA0 in UTF-8). Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related to this.
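The byte-level reasoning can be checked with a short sketch. It only inspects encodings and does not try to reproduce the bug itself, since whether \s matches the lone byte 0xA0 without the u modifier may depend on the PHP and PCRE build. The helper utf8TwoByte() is hypothetical and written just for this illustration.

```php
<?php
// Hypothetical helper: UTF-8 encode a code point in the two-byte range U+0080..U+07FF.
function utf8TwoByte(int $cp): string
{
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8TwoByte(0x05E0)), "\n"; // נ (U+05E0)            -> d7a0
echo bin2hex(utf8TwoByte(0x00A0)), "\n"; // no-break space U+00A0 -> c2a0

// Scan the Hebrew letters U+05D0..U+05EA for an encoding that ends in 0xA0.
$hits = [];
for ($cp = 0x05D0; $cp <= 0x05EA; $cp++) {
    if (substr(utf8TwoByte($cp), -1) === "\xA0") {
        $hits[] = sprintf('U+%04X', $cp);
    }
}

print_r($hits); // only U+05E0
```

Both sequences end in the byte 0xA0, so a matcher that works byte by byte and treats 0xA0 as whitespace would cut נ after 0xD7 and leave every other Hebrew letter alone.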