How to get an integer from a utf8mb4 character (😀; aka "\u{D83D}\u{DE00}") in PHP
I want to write a SASLprep algorithm in PHP (I know there is a library; I want to do it myself). One of my unit tests fails because the test vector "\u{D83D}\u{DE00}" (aka 😀) fails to convert to code points (an array of integers).
echo mb_ord("\u{D83D}\u{DE00}","UTF-32LE");
fails, returning false.
iconv("UTF-8","UTF-32LE","\u{D83D}\u{DE00}");
fails as well.
The expected result is 128512 (U+1F600).
Comments (1)
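First, let's analyze how PHP encodes the string. Dumping the bytes of "\u{D83D}\u{DE00}" shows that PHP encodes surrogate code points as WTF-8 (Wobbly Transformation Format, 8-bit): each surrogate becomes its own 3-byte sequence instead of the pair being combined into one 4-byte UTF-8 sequence.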
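As a minimal sketch of that analysis (not the answer's original snippet; the variable names are made up for illustration), the bytes PHP produces for the two spellings can be compared with bin2hex:

```php
<?php
// Surrogate pair written as two separate \u{} escapes.
// PHP encodes each escape on its own, giving two 3-byte sequences (WTF-8).
$surrogates = "\u{D83D}\u{DE00}";
echo bin2hex($surrogates), PHP_EOL; // eda0bdedb880

// The same character written as a single code point: valid 4-byte UTF-8.
$scalar = "\u{1F600}";
echo bin2hex($scalar), PHP_EOL;     // f09f9880
```

The first byte string is not valid UTF-8, which would explain why mb_ord and iconv reject it.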
Now we can write the following functions and combine them to get the desired number:
  function CodepointFromWTF8: decode from well-formed WTF-8 to code points;
  function CodepointFromSurrogates: decode from potentially ill-formed UTF-16 to code points.
The following formula should suffice for a well-formed UTF-16 surrogate pair:
  codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
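The two functions and the formula could be sketched as follows. The function names come from the description above, but the bodies are an assumption written for illustration, not the answer's original code. The sketch assumes well-formed WTF-8 input and does no validation.

```php
<?php
declare(strict_types=1);

// Decode a WTF-8 byte string into code points. Surrogates come out
// as separate values in 0xD800..0xDFFF. Assumes well-formed input.
function CodepointFromWTF8(string $bytes): array
{
    $out = [];
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i += $extra + 1) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {          // 0xxxxxxx: 1 byte
            $cp = $b;        $extra = 0;
        } elseif ($b < 0xE0) {    // 110xxxxx: 2 bytes
            $cp = $b & 0x1F; $extra = 1;
        } elseif ($b < 0xF0) {    // 1110xxxx: 3 bytes
            $cp = $b & 0x0F; $extra = 2;
        } else {                  // 11110xxx: 4 bytes
            $cp = $b & 0x07; $extra = 3;
        }
        // Append 6 payload bits from each continuation byte.
        for ($k = 1; $k <= $extra; $k++) {
            $cp = ($cp << 6) | (ord($bytes[$i + $k]) & 0x3F);
        }
        $out[] = $cp;
    }
    return $out;
}

// Combine high/low surrogate pairs into supplementary code points.
// Lone surrogates and all other values pass through unchanged.
function CodepointFromSurrogates(array $units): array
{
    $out = [];
    $n = count($units);
    for ($i = 0; $i < $n; $i++) {
        $high = $units[$i];
        $low = $units[$i + 1] ?? -1;
        if ($high >= 0xD800 && $high <= 0xDBFF && $low >= 0xDC00 && $low <= 0xDFFF) {
            $out[] = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
            $i++; // skip the low surrogate that was just consumed
        } else {
            $out[] = $high;
        }
    }
    return $out;
}

$codepoints = CodepointFromSurrogates(CodepointFromWTF8("\u{D83D}\u{DE00}"));
echo $codepoints[0], PHP_EOL; // 128512
```

For the test vector, high = 0xD83D and low = 0xDE00, so the formula gives 0x10000 + (0x3D << 10) + 0x200 = 0x1F600 = 128512.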
BTW: Tested using the sample string "ař\u{05FF}€????\u{D83D}\u{DE00}", where characters are as follows (column CodePoint contains the Unicode code point (U+hhhh) and the WTF-8 bytes, and column Description contains surrogates in parentheses, where applicable):
Edit
Here's the full simplified solution (I don't follow the Userland Naming Guide, arbitrarily mixing snake_case, camelCase and PascalCase rules, sorry):
Result:
71249878.php