Using two-byte code points in PHP
I am working on this code written by Jon and Mario. It works for the Hindi consonants (क - ह) but not for the vowels. One reason may be that I cannot supply two code points for the letter अः.
I am trying these code points for the range अ - अः:
// Used decimal number.
// Error - Fatal error: Allowed memory size of 134217728 bytes
foreach (range(2309, 23092307) as $char) {
$char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
$alphabets[$char] = ++$i;
}
print_r($alphabets);
I also tried this in the loop:
"foreach (range(0x0905, '0x0905 0x0903') as $char)"
I also tried this code:
// Output is Japanese/Chinese characters:
//
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
function uniord($u) {
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
for($char = uniord('अ'); $char <= uniord('अः'); ++$char) {
$alphabet[] = unichr($char);
}
print_r($alphabet);
It looks like the file encoding was part of the problem as well, because the code now returns this:
Array ( [0] => अ ) // only one line
I have tried both UTF-8 and UTF-16 document encodings.
Comments (1)
I think this is a big problem because there is no single Unicode code point (character) for अः. Instead, it is a sequence of two characters: अ (0x0905, decimal 2309) and ः (0x0903, decimal 2307). So the loop end point 23092307 in your first code sample is not valid: you have just concatenated the two decimal code points and treated the result as a single value.
Your second code sample produces only a single character because uniord() uses just the first of the two code points in अः, which is the same code point as अ.
Maybe you could look at a nested loop. Have your outer loop run over the base characters and your inner loop add the combining characters. Something like:
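A minimal sketch of such a nested loop, written for illustration. It assumes the base characters are the independent vowels अ (U+0905) through औ (U+0914) and that the combining signs of interest are anusvara ं (U+0902) and visarga ः (U+0903). Both ranges and the helper name codePointToUtf8 are hypothetical choices; the base range also includes a few letters, such as ऌ, that Hindi does not use, so you may want to narrow it. The helper reuses the html_entity_decode approach from the question.

```php
<?php
// Hypothetical helper: turn one Unicode code point into a UTF-8 string
// by decoding a decimal numeric character reference.
function codePointToUtf8(int $codePoint): string
{
    return html_entity_decode("&#$codePoint;", ENT_COMPAT, 'UTF-8');
}

// Assumed base range: independent vowels U+0905 (अ) to U+0914 (औ).
$bases = range(0x0905, 0x0914);

// Assumed combining signs: none, anusvara (U+0902), visarga (U+0903).
$signs = [null, 0x0902, 0x0903];

$alphabets = [];
$i = 0;

foreach ($bases as $base) {
    foreach ($signs as $sign) {
        // Each entry is one base code point, optionally followed by one sign.
        $char = codePointToUtf8($base);
        if ($sign !== null) {
            $char .= codePointToUtf8($sign);
        }
        $alphabets[$char] = ++$i;
    }
}

print_r($alphabets);
```

Because every entry is built from separate code points, अः is stored as the two-code-point string U+0905 U+0903 rather than as a single number, which avoids the huge range() that exhausted memory in the first sample.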