Working with two-byte code points in PHP

Posted on 2024-12-11 13:34:23

I am working on this code written by Jon and Mario. It works for the Hindi consonants (क - ह) but not for the vowels. One reason may be that I cannot supply two code points for the letter अः.

I am trying these codes for the range अ - अः:

// Used decimal number. 
// Error - Fatal error: Allowed memory size of 134217728 bytes
foreach (range(2309, 23092307) as $char) {

    $char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$char] = ++$i;
}

print_r($alphabets);

I also tried this in the loop:
"foreach (range(0x0905, '0x0905 0x0903') as $char)"

Also, this code:

// Output is Japanese/Chinese characters:
// 
function unichr($intval) {
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

function uniord($u) {
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}

for($char = uniord('अ'); $char <= uniord('अः'); ++$char) {
    $alphabet[] = unichr($char);
}

print_r($alphabet);

It looks like the file encoding was also involved. The code now returns this:
Array ( [0] => अ ) // only one line
I have tried both UTF-8 and UTF-16 document encoding.

荒人说梦 2024-12-18 13:34:23

I think the main problem is that there is no single Unicode code point (character) for अः. Instead it is a sequence of two characters: अ (0x0905, decimal 2309) and ः (0x0903, decimal 2307).
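To see the two code points for yourself, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escape, which avoids any dependence on how the source file is saved) and the mbstring extension; code_points() is a hypothetical helper name, not something from the original code.

```php
<?php
// Hypothetical helper: list the Unicode code points in a UTF-8 string.
function code_points($utf8)
{
    // UCS-4BE stores every code point as a 4-byte big-endian integer,
    // so unpack('N*') yields one integer per code point.
    $ucs4 = mb_convert_encoding($utf8, 'UCS-4BE', 'UTF-8');
    return array_values(unpack('N*', $ucs4));
}

$visarga_a = "\u{0905}\u{0903}"; // अ followed by ः, displayed as अः

foreach (code_points($visarga_a) as $cp) {
    printf("U+%04X (decimal %d)\n", $cp, $cp);
}
// U+0905 (decimal 2309)
// U+0903 (decimal 2307)
```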

So the loop end point of 23092307 in your first code sample is not valid. All you have done there is write the two code points side by side and treat the result as a single value.
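A quick bit of arithmetic shows why that end point runs out of memory. This is only an illustration and deliberately does not call range() with these values:

```php
<?php
$start = 2309;      // U+0905, अ
$end   = 23092307;  // "2309" and "2307" written side by side, read as one integer

// Number of elements range($start, $end) would have to allocate.
$count = $end - $start + 1;
echo $count, "\n"; // 23089999

// The highest valid Unicode code point is U+10FFFF (decimal 1114111),
// so most of that range is not even a code point.
var_dump($end > 0x10FFFF); // bool(true)
```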

Your second code sample produces only a single character because uniord() reads just the first of the two code points in अः, which is the same code point as अ. The loop therefore starts and ends at the same value.

Maybe you could use a nested loop: have the outer loop run over the base characters and the inner loop append the combining characters. Something like:
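This can be checked directly with the uniord() function from the question, copied unchanged. The sketch assumes PHP 7 or later and the mbstring extension:

```php
<?php
// uniord() exactly as posted in the question.
function uniord($u) {
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}

$start = uniord("\u{0905}");         // अ
$end   = uniord("\u{0905}\u{0903}"); // अः: only the first two bytes are read

printf("%d %d\n", $start, $end); // 2309 2309
var_dump($start === $end);       // bool(true)
```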

<?php
$i = 0;
$alphabets = array();
// Outer loop: base letters from U+0905 (अ) through U+0938 (स).
foreach (range(0x0905, 0x0938) as $char)
{
    $base = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$base] = ++$i;
    // Inner loop: append each combining sign U+0901 to U+0903 (ँ, ं, ः).
    foreach (range(0x0901, 0x0903) as $combine)
    {
        $txt = $base . html_entity_decode("&#$combine;", ENT_COMPAT, "UTF-8");
        $alphabets[$txt] = ++$i;
    }
}
print_r($alphabets);
?>
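If all you need is the vowel list from the question (अ through अः), a narrower variant might be enough. This is a sketch under my own assumptions: it takes the independent vowel letters U+0905 to U+0914 and then appends अं and अः as two-code-point strings. cp_to_utf8() is a hypothetical helper wrapping the same html_entity_decode() call as above.

```php
<?php
// Hypothetical helper: turn one code point into a UTF-8 string.
function cp_to_utf8($cp)
{
    return html_entity_decode("&#$cp;", ENT_COMPAT, "UTF-8");
}

$vowels = array();

// Independent vowel letters U+0905 (अ) through U+0914 (औ).
foreach (range(0x0905, 0x0914) as $cp) {
    $vowels[] = cp_to_utf8($cp);
}

// अं and अः: the base letter अ followed by U+0902 or U+0903.
foreach (array(0x0902, 0x0903) as $sign) {
    $vowels[] = cp_to_utf8(0x0905) . cp_to_utf8($sign);
}

echo implode(' ', $vowels), "\n";
```

Note that the U+0905 to U+0914 block also contains letters such as ऌ and ऎ that a Hindi alphabet listing may not want, so you may need to filter the result.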
