有谁知道下面的代码可能做什么？

发布于 2024-12-08 15:28:04 字数 1226 浏览 0 评论 0原文

/* utf-8: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc */
static unsigned char _mblen_table_utf8[] = 
{
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
};

我敢打赌这与编码有关，

但它到底是如何工作的呢？

更新

        while (str < ptr)
        {
            j = mblen[(*str)];
            tree_nput(r->tree, cr, sizeof(struct rule_item), str, j);
            str += j;
        }
    }

原文

/* utf-8: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc */
static unsigned char _mblen_table_utf8[] = 
{
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
};

I bet it has something to do with the encodings,

but how exactly it works?

UPDATE

        while (str < ptr)
        {
            j = mblen[(*str)];
            tree_nput(r->tree, cr, sizeof(struct rule_item), str, j);
            str += j;
        }
    }

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

空城之時有危險 2024-12-15 15:28:04

由于多字节字符串中的字符具有可变长度，因此该表将每个字符映射到一个长度。

最后 64 个字符比 1 个字节宽，长度为 2 到 6。

用法是这样的：

unsigned char current_char = *mbstr;

for (i = 0; i < _mblen_table_utf8[current_char]; i++) {
  /* treat *mbstr++ as a part of the current character */
}

Because a character in a multibyte string has a variable length, this table maps each character to a length.

The last 64 characters are wider than one byte, having lengths of 2 to 6.

The usage would be something like that:

unsigned char current_char = *mbstr;

for (i = 0; i < _mblen_table_utf8[current_char]; i++) {
  /* treat *mbstr++ as a part of the current character */
}

回复收藏 0 原文

姐不稀罕 2024-12-15 15:28:04

从历史上看，每个字符都用 7 位（后来是 8 位）进行编码，这足以对欧洲语言字母表进行编码。

只有前 128 个字符对每个人来说都是通用的，其余 128 个通过代码页进行标准化 (ISO-8859 -1 是一个例子）。

对较长字母语言（例如中文）进行编码的需要导致了 Unicode 工作，每个字符都编码在多个字节上。

UTF-8 是一种以高效、可变代码长度的方式对 Unicode 字符进行编码的方法。这意味着您读取的第一个字节决定了字符字节序列的长度。

基本上，您的表是一个查找表，用于检查从用作表索引的字节开始的字符有多少字节。您将在此处看到此表的另一个版本以及说明。

我添加了表索引作为注释以使其更清晰：

/* utf-8: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc */
static unsigned char _mblen_table_utf8[] = 
{
/*0x00*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x10*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x20*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x30*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x40*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x50*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x60*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x70*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x80*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x90*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xA0*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xB0*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xC0*/    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/*0xD0*/    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/*0xE0*/    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
/*0xF0*/    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
};

Historically, each character was coded on 7 bits (then 8 bits) which was more than enough to encode european languages alphabets.

Only the 128 first characters were common to everyone, the remaining 128 were standardized through codepages (ISO-8859-1 is an example).

The need to encode longer alphabet languages such as Chinese resulted in the Unicode effort were each character is coded on several bytes.

UTF-8 is a way to encode Unicode characters in an efficient, variable code-length way. This means that the first byte you read determines the length of the character byte-sequence.

Basically, your table is a lookup-table to check how many bytes is a character that start from the byte you use as table index. You will see another version of this table here with explanations.

I added the table indexes as comments to make it clearer:

/* utf-8: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc */
static unsigned char _mblen_table_utf8[] = 
{
/*0x00*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x10*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x20*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x30*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x40*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x50*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x60*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x70*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x80*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0x90*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xA0*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xB0*/    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*0xC0*/    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/*0xD0*/    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/*0xE0*/    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
/*0xF0*/    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
};

回复收藏 0 原文