How do I add UTF-8 support and an associated font table to an embedded project?

Posted on 2024-07-15 12:18:39

I am currently designing a font engine for an embedded display. The basic problem is the following:

I need to take a dynamically generated text string, look up the values from that string in a UTF-8 table, then use the table to point to the compressed bitmap array of all the supported characters. After that is complete, I call a bitcopy routine that moves the data from the bitmap array to the display.

I will not be supporting the full Unicode character set, as I have very limited system resources to work with (32K ROM, 8K RAM), but I want to be able to add the needed glyphs later on for localization purposes. All development is being done in C and assembly.

The glyph size is a maximum of 16 pixels wide by 16 pixels tall (1 bit per pixel). We will probably need to support the whole of the Basic Multilingual Plane (up to 3 bytes per character in UTF-8), as some of our larger customers are in Asia. However, we would not be including the whole table in any specific localization.

My question is this:
What is the best way to add this UTF-8 support and associated table?

Comments (2)

鸩远一方 2024-07-22 12:18:39

The solution below assumes that the lower 16 bits of the Unicode space will be enough for you. If your bitmap table has, say, U+0020 through U+007E at positions 0x00 to 0x5E, U+00A0 through U+00FF at positions 0x5F to 0xBE, and U+1200 through U+1240 at positions 0xBF to 0xFF, you could do something like the code below (which isn't tested, not even compile-tested).

bitmapmap contains a series of pairs of values. The first value of each pair is the first Unicode code point of a run of directly adjacent code points, and the second value says how many code points that run contains. The runs are stored back to back in the bitmap table, so the first value of the first pair is the code point that the bitmap at index 0 represents.

The first part of the while loop iterates through UTF-8 input and builds up a Unicode code point in ucs2char. Once a complete character is found, the second part searches for that character in one of the ranges mentioned in bitmapmap. If it finds an appropriate bitmap index, it adds it to indexes. Characters for which no bitmap is present are silently dropped.

The function returns the number of bitmap indexes found.

This way of doing things should be memory-efficient in terms of the unicode->bitmap table, reasonably fast and reasonably flexible.

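// Code below assumes C99, but is about three cut-and-pastes from C89
// Assuming an unsigned short is 16-bit

// Pairs of {first code point, number of code points}, terminated by 0.
// The runs are stored back to back in the bitmap table.
const unsigned short bitmapmap[]={0x0020, 0x005F,  // U+0020..U+007E -> 0x00..0x5E
                                  0x00A0, 0x0060,  // U+00A0..U+00FF -> 0x5F..0xBE
                                  0x1200, 0x0041,  // U+1200..U+1240 -> 0xBF..0xFF
                                  0x0000};

// indexes needs room for one entry per input byte in the worst case.
int utf8_to_bitmap_indexes(const unsigned char *utf8, unsigned short *indexes)
{
    int bitmapsfound=0;
    int utf8numchars=0;       // bytes still expected in the current sequence
    unsigned char c;
    unsigned short ucs2char=0;
    while (*utf8)
    {
        c=*utf8++;            // consume the byte so the loop always advances
        if (c>=0xc0)
        {
            // Lead byte: the number of leading 1 bits is the sequence length
            utf8numchars=0;
            while (c&0x80)
            {
                utf8numchars++;
                c<<=1;
            }
            c>>=utf8numchars; // keep only the payload bits
            ucs2char=0;
        }
        else if (utf8numchars && c<0x80)
        {
            // This is invalid UTF-8.  Do our best.
            utf8numchars=0;
        }

        if (utf8numchars)
        {
            c&=0x3f;
            ucs2char<<=6;
            ucs2char+=c;
            utf8numchars--;
            if (utf8numchars)
                continue; // Our work here is done - no char yet
        }
        else
            ucs2char=c;

        // At this point, we have a complete UCS-2 char in ucs2char
        // (sequences longer than 3 bytes do not fit in 16 bits)

        unsigned short bmpsearch=0;
        unsigned short bmpix=0;   // bitmap index of the current run's first glyph
        while (bitmapmap[bmpsearch])
        {
            if (ucs2char>=bitmapmap[bmpsearch] && ucs2char-bitmapmap[bmpsearch]<bitmapmap[bmpsearch+1])
            {
                *indexes++ = bmpix+(ucs2char-bitmapmap[bmpsearch]);
                bitmapsfound++;
                break;
            }

            bmpix+=bitmapmap[bmpsearch+1];
            bmpsearch+=2;
        }
    }
    return bitmapsfound;
}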
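
The decoding half of the loop above could also be factored into a helper that returns one code point per call, which makes the byte patterns easier to see. This is only a sketch: the name utf8_decode_bmp and the UTF8_INVALID value are my own choices, it handles sequences of up to 3 bytes (the Basic Multilingual Plane), it expects a NUL-terminated string, and it does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER (assumed policy) */

/* Decode one code point of up to 3 bytes from the NUL-terminated string s.
 * Stores the result in *cp and returns the number of bytes consumed, which
 * is always at least 1 so the caller keeps moving forward. */
static size_t utf8_decode_bmp(const unsigned char *s, uint16_t *cp)
{
    if (s[0] < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 &&            /* 110xxxxx 10xxxxxx */
        (s[1] & 0xC0) == 0x80) {
        *cp = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 &&            /* 1110xxxx 10xxxxxx 10xxxxxx */
        (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = (uint16_t)(((s[0] & 0x0F) << 12) |
                         ((s[1] & 0x3F) << 6) |
                          (s[2] & 0x3F));
        return 3;
    }
    *cp = UTF8_INVALID;                     /* stray or unsupported byte */
    return 1;
}
```

A caller would advance its pointer by the return value and pass *cp to the range search. Because the continuation checks short-circuit, the helper never reads past the terminating NUL.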

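EDIT: You mentioned that you need more than the lower 16 bits. Apply s/unsigned short/unsigned long/; s/ucs2char/codepoint/; to the code above and it can then cover the whole Unicode space. (unsigned long is suggested rather than unsigned int because int is only guaranteed to be 16 bits wide, which is common on small embedded targets.)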
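
If the number of ranges grows (a CJK localization could need many), the linear scan might be swapped for a binary search over a sorted range table. The sketch below is an alternative to the code above, not a drop-in replacement: struct glyph_range, glyph_ranges, GLYPH_NOT_FOUND and find_glyph are hypothetical names, and storing the base index in each record is an assumption that trades 2 bytes of ROM per range for not having to sum the preceding counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range record: `count` consecutive code points starting at
 * `first`, whose glyphs start at bitmap index `base`. */
struct glyph_range {
    uint32_t first;
    uint16_t count;
    uint16_t base;
};

/* Must be sorted by `first`. The values follow the example layout:
 * three runs packed back to back into 256 bitmap slots. */
static const struct glyph_range glyph_ranges[] = {
    { 0x0020, 0x5F, 0x00 },
    { 0x00A0, 0x60, 0x5F },
    { 0x1200, 0x41, 0xBF },
};

#define GLYPH_NOT_FOUND 0xFFFFu

/* Binary search for the range containing cp.
 * Returns the bitmap index, or GLYPH_NOT_FOUND if no range covers cp. */
static uint16_t find_glyph(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof glyph_ranges / sizeof glyph_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct glyph_range *r = &glyph_ranges[mid];

        if (cp < r->first)
            hi = mid;                       /* look in the lower half */
        else if (cp - r->first >= r->count)
            lo = mid + 1;                   /* look in the upper half */
        else
            return (uint16_t)(r->base + (cp - r->first));
    }
    return GLYPH_NOT_FOUND;
}
```

With only three ranges the linear scan is just as good; the search starts to pay off once a localization has dozens of scattered runs.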

你对谁都笑 2024-07-22 12:18:39

You didn't specify the size of your characters or the size of your character set, so it is difficult to estimate the storage requirements.

I would store the bitmaps in a straight array format. Depending on the size of the characters, this can be fairly space-efficient without any need to pack and unpack elements.

For example, a 36-character alphabet with 8x6 glyphs (8 pixels tall, 6 pixels wide) needs 216 bytes of storage for the array: 6 bytes per character * 36 characters, where each byte is one vertical slice of the character.

For the parsing, it is simply a matter of computing an offset into the table.
The old (char - 'A') and (char - '0') tricks do quite well.
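
As a rough sketch of that layout, assuming the 36 glyphs are stored as 'A' to 'Z' followed by '0' to '9' and that bit 0 of each column byte is the top pixel (both are my assumptions, as are the names font, alnum_glyph_index and glyph_pixel):

```c
#define GLYPH_COLUMNS 6   /* bytes per glyph: one byte per 8-pixel-tall column */
#define GLYPH_COUNT   36  /* 'A'..'Z' then '0'..'9' (assumed ordering) */

/* 36 * 6 = 216 bytes. `const` normally lets the linker keep it in ROM.
 * The glyph data is left zeroed here; real column bytes would go in. */
static const unsigned char font[GLYPH_COUNT][GLYPH_COLUMNS] = {{0}};

/* Returns the glyph's row in `font`, or -1 if the character is unsupported.
 * (ch - 'A') relies on an ASCII-compatible execution character set;
 * (ch - '0') is guaranteed by the C standard. */
static int alnum_glyph_index(char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return ch - 'A';
    if (ch >= '0' && ch <= '9')
        return 26 + (ch - '0');
    return -1;
}

/* Returns 1 if pixel (x, y) of glyph g is set, with x in 0..5 and y in 0..7.
 * Bit 0 of each column byte is taken to be the top row (assumption). */
static int glyph_pixel(int g, int x, int y)
{
    return (font[g][x] >> y) & 1;
}
```

This only works while the supported characters form a few contiguous runs; once the set becomes sparse, a range table is the more general tool.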

The other question is where to store the bitmap array.
ROM is the obvious answer, but supporting additional glyphs later might then require reprogramming the device, and you don't say whether that would be a problem.

If the glyphs must be programmed dynamically, then you don't have a choice but to put it in RAM.
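
One possible middle ground, sketched below under my own assumptions, is to keep the fixed glyphs in ROM and reserve a small RAM table for glyphs loaded at run time. Every name and size here (ROM_GLYPHS, RAM_GLYPHS, glyph_load, glyph_columns) is hypothetical, and the column bytes are placeholders rather than real font data.

```c
#include <string.h>

#define GLYPH_BYTES 6   /* one byte per column, as in the 8x6 example */
#define ROM_GLYPHS  2   /* hypothetical: glyphs baked into the image */
#define RAM_GLYPHS  4   /* hypothetical budget: 4 * 6 = 24 bytes of RAM */

static const unsigned char rom_font[ROM_GLYPHS][GLYPH_BYTES] = {
    { 0x7E, 0x09, 0x09, 0x09, 0x7E, 0x00 },   /* placeholder column data */
    { 0x7F, 0x49, 0x49, 0x49, 0x36, 0x00 },
};
static unsigned char ram_font[RAM_GLYPHS][GLYPH_BYTES];
static unsigned char ram_count;               /* RAM slots in use */

/* Copies one glyph into the next free RAM slot.
 * Returns its glyph index, or -1 when the RAM table is full. */
static int glyph_load(const unsigned char *columns)
{
    if (ram_count >= RAM_GLYPHS)
        return -1;
    memcpy(ram_font[ram_count], columns, GLYPH_BYTES);
    return ROM_GLYPHS + ram_count++;
}

/* Indexes below ROM_GLYPHS resolve to ROM, later ones to loaded RAM glyphs.
 * Returns NULL for an index that has no glyph. */
static const unsigned char *glyph_columns(int index)
{
    if (index < 0)
        return NULL;
    if (index < ROM_GLYPHS)
        return rom_font[index];
    if (index < ROM_GLYPHS + ram_count)
        return ram_font[index - ROM_GLYPHS];
    return NULL;
}
```

The blit routine then only sees a pointer to column bytes and does not need to know which memory the glyph lives in.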
