C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

Posted on 2024-11-02 04:38:18

I've seen some very clever code out there for converting between Unicode codepoints and UTF-8 so I was wondering if anybody has (or would enjoy devising) this.

  • Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string?
  • Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8.
  • Full UTF-16 with surrogates must be supported.

Specifically I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.

The best UTF-8 to codepoint code I've seen uses vectorizing techniques so I wonder if that's also possible here.

Comments (3)

你是我的挚爱i 2024-11-09 04:38:18

Efficiency is always a speed vs size tradeoff. If speed is favored over size then the most efficient way is just to guess based on the length of the source string.

There are 4 cases to consider; simply take the worst case as the final buffer size:

  • U+0000-U+007F - encoded as 1 byte in UTF-8 and 2 bytes in UTF-16. (1:2 = x2)
  • U+0080-U+07FF - encoded as a 2-byte UTF-8 sequence and 2 bytes in UTF-16. (2:2 = x1)
  • U+0800-U+FFFF - stored as a 3-byte UTF-8 sequence, but still fits in a single UTF-16 code unit. (3:2 = x0.67)
  • U+10000-U+10FFFF - stored as a 4-byte UTF-8 sequence, or as a surrogate pair in UTF-16. (4:4 = x1)

The worst-case expansion factor occurs when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, in bytes, never needs to be more than twice as large as the source string. Every other Unicode code point takes the same number of bytes in UTF-16 as in UTF-8, or fewer.
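
The worst-case bound described above could be sketched roughly as follows. The function name `utf16_max_bytes` is made up for illustration, and adding two bytes for a UTF-16 null terminator is an assumption about how the buffer will be used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: upper bound, in bytes, on the UTF-16 encoding of a
   null-terminated UTF-8 string. Every UTF-8 byte expands to at most two
   UTF-16 bytes (the ASCII case), so 2 * strlen is always enough. */
static size_t utf16_max_bytes(const char *utf8)
{
    assert(utf8 != NULL);
    /* Assumption: the caller also wants room for a 2-byte terminator. */
    return 2 * strlen(utf8) + 2;
}
```

This only scans for the terminator, so it is about as cheap as `strlen`, at the cost of over-allocating for non-ASCII text (for example 8 bytes reserved for a lone U+20AC, which needs 4 including the terminator).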

坏尐絯℡ 2024-11-09 04:38:18

Very simple: count the number of head bytes, double-counting bytes F0 and up.

In code:

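#include <stddef.h>

/* UTF-16 code units needed for validated, null-terminated UTF-8 (terminator excluded). */
size_t count(const unsigned char *s)
{
    size_t l;
    /* +1 per non-continuation byte (outside 0x80..0xBF), +1 more per lead byte >= 0xF0 */
    for (l=0; *s; s++) l+=(*s-0x80U>=0x40)+(*s>=0xf0);
    return l;
}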

Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator you'll also need to account for space for that (one extra code unit/two extra bytes).
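
To make the note about bytes and the terminator concrete, here is a minimal sketch that returns a byte count directly. The name `utf16_bytes_needed` and the `with_terminator` flag are illustrative assumptions, and the input is assumed to be validated UTF-8 as stated in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exact number of bytes the UTF-16 encoding needs.
   Assumes utf8 is validated and null-terminated. */
static size_t utf16_bytes_needed(const char *utf8, int with_terminator)
{
    size_t units = 0;
    assert(utf8 != NULL);
    for (const unsigned char *p = (const unsigned char *)utf8; *p; p++) {
        /* One code unit per non-continuation byte, plus one extra for
           the lead byte of a 4-byte sequence (a surrogate pair). */
        units += ((*p & 0xC0) != 0x80) + (*p >= 0xF0);
    }
    /* Two bytes per code unit; optionally one more unit for the terminator. */
    return 2 * (units + (with_terminator ? 1 : 0));
}
```

For a string holding U+0061, U+00E9, U+20AC and U+1F600 this counts 1 + 1 + 1 + 2 = 5 code units, so 10 bytes, or 12 with the terminator.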

你在看孤独的风景 2024-11-09 04:38:18

It's not an algorithm, but if I understand correctly the rules are as follows:

  • every byte with an MSB of 0 adds 2 bytes (1 UTF-16 code unit)
    • that byte represents a single Unicode codepoint in the range U+0000 - U+007F
  • every byte whose leading bits are 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
    • these bytes start 2- and 3-byte sequences respectively, which represent Unicode codepoints in the range U+0080 - U+FFFF
  • every byte with the 4 MSBs set (i.e. starting with 1111) adds 4 bytes (2 UTF-16 code units)
    • these bytes start 4-byte sequences which cover "the rest" of the Unicode range, which is represented with a high and a low surrogate in UTF-16
  • every other byte (i.e. those starting with 10) can be skipped
    • these bytes are already counted with the others.
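
Since each rule depends only on the top four bits of a byte, one way to sketch them is a 16-entry table indexed by the high nibble. The names `utf16_units_lut` and `units_by_high_nibble` are invented for this example, and the input is assumed to be validated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: UTF-16 code units for validated, null-terminated
   UTF-8, using the high nibble of each byte to pick its contribution. */
static size_t utf16_units_lut(const char *utf8)
{
    static const unsigned char units_by_high_nibble[16] = {
        1, 1, 1, 1, 1, 1, 1, 1, /* 0xxx: ASCII, one code unit       */
        0, 0, 0, 0,             /* 10xx: continuation, skipped      */
        1, 1,                   /* 110x: 2-byte lead, one code unit */
        1,                      /* 1110: 3-byte lead, one code unit */
        2                       /* 1111: 4-byte lead, surrogate pair */
    };
    size_t units = 0;
    assert(utf8 != NULL);
    for (const unsigned char *p = (const unsigned char *)utf8; *p; p++)
        units += units_by_high_nibble[*p >> 4];
    return units;
}
```

Multiply the result by 2 to get the byte counts listed in the rules.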

I'm not a C expert, but this looks easily vectorizable.
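
As a rough illustration of the vectorization idea, the sketch below processes eight bytes at a time with plain 64-bit integer operations (SWAR) instead of real SIMD intrinsics. The name `utf16_units_swar` is hypothetical, the input is assumed to be validated UTF-8, and calling `strlen` first is a simplification rather than a tuned design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)

/* Hypothetical helper: UTF-16 code units for validated, null-terminated UTF-8. */
static size_t utf16_units_swar(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s), units = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8); /* safe unaligned load */
        /* Bit 0 of each byte is set when that byte is 10xxxxxx. */
        uint64_t cont = (w >> 7) & (~w >> 6) & ONES;
        /* Bit 0 of each byte is set when that byte is 1111xxxx. */
        uint64_t four = ((w & (w >> 1) & (w >> 2) & (w >> 3)) >> 4) & ONES;
        /* Multiplying by ONES sums the eight flag bytes into the top byte. */
        units += 8 - (size_t)((cont * ONES) >> 56)
                   + (size_t)((four * ONES) >> 56);
    }
    for (; i < n; i++) { /* scalar tail for the last 0..7 bytes */
        unsigned char c = (unsigned char)s[i];
        units += ((c & 0xC0) != 0x80) + (c >= 0xF0);
    }
    return units;
}
```

The same two masks map naturally onto wider SIMD registers; whether that beats a compiler's auto-vectorization of the simple byte loop would need measuring.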
