C: Most efficient way to determine how many bytes the UTF-16 encoding of a UTF-8 string needs
I've seen some very clever code out there for converting between Unicode codepoints and UTF-8 so I was wondering if anybody has (or would enjoy devising) this.
- Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string.
- Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8.
- Full UTF-16 with surrogates must be supported.
Specifically I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.
The best UTF-8 to codepoint code I've seen uses vectorizing techniques so I wonder if that's also possible here.
Comments (3)
Efficiency is always a speed vs size tradeoff. If speed is favored over size then the most efficient way is just to guess based on the length of the source string.
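There are 4 cases that need to be considered; simply take the worst case as the final buffer size:
- U+0000-U+007F: 1 byte in UTF-8, 2 bytes in UTF-16 (x2)
- U+0080-U+07FF: 2 bytes in UTF-8, 2 bytes in UTF-16 (x1)
- U+0800-U+FFFF: 3 bytes in UTF-8, 2 bytes in UTF-16 (x0.67)
- U+10000-U+10FFFF: 4 bytes in UTF-8, 4 bytes (a surrogate pair) in UTF-16 (x1)
The worst-case expansion factor is when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, bytewise, merely has to be twice as large as the source string. Every other Unicode codepoint results in an equal or smaller bytewise allocation when encoded as UTF-16 than as UTF-8.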
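As a rough illustration of the "guess from the source length" approach, the bound can be computed with a single strlen. This is only a sketch: the helper name utf16_max_bytes is hypothetical, and the extra 2 bytes assume you also want room for a UTF-16 null terminator.

```c
#include <stddef.h>
#include <string.h>

/* Upper bound, in bytes, on the UTF-16 encoding of a null-terminated UTF-8
 * string, including a 2-byte UTF-16 terminator (an assumption of this sketch).
 * The name utf16_max_bytes is hypothetical. */
size_t utf16_max_bytes(const char *utf8)
{
    /* Worst case is pure ASCII: every 1-byte sequence becomes one 2-byte
     * code unit. All other sequences grow by a factor of 1 or less. */
    return 2 * strlen(utf8) + 2;
}
```

The result may over-allocate (by up to a factor of three for text made of 3-byte sequences), which is the size cost this answer accepts in exchange for speed.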
Very simple: count the number of head bytes, double-counting bytes 0xF0 and up. In code:
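A minimal sketch of that counting rule, assuming validated, null-terminated input as stated in the question. The function name utf16_units_from_utf8 is hypothetical, and this is an illustration of the rule rather than a definitive implementation:

```c
#include <stddef.h>

/* Returns the number of UTF-16 code units (not bytes) needed for a
 * validated, null-terminated UTF-8 string. Hypothetical helper name. */
size_t utf16_units_from_utf8(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t units = 0;

    for (; *s; s++) {
        /* Head byte: anything that is not a 10xxxxxx continuation byte. */
        units += (*s & 0xC0) != 0x80;
        /* Lead byte of a 4-byte sequence: it becomes a surrogate pair,
         * so count it a second time. */
        units += *s >= 0xF0;
    }
    return units;
}
```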
Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator you'll also need to account for space for that (one extra code unit/two extra bytes).
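It's not an algorithm, but if I understand correctly the rules are as such:
- every byte starting with 0 adds 2 bytes (1 UTF-16 code unit)
- every byte starting with 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
- every byte with the 4 most significant bits set (i.e. starting with 1111) adds 4 bytes (2 UTF-16 code units)
- every other byte (i.e. those starting with 10) can be skipped
I'm not a C expert, but this looks easily vectorizable.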
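One way the rules above might be applied to several bytes at once, without SIMD intrinsics, is a word-at-a-time (SWAR) loop that classifies 8 bytes per iteration using only their top bits. This is a sketch under assumptions: the name utf16_bytes_swar is hypothetical, the input is validated UTF-8, and the caller passes the length explicitly so the loop never reads past the terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)

/* Counts the flags in a word that has at most bit 7 set in each byte. */
static size_t count_flags(uint64_t flags)
{
    /* Move each flag to bit 0 of its byte, then sum the 8 bytes into the
     * top byte with one multiply. The sum is at most 8, so it cannot overflow. */
    return (size_t)(((flags >> 7) * UINT64_C(0x0101010101010101)) >> 56);
}

/* Bytes needed for the UTF-16 encoding (no terminator) of len bytes of
 * validated UTF-8. Hypothetical helper name. */
size_t utf16_bytes_swar(const char *utf8, size_t len)
{
    size_t units = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, utf8 + i, 8);  /* safe unaligned load */

        /* Bit 7 set where the byte is 10xxxxxx (continuation, skipped). */
        uint64_t cont = w & ~(w << 1) & HIGH_BITS;
        /* Bit 7 set where the byte is 1111xxxx (4-byte lead, counted twice). */
        uint64_t four = w & (w << 1) & (w << 2) & (w << 3) & HIGH_BITS;

        units += 8 - count_flags(cont) + count_flags(four);
    }

    /* Tail: apply the same rules one byte at a time. */
    for (; i < len; i++) {
        unsigned char b = (unsigned char)utf8[i];
        units += ((b & 0xC0) != 0x80) + (b >= 0xF0);
    }
    return 2 * units;
}
```

Because each test only moves bits within a byte up to that byte's bit 7, the result does not depend on endianness. A caller with a null-terminated string could pass strlen(s) as len.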