What is the maximum number of Unicode combining characters that may be needed to render one glyph in real-life languages?
I'm working on Unicode support in a Linux console application. I need to change the screen buffer format so that it stores Unicode glyphs instead of bytes representing ASCII characters. Unicode has combining characters, so more than one Unicode code point can be rendered into one console cell.
The question is: what is the maximum number of Unicode combining characters that may be needed to render one glyph in real-life languages? For example, are there any languages in the world with glyphs that need more than 8 combining characters to render? Assume that I don't need "Zalgo text" support, because supporting it would mean storing each console buffer glyph in a dynamic-length variable, which degrades performance.
Nobody can be an expert in what makes up a "real-life" character in every language, so I might be missing some longer sequences here. But I do know about a lot of emoji! There are a few emojis for flags of geographic subdivisions which are implemented with combining codepoints. For example, the flag for Scotland, 🏴󠁧󠁢󠁳󠁣󠁴󠁿 (U+1F3F4 followed by six tag characters: U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F), is 7 codepoints, taking up 28 bytes in UTF-32.
Country flags, like 🇺🇸 (U+1F1FA U+1F1F8, a pair of regional indicator symbols), have just two combining codepoints.
Family emojis with 4 people, like 👨‍👩‍👧‍👦, are also 7 codepoints: four people joined by three zero-width joiners (U+200D). The only emojis I'm aware of that are longer are family emojis with a skin tone specified for each family member, but these don't have a lot of support right now. Here's how one displays on your device: 👨🏻‍👩🏻‍👧🏻‍👦🏻 (if you just see four heads, then you don't have a font installed that supports this). That emoji has 11 codepoints.
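If you want to check these counts yourself, here is a minimal Python sketch. The sequences are written as escapes so they survive any font or encoding problems. The particular choices (the US flag, the light skin tone modifier U+1F3FB) are my own assumptions for illustration, and the helper names `codepoint_count` and `utf32_size` are made up for this example.

```python
# Emoji sequences spelled out as explicit code points.
# The specific flag and skin tone are illustrative assumptions.
SCOTLAND = (
    "\U0001F3F4"                                        # waving black flag
    "\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074"  # tag letters g b s c t
    "\U000E007F"                                        # cancel tag
)
US_FLAG = "\U0001F1FA\U0001F1F8"  # regional indicators U and S
ZWJ = "\u200D"                    # zero-width joiner
FAMILY = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
TONE = "\U0001F3FB"               # light skin tone modifier
FAMILY_TONED = ZWJ.join(
    person + TONE
    for person in ["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"]
)


def codepoint_count(s: str) -> int:
    """Number of Unicode code points (a Python str is indexed by code point)."""
    return len(s)


def utf32_size(s: str) -> int:
    """Bytes needed in UTF-32 without a byte order mark (4 per code point)."""
    return len(s.encode("utf-32-le"))


if __name__ == "__main__":
    for name, seq in [
        ("Scotland flag", SCOTLAND),
        ("Country flag", US_FLAG),
        ("Family", FAMILY),
        ("Family with skin tones", FAMILY_TONED),
    ]:
        print(f"{name}: {codepoint_count(seq)} codepoints, {utf32_size(seq)} bytes in UTF-32")
```

So a fixed cell of 8 codepoints would hold everything here except the skin-toned family, which would need 11.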
That being said, keep in mind that not all languages are rendered as a series of glyphs in sequence:
أهلا
is segmented using Unicode rules into 4 distinct characters.
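To see what those 4 characters are, here is a small Python sketch using only the standard `unicodedata` module. It lists each code point and shows that none of them is a combining mark. As far as I know, a renderer still joins the letters and draws the final lam and alef as a ligature, so the number of code points does not tell you how many separate shapes end up on screen. The variable names are just for this example.

```python
import unicodedata

# The Arabic word from above, written as escapes (right-to-left when displayed).
WORD = "\u0623\u0647\u0644\u0627"

# Official Unicode name of each code point, in logical (typing) order.
NAMES = [unicodedata.name(ch) for ch in WORD]

# Canonical combining class; 0 means the character is not a combining mark.
COMBINING_CLASSES = [unicodedata.combining(ch) for ch in WORD]

if __name__ == "__main__":
    for ch, name, ccc in zip(WORD, NAMES, COMBINING_CLASSES):
        print(f"U+{ord(ch):04X}  {name}  (combining class {ccc})")
```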