Finding "actual" characters (graphemes) in a QString
Let's say I have a QString that may contain any Unicode characters, and I want to iterate through its characters or count them. By "characters" I mean what the user perceives as such (roughly equivalent to "glyphs"), not simply QChars (16-bit UTF-16 code units). Some "actual" characters are built of several QChars (surrogate pairs; base character + combining marks). For some combining characters I might get away with normalizing the string to create composite characters, but that does not always help, because not every combination has a precomposed form.
Have I overlooked a built-in function that splits a QString into "actual" characters?
Or if I have to parse it myself, is this the structure (in EBNF) or am I missing something?
character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}
(with base_character being every QChar that is not a surrogate or combining character)
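To make the grammar above concrete, here is a minimal sketch of a splitter that follows it literally. It uses plain std::u16string instead of QString so it has no Qt dependency, and the names splitByGrammar and isCombiningMark are made up for illustration. The combining-mark check is a loud simplification: it only knows a handful of BMP blocks. In Qt, QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::isMark() would play these roles.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: only a few BMP combining-mark blocks are recognised here.
// A real implementation would consult the Unicode character database.
static bool isCombiningMark(char16_t c) {
    return (c >= 0x0300 && c <= 0x036F)    // Combining Diacritical Marks
        || (c >= 0x1AB0 && c <= 0x1AFF)    // Combining Diacritical Marks Extended
        || (c >= 0x1DC0 && c <= 0x1DFF)    // Combining Diacritical Marks Supplement
        || (c >= 0x20D0 && c <= 0x20FF)    // Combining Diacritical Marks for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);   // Combining Half Marks
}

static bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: splits UTF-16 text according to
// character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}
std::vector<std::u16string> splitByGrammar(const std::u16string &s) {
    std::vector<std::u16string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t start = i;
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            i += 2;  // well-formed surrogate pair
        else
            i += 1;  // base character (or a stray unit, kept as-is)
        while (i < s.size() && isCombiningMark(s[i]))
            ++i;     // absorb trailing combining marks
        out.push_back(s.substr(start, i - start));
    }
    return out;
}
```

With this sketch, "e" followed by U+0301 comes out as one element. The grammar still misses cases such as Hangul jamo sequences, emoji joined with zero-width joiners, and combining marks outside the BMP, so it is an approximation of what users perceive as characters.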
Comments (2)
After more research I found the term for an "actual character": grapheme. With it I also found the Qt class for finding grapheme boundaries, QTextBoundaryFinder.
I am not sure about the combining marks, but for the surrogate pairs I think you can use QString::toUcs4(), which returns a 32-bit (UCS-4) representation of your string, with one element per Unicode code point.