How does Microsoft handle the fact that UTF-16 is a variable-length encoding in their C++ standard library implementation?
Having a variable length encoding is indirectly forbidden in the standard.
So I have several questions:
How is the following part of the standard handled?
17.3.2.1.3.3 Wide-character sequences
A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.
Questions:
basic_string<wchar_t>
- How is operator[] implemented and what does it return?
  - standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
- Does size() return the number of elements or the length of the string?
  - standard: Returns: a count of the number of char-like objects currently in the string.
- How does resize() work?
  - unrelated to standard, just what does it do
- How are the positions in insert(), erase() and others handled?
cwctype
- Pretty much everything in here. How is the variable encoding handled?
cwchar
- getwchar() obviously can't return a whole platform-character, so how does this work?
Plus all the rest of the character functions (the theme is the same).
Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.
Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those, UTF-8 encoded will still be stored as UTF-16 once read into the string, the same for output), the rest simply contradicts each other. :-/
Comments (5)
Here's how Microsoft's STL implementation handles the variable-length encoding:
- basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.
- basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.
- basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.
- basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.
- basic_string<wchar_t>::erase() can erase either half of a surrogate pair.
In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
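The behaviour described above can be reproduced with a small sketch. It uses std::u16string rather than std::wstring so that it behaves the same on every platform; on MSVC wchar_t is 16 bits wide, so std::wstring holds the same code units. The helper names is_high_surrogate, is_low_surrogate and surrogate_demo are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// High (lead) surrogates occupy U+D800..U+DBFF, low (trail) surrogates U+DC00..U+DFFF.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical demo: walks through size(), operator[], insert(), erase() and
// resize() on a string holding one non-BMP character, and returns the final string.
std::u16string surrogate_demo() {
    std::u16string s = u"\U0001F600";      // one code point, stored as D83D DE00

    assert(s.size() == 2);                 // size() counts code units, not characters
    assert(is_high_surrogate(s[0]));       // operator[] hands back half a pair
    assert(is_low_surrogate(s[1]));

    s.insert(1, u"x");                     // lands between the two halves
    assert(s.size() == 3 && s[1] == u'x');

    s.erase(1, 1);                         // remove the 'x', pair is adjacent again
    s.resize(1);                           // cuts the pair in half without complaint
    assert(s.size() == 1 && is_high_surrogate(s[0]));
    return s;                              // a lone high surrogate: not valid UTF-16
}
```

Calling surrogate_demo() from main() runs through every step; none of the string operations rejects or repairs the broken pair.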
The STL treats a string as simply a wrapper around an array of characters, so size() or length() on an STL string tells you how many char or wchar_t elements it contains, not necessarily the number of printable characters in the string.
Assuming that you're talking about the wstring type, there would be no handling of the encoding - it just deals with wchar_t elements without knowing anything about the encoding. It's just a sequence of wchar_t's. You'll need to deal with encoding issues using functionality of other functions.
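As one possible example of such an "other function", here is a minimal sketch that counts Unicode code points in a UTF-16 sequence. The name count_code_points is hypothetical, and the choice to count an unpaired surrogate as one code point is an assumption made for the sketch. It takes std::u16string so that it is portable; on MSVC the same loop works on std::wstring because wchar_t is 16 bits there.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 code-unit sequence.
// A well-formed surrogate pair counts once; an unpaired surrogate also counts once.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < s.size()) {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i;  // skip the low half of a valid pair
            }
        }
        ++count;
    }
    return count;
}
```

For a string such as u"a\U0001F600b", size() reports four code units while count_code_points reports three code points.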
Two things:
MSVC stores wchar_t in wstrings. These can be interpreted as Unicode 16-bit code units, or anything else really.
If you want to get access to Unicode characters or glyphs, you'll have to process said raw string according to the Unicode standard. You probably also want to handle common corner cases without breaking.
Here is a sketch of such a library. It is about half as memory efficient as it could be, but it does give you in-place access to Unicode glyphs in a std::string. It relies on having a decent array_view class, but you want to write one of those anyhow:
[code listing missing from this copy]
A smarter bit of code would generate the unicode_chars and unicode_glyphs on the fly with a factory iterator of some kind. A more compact implementation would keep track of the fact that the end pointer of the previous and begin pointer of the next are always identical, and alias them together. Another optimization would be to use a small object optimization on glyph based off the assumption that most glyphs are one character, and use dynamic allocation if they are two.
Note that I treat CGJ as a standard diacritic, and the double-diacritics as a set of 3 characters that form one (unicode), but half-diacritics don't merge things into one glyph. These are all questionable choices.
This was written in a bout of insomnia. Hope it at least somewhat works.
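The first step of that kind of processing, turning raw 16-bit units into code points, might look like the sketch below. This is not the answerer's library: decode_utf16 is a hypothetical name, it stops at code points and does no glyph or diacritic grouping, and replacing unpaired surrogates with U+FFFD is just one assumed way of handling a corner case without breaking.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-16 code units into Unicode code points.
// Unpaired surrogates become U+FFFD (REPLACEMENT CHARACTER) instead of aborting.
std::vector<char32_t> decode_utf16(const std::u16string& s) {
    std::vector<char32_t> out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char32_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            const char32_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the two 10-bit halves and add the 0x10000 offset.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // the low half has been consumed
                continue;
            }
        }
        // Either a BMP code unit, or a surrogate with no partner.
        out.push_back((u >= 0xD800 && u <= 0xDFFF) ? char32_t(0xFFFD) : u);
    }
    return out;
}
```

Grouping the resulting code points into glyphs (combining marks, CGJ and so on) would be a separate pass on top of this.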