How does Microsoft handle the fact that UTF-16 is a variable-length encoding in their C++ standard library implementation?

Posted on 2024-09-29 08:05:15

Having a variable length encoding is indirectly forbidden in the standard.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences

A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.

Questions:

basic_string<wchar_t>

  • How is operator[] implemented and what does it return?
    • standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
  • Does size() return the number of elements or the length of the string?
    • standard: Returns: a count of the number of char-like objects currently in the string.
  • How does resize() work?
    • unrelated to standard, just what does it do
  • How are the positions in insert(), erase() and others handled?

cwctype

  • Pretty much everything in here. How is the variable encoding handled?

cwchar

  • getwchar() obviously can't return a whole platform-character, so how does this work?

Plus all the rest of the character functions (the theme is the same).

Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.

Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those, UTF-8 encoded will still be stored as UTF-16 once read into the string, the same for output), the rest simply contradicts each other. :-/

Comments (5)

三生一梦 2024-10-06 08:05:15

Here's how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
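The behaviour described above can be reproduced with a small sketch. It uses std::u16string as a portable stand-in for a Windows std::wstring (an assumption: char16_t is a 16-bit code unit on every platform, whereas wchar_t is 16 bits only on Windows). The helper names sample and ends_with_lone_high_surrogate are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Builds "a", U+1F600 (encoded as the surrogate pair D83D DE00), "b".
// char16_t stands in for the 16-bit wchar_t used on Windows.
inline std::u16string sample() {
    return std::u16string{u'a', char16_t(0xD83D), char16_t(0xDE00), u'b'};
}

// True if the code unit is a high (leading) surrogate.
inline bool is_high_surrogate(char16_t c) {
    return c >= 0xD800 && c <= 0xDBFF;
}

// True if the string ends in a high surrogate that has lost its low half.
inline bool ends_with_lone_high_surrogate(const std::u16string& s) {
    return !s.empty() && is_high_surrogate(s.back());
}
```

With this string, size() reports 4 although there are only three Unicode characters, s[1] yields the high surrogate on its own, and resize(2) or erase(2, 1) leaves an unpaired surrogate behind without any complaint from the library.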

思慕 2024-10-06 08:05:15

The STL treats a string as simply a wrapper around an array of characters, so size() or length() on an STL string tells you how many char or wchar_t elements it contains, not necessarily how many printable characters the string represents.
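The same point holds for narrow strings. As a minimal sketch, assuming the std::string holds well-formed UTF-8 (the library itself never checks this), a hypothetical helper utf8_code_points can count characters by skipping continuation bytes, and the result will generally differ from size():

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes well-formed input; nothing is validated.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the five bytes "caf\xC3\xA9" ("café" in UTF-8), size() is 5 while the helper reports 4.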

风向决定发型 2024-10-06 08:05:15

Assuming that you're talking about the wstring type, there is no handling of the encoding: it just deals with wchar_t elements without knowing anything about the encoding. It's just a sequence of wchar_t's. You'll need to deal with encoding issues using other functions.
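One example of such an "other function" is a UTF-16 encoder that the caller supplies. The sketch below is an illustration only; append_utf16 is a hypothetical name, and std::u16string is used as a portable stand-in for a wstring with 16-bit wchar_t.

```cpp
#include <string>

// Appends the UTF-16 encoding of code point `cp` to `out`.
// Returns false and appends nothing if `cp` is a surrogate value or lies
// above U+10FFFF, since neither can be encoded.
inline bool append_utf16(std::u16string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(cp));
        return true;
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const char32_t v = cp - 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    return true;
}
```

The string only grows by one or two elements per call; keeping the pair intact afterwards is still the caller's job.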

ぇ气 2024-10-06 08:05:15

Two things:

  1. There is no "Microsoft STL implementation". The C++ Standard Library shipped with Visual C++ is licensed from Dinkumware.
  2. The current C++ Standard knows nothing about Unicode and its encoding forms. std::wstring is merely a container for wchar_t units which happen to be 16-bit on Windows. In practice, if you want to store a UTF-16 encoded string into a wstring, just take into account that you are really storing code units and not code points.
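
The distinction between code units and code points can be made concrete with a short sketch. The function name utf16_code_points is hypothetical, std::u16string stands in for a Windows wstring, and treating an unpaired surrogate as one code point is a choice made for this example, not something the standard prescribes.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 code unit sequence.
// A high surrogate followed by a low surrogate counts once;
// an unpaired surrogate is counted as a single (ill-formed) code point.
inline std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size()
                           && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) {
            ++i; // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

For a string holding 'a', the pair D83D DE00 and 'b', size() returns 4 code units while this function returns 3 code points.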

幸福丶如此 2024-10-06 08:05:15

MSVC stores wchar_t in wstrings. These can be interpreted as Unicode 16-bit words, or anything else really.

If you want to get access to Unicode characters or glyphs, you'll have to process said raw string according to the Unicode standard. You probably also want to handle common corner cases without breaking.

Here is a sketch of such a library. It is about half as memory efficient as it could be, but it does give you in-place access to Unicode glyphs in a std::wstring. It relies on having a decent array_view class, but you want to write one of those anyhow:

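#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// array_view<T> and Assert are assumed to be supplied by the reader;
// neither is a standard facility.
// A unicode_char views the one or two wchar_t code units of a single code point.
struct unicode_char : array_view<wchar_t const> {
  using array_view<wchar_t const>::array_view; // inherit the base constructors

  // Decodes the viewed code unit(s) into a code point.
  uint32_t value() const {
    if (size()==1)
      return front();
    Assert(size()==2);
    if (size()==2)
    {
      uint32_t high = uint32_t(front())-0xD800;
      uint32_t low = uint32_t(back())-0xDC00;
      // surrogate pairs encode U+10000 and above
      return 0x10000 + (high<<10) + low;
    }
    return 0; // error
  }
  static bool is_high_surrogate( wchar_t c ) {
    return (c >= 0xD800 && c <= 0xDBFF);
  }
  static bool is_low_surrogate( wchar_t c ) {
    return (c >= 0xDC00 && c <= 0xDFFF);
  }
  // Returns a view over the first code point of raw (one or two code units).
  static unicode_char extract( array_view<wchar_t const> raw )
  {
    if (raw.empty())
      return {};
    if (raw.size()==1)
      return {raw.begin(), raw.end()};
    if (is_high_surrogate(raw.front()) && is_low_surrogate(*std::next(raw.begin())))
      return {raw.begin(), raw.begin()+2};
    return {raw.begin(), std::next(raw.begin())};
  }
};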
static std::vector<unicode_char> as_unicode_chars( array_view<wchar_t> raw )
{
  std::vector<unicode_char> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_char::extract(raw) );
    Assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
// A unicode_glyph groups a base unicode_char with up to two following characters.
struct unicode_glyph {
  std::array< unicode_char, 3 > buff;
  std::size_t count=0;
  unicode_char const* begin() const {
    return buff.data();
  }
  unicode_char const* end() const {
    return buff.data()+count;
  }
  std::size_t size() const { return count; }
  bool empty() const { return size()==0; }
  unicode_char const& front() const { return *begin(); }
  unicode_char const& back() const { return *std::prev(end()); }
  array_view< unicode_char const > chars() const { return {begin(), end()}; }
  array_view< wchar_t const > wchars() const {
    if (empty()) return {};
    return { front().begin(), back().end() };
  }

  void append( unicode_char next ) {
    Assert(count<3);
    buff[count++] = next;
  }
  unicode_glyph() {}

  static bool is_diacrit(unicode_char c) {
    auto v = c.value();
    return is_diacrit(v);
  }
  static bool is_diacrit(uint32_t v) {
    return
      ((v >= 0x0300) && (v <= 0x036F)) // Combining Diacritical Marks, includes CGJ (U+034F)
    || ((v >= 0x1AB0) && (v <= 0x1AFF))
    || ((v >= 0x1DC0) && (v <= 0x1DFF))
    || ((v >= 0x20D0) && (v <= 0x20FF))
    || ((v >= 0xFE20) && (v <= 0xFE2F));
  }
  // Number of extra characters pulled into the glyph when c follows a base character.
  static std::size_t diacrit_count(unicode_char c) {
    auto v = c.value();
    if (is_diacrit(v))
      return 1 + ((v >= 0x035C)&&(v<=0x0362)); // double diacritics span two base characters
    else
      return 0;
  }
  static unicode_glyph extract( array_view<const unicode_char> raw ) {
    unicode_glyph retval;
    if (raw.empty())
      return retval;
    if (raw.size()==1)
    {
      retval.append(raw.front());
      return retval;
    }
    retval.count = diacrit_count( *std::next(raw.begin()) )+1;
    if (retval.count > raw.size())
      retval.count = raw.size(); // never read past the end of the input
    std::copy( raw.begin(), raw.begin()+retval.count, retval.buff.begin() );
    return retval;
  }
};
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<unicode_char> raw )
{
  std::vector<unicode_glyph> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_glyph::extract(raw) );
    Assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<wchar_t> raw )
{
  return as_unicode_glyphs( as_unicode_chars( raw ) );
}

A smarter bit of code would generate the unicode_chars and unicode_glyphs on the fly with a factory iterator of some kind. A more compact implementation would keep track of the fact that the end pointer of the previous and the begin pointer of the next are always identical, and alias them together. Another optimization would be to use a small object optimization on glyph, based on the assumption that most glyphs are one character, and use dynamic allocation if they are two.
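One possible shape for the on-the-fly approach is sketched below. It is not the design the answer had in mind, only an illustration: code_point_range is a hypothetical name, it decodes code points rather than producing unicode_char views, and it works on std::u16string (assumed here as a portable stand-in for a Windows wstring) so that nothing is stored in an intermediate vector.

```cpp
#include <cstddef>
#include <string>

// Lazily decodes a UTF-16 code unit sequence into code points.
// The range does not own the data; the string must outlive it.
class code_point_range {
public:
    class iterator {
    public:
        iterator(const char16_t* p, const char16_t* end) : p_(p), end_(end) {}

        // Decodes the code point at the current position.
        char32_t operator*() const {
            if (width() == 2) {
                return 0x10000
                     + ((char32_t(p_[0]) - 0xD800) << 10)
                     + (char32_t(p_[1]) - 0xDC00);
            }
            return p_[0]; // BMP code unit, or an unpaired surrogate passed through
        }
        iterator& operator++() { p_ += width(); return *this; }
        bool operator!=(const iterator& other) const { return p_ != other.p_; }

    private:
        // 2 if the current position starts a valid surrogate pair, otherwise 1.
        std::size_t width() const {
            const bool high = p_[0] >= 0xD800 && p_[0] <= 0xDBFF;
            const bool low = (end_ - p_) >= 2
                          && p_[1] >= 0xDC00 && p_[1] <= 0xDFFF;
            return (high && low) ? 2 : 1;
        }
        const char16_t* p_;
        const char16_t* end_;
    };

    explicit code_point_range(const std::u16string& s)
        : begin_(s.data()), end_(s.data() + s.size()) {}

    iterator begin() const { return iterator(begin_, end_); }
    iterator end() const { return iterator(end_, end_); }

private:
    const char16_t* begin_;
    const char16_t* end_;
};
```

A range-based for loop such as `for (char32_t cp : code_point_range(s))` then visits one code point per iteration, advancing by one or two code units as needed.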

Note that I treat CGJ as a standard diacritic, and the double diacritics as a set of 3 characters that form one (Unicode) glyph, but half diacritics don't merge things into one glyph. These are all questionable choices.

This was written in a bout of insomnia. Hope it at least somewhat works.
