How does Microsoft handle the fact that UTF-16 is a variable-length encoding in their C++ standard library implementation?

Posted 2024-09-29 08:05:15

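The standard indirectly forbids variable-length encodings.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences
A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.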

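Questions:

basic_string

How is operator[] implemented, and what does it return?
- Standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
Does size() return the number of elements or the length of the string?
- Standard: Returns: a count of the number of char-like objects currently in the string.
How does resize() work?
- Unrelated to the standard, just what does it do?
How are positions handled in insert(), erase(), and the others?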

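cwctype

Pretty much everything in here. How is the variable-length encoding handled?

cwchar

getwchar() obviously cannot return a whole platform character, so how does it work?

Plus all the rest of the character functions (the theme is the same).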

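Edit: I will be opening a bounty to get some confirmation. I want some clear answers, or at least a clearer distribution of votes.

Edit: This is starting to get pointless. The thread is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those: a UTF-8 encoded string is still stored as UTF-16 once it is read into the string, and the same goes for output), and the rest simply contradict each other. :-/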

三生一梦 2024-10-06 08:05:15

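Here is how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[]() can return a low or a high surrogate in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor does it enforce that it remains UTF-16.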
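The behavior listed above can be checked with a few lines. This is only a sketch: it uses std::u16string and char16_t rather than std::wstring, on the assumption that the reader may not be on Windows (wchar_t is 16 bits there but often 32 bits elsewhere). The helper names make_sample, is_surrogate and truncated are made up for the illustration.

```cpp
#include <cstddef>
#include <string>

// 'a' followed by U+1F600, which UTF-16 stores as the surrogate pair
// 0xD83D 0xDE00. char16_t keeps the example independent of wchar_t's width.
inline std::u16string make_sample() {
    return u"a\U0001F600";
}

// True if the code unit is either half of a surrogate pair.
inline bool is_surrogate(char16_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// Cut the string to n code units exactly as resize() does: the string class
// makes no check for whether the cut falls inside a surrogate pair.
inline std::u16string truncated(std::u16string s, std::size_t n) {
    s.resize(n);
    return s;
}
```

With these helpers, make_sample().size() is 3 (code units, not characters), make_sample()[1] is the lone high surrogate 0xD83D, and truncated(make_sample(), 2) ends in that unpaired surrogate.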

思慕 2024-10-06 08:05:15

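The STL treats strings simply as a wrapper around an array of characters, so size() or length() on an STL string tells you how many char or wchar_t elements it contains, which is not necessarily the number of printable characters in the string.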

风向决定发型 2024-10-06 08:05:15

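Assuming you are talking about the wstring type, no encoding is handled at all: it just deals with wchar_t elements without knowing anything about their encoding. It is simply a sequence of wchar_t. You will need to deal with encoding issues using other functions.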
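As an illustration of what such an "other function" might look like, here is a rough sketch of a truncation helper that refuses to split a surrogate pair. The names truncate_on_boundary, is_lead_surrogate and is_trail_surrogate are hypothetical, and std::u16string stands in for std::wstring so the code behaves the same where wchar_t is not 16 bits.

```cpp
#include <cstddef>
#include <string>

inline bool is_lead_surrogate(char16_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_trail_surrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Shrink s to at most n code units without leaving half a surrogate pair
// at the end. Strings already shorter than n are left untouched.
inline void truncate_on_boundary(std::u16string& s, std::size_t n) {
    if (n >= s.size()) return;
    // If the cut would fall between a lead and a trail surrogate,
    // back up one unit so the whole pair is dropped.
    if (n > 0 && is_lead_surrogate(s[n - 1]) && is_trail_surrogate(s[n])) --n;
    s.resize(n);
}
```

The same boundary check could be applied before insert() or erase(); the string class itself will not do it for you.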

ぇ气 2024-10-06 08:05:15

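Two things:

There is no "Microsoft STL implementation". The C++ Standard Library shipped with Visual C++ is licensed from Dinkumware.
The current C++ standard knows nothing about Unicode and its encoding forms. std::wstring is merely a container of wchar_t units, which happen to be 16 bits wide on Windows. In practice, if you want to store a UTF-16 encoded string in a wstring, just keep in mind that you are storing code units, not code points.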
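To make the code unit versus code point distinction concrete, here is a small sketch that counts code points in a UTF-16 string. The function name count_code_points is invented for this example, std::u16string is used instead of std::wstring for portability, and counting an unpaired surrogate as one code point is a choice made here, not something the standard prescribes.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string. A well-formed surrogate pair counts
// once; an unpaired surrogate is (arbitrarily) counted as one code point.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        const bool lead = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool trail_follows =
            i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (lead && trail_follows) ++i;  // skip the second half of the pair
    }
    return n;
}
```

For u"a\U0001F600b", size() reports 4 code units while count_code_points reports 3 code points.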

幸福丶如此 2024-10-06 08:05:15

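MSVC stores wchar_t in wstrings. These can be interpreted as Unicode 16-bit words, or as anything else really.

If you want access to Unicode characters or glyphs, you have to process the raw string according to the Unicode standard. You probably also want to handle common corner cases without breaking.

Here is a sketch of such a library. It is about half as memory-efficient as it could be, but it does give you in-place access to the Unicode glyphs in a std::wstring. It relies on having a decent array_view class, but you want to write one of those anyhow: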

#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// One Unicode code point: a view over one or two UTF-16 code units.
struct unicode_char : array_view<wchar_t const> {
  using array_view<wchar_t const>::array_view;

  uint32_t value() const {
    if (size()==1)
      return front();
    assert(size()==2);
    if (size()==2)
    {
      // Combine the two halves and add the supplementary-plane offset.
      uint32_t high = uint32_t(front())-0xD800;
      uint32_t low = uint32_t(back())-0xDC00;
      return 0x10000 + (high<<10) + low;
    }
    return 0; // error
  }
  static bool is_high_surrogate( wchar_t c ) {
    return (c >= 0xD800 && c <= 0xDBFF);
  }
  static bool is_low_surrogate( wchar_t c ) {
    return (c >= 0xDC00 && c <= 0xDFFF);
  }
  // Take the first code point (one or two code units) off the front of raw.
  static unicode_char extract( array_view<wchar_t const> raw )
  {
    if (raw.empty())
      return {};
    if (raw.size()>=2 && is_high_surrogate(raw.front()) && is_low_surrogate(*std::next(raw.begin())))
      return {raw.begin(), raw.begin()+2};
    // A single unit, or an unpaired surrogate, is returned on its own.
    return {raw.begin(), std::next(raw.begin())};
  }
};
static std::vector<unicode_char> as_unicode_chars( array_view<wchar_t const> raw )
{
  std::vector<unicode_char> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_char::extract(raw) );
    assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
// A base code point plus up to two following code points that combine with it.
struct unicode_glyph {
  std::array< unicode_char, 3 > buff;
  std::size_t count=0;
  unicode_char const* begin() const {
    return buff.data();
  }
  unicode_char const* end() const {
    return buff.data()+count;
  }
  std::size_t size() const { return count; }
  bool empty() const { return size()==0; }
  unicode_char const& front() const { return *begin(); }
  unicode_char const& back() const { return *std::prev(end()); }
  array_view< unicode_char const > chars() const { return {begin(), end()}; }
  array_view< wchar_t const > wchars() const {
    if (empty()) return {};
    return { front().begin(), back().end() };
  }

  void append( unicode_char next ) {
    assert(count<3);
    buff[count++] = next;
  }
  unicode_glyph() {}

  static bool is_diacrit(unicode_char c) {
    auto v = c.value();
    return is_diacrit(v);
  }
  // Combining-mark blocks; 0x0300-0x036F includes CGJ and the double diacritics.
  static bool is_diacrit(uint32_t v) {
    return
      ((v >= 0x0300) && (v <= 0x036F))
    || ((v >= 0x1AB0) && (v <= 0x1AFF))
    || ((v >= 0x1DC0) && (v <= 0x1DFF))
    || ((v >= 0x20D0) && (v <= 0x20FF))
    || ((v >= 0xFE20) && (v <= 0xFE2F));
  }
  // How many code points after the base one belong to the same glyph.
  static size_t diacrit_count(unicode_char c) {
    auto v = c.value();
    if (is_diacrit(v))
      return 1 + ((v >= 0x035C)&&(v<=0x0362)); // double diacritics span two
    else
      return 0;
  }
  static unicode_glyph extract( array_view<unicode_char const> raw ) {
    unicode_glyph retval;
    if (raw.empty())
      return retval;
    if (raw.size()==1)
    {
      retval.append(raw.front());
      return retval;
    }
    // Never take more code points than are actually left in raw.
    retval.count = std::min<std::size_t>( diacrit_count( *std::next(raw.begin()) )+1, raw.size() );
    std::copy( raw.begin(), raw.begin()+retval.count, retval.buff.begin() );
    return retval;
  }
};
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<unicode_char const> raw )
{
  std::vector<unicode_glyph> retval;
  retval.reserve( raw.size() ); // usually 1:1
  while(!raw.empty())
  {
    retval.push_back( unicode_glyph::extract(raw) );
    assert( retval.back().size() <= raw.size() );
    raw = {raw.begin() + retval.back().size(), raw.end()};
  }
  return retval;
}
static std::vector<unicode_glyph> as_unicode_glyphs( array_view<wchar_t const> raw )
{
  return as_unicode_glyphs( as_unicode_chars( raw ) );
}
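The sketch above leans on an array_view class that the answer does not show. A minimal version that would probably be enough for it might look like the following. This is an assumption about what "decent" means here, not the answerer's class: it only provides the pointer-pair constructor, a conversion to a view of const elements, a read-only view over a std::vector, and the handful of accessors the code uses.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Minimal non-owning view over the contiguous range [first, last).
template <class T>
class array_view {
    T* first_ = nullptr;
    T* last_ = nullptr;

public:
    array_view() = default;
    array_view(T* first, T* last) : first_(first), last_(last) {}

    // array_view<U> -> array_view<U const>
    template <class U, class = typename std::enable_if<
                           std::is_same<const U, T>::value>::type>
    array_view(array_view<U> other)
        : first_(other.begin()), last_(other.end()) {}

    // Read-only view over a vector. The view does not extend the vector's
    // lifetime, so it must not outlive the vector it was built from.
    template <class U = T, class = typename std::enable_if<
                               std::is_const<U>::value>::type>
    array_view(const std::vector<typename std::remove_const<U>::type>& v)
        : first_(v.data()), last_(v.data() + v.size()) {}

    T* begin() const { return first_; }
    T* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    bool empty() const { return first_ == last_; }
    T& front() const { return *first_; }  // precondition: !empty()
    T& back() const { return *(last_ - 1); }  // precondition: !empty()
};
```

A std::wstring s would then be passed as array_view<wchar_t const>{s.data(), s.data() + s.size()}.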

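A smarter bit of code would generate the unicode_chars and unicode_glyphs on the fly with a factory iterator of some kind. A more compact implementation would keep track of the fact that the end pointer of the previous element and the begin pointer of the next are always identical, and alias them together. Another optimization would be a small-object optimization on glyph, based on the assumption that most glyphs are one character, with dynamic allocation only when they are two or more.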
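A very rough sketch of the "on the fly" idea, without a full iterator class: decode one code point at a time and advance an index, so no vector of views is ever built. The name next_code_point is hypothetical, std::u16string replaces std::wstring for portability, and returning an unpaired surrogate unchanged is just one possible error policy.

```cpp
#include <cstddef>
#include <string>

// Decode the code point that starts at s[i] and advance i past it.
// Precondition: i < s.size(). An unpaired surrogate is returned as-is.
inline char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    const char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the second half of the pair
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10)
                           + (char32_t(lo) - 0xDC00);
        }
    }
    return hi;
}
```

A caller would loop with for (std::size_t i = 0; i < s.size(); ) { char32_t cp = next_code_point(s, i); ... }. Wrapping this in an iterator type is the natural next step.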

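Note that I treat CGJ as a standard diacritic, and the double diacritics as a set of 3 characters that form one glyph, but half diacritics don't merge things into one glyph. These are all questionable choices.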

This was written in a bout of insomnia. Hope it at least somewhat works.
