Escape sequences for char8_t and unsigned char
While trying to use escape sequences to construct a char8_t string (so as not to rely on the file/compiler encoding), I ran into an issue with MSVC.
I wonder whether it is a bug or whether it is implementation-dependent.
Is there a workaround?
#include <algorithm>
#include <iterator>

constexpr char8_t s1[] = u8"\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
constexpr unsigned char s2[] = "\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
//constexpr char8_t s3[] = u8"コ ン ニ チ ハ";
static_assert(std::equal(std::begin(s1), std::end(s1),
                         std::begin(s2), std::end(s2))); // Fails on MSVC
Note: the final goal is to replace std::filesystem::u8path(s2) with std::filesystem::path(s1), because std::filesystem::u8path is deprecated since C++20.
Comments (1)
This is a bug in MSVC that I expect to be fixed at some point during Microsoft's implementation of C++23.
Historically, numeric escape sequences in character and string literals were not well specified in the C++ standard, and this led to a number of core issues. These issues were addressed by P2029, a paper adopted for C++23 in November 2020. The reported MSVC bug (along with an additional one related to non-encodable characters) is discussed in the "Implementation impact" section of the paper.
As mentioned by other commenters, using universal-character-names (UCNs) like \u1234 would be the preferred solution to avoid a dependency on the source file encoding.