Does C++0x support conversion between std::wstring and UTF-8 byte sequences?

Posted on 2024-07-15 05:51:42

I saw that C++0x will add support for UTF-8, UTF-16 and UTF-32 literals. But what about conversions between the three representations?

I plan to use std::wstring everywhere in my code, but I also need to manipulate UTF-8 encoded data when dealing with files and the network. Will C++0x also provide support for these operations?

Comments (2)

马蹄踏│碎落叶 2024-07-22 05:51:42

In C++0x, char16_t and char32_t, not wchar_t, will be used to store UTF-16 and UTF-32.

From the draft n2798:

22.2.1.4 Class template codecvt

2 The class codecvt is for use when converting from one codeset to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.

3 The specializations required in Table 76 (22.1.1.1.1) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementor.

Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
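
As a rough illustration of the codecvt<char32_t, char, mbstate_t> specialization described in the draft text above, here is a minimal sketch that decodes UTF-8 into a std::u32string through the facet's `in` member. The helper name `utf8_to_utf32` is hypothetical, and the sketch assumes a standard library that actually ships this specialization (C++20 later deprecated it in favor of char8_t-based ones, so newer compilers may warn).

```cpp
#include <cstddef>    // std::size_t
#include <cwchar>     // std::mbstate_t
#include <locale>     // std::codecvt, std::use_facet, std::locale
#include <stdexcept>  // std::runtime_error
#include <string>

// Hypothetical helper (not part of the draft): decode UTF-8 bytes into
// UTF-32 code points using the facet quoted above.
std::u32string utf8_to_utf32(const std::string& utf8) {
    typedef std::codecvt<char32_t, char, std::mbstate_t> Codec;
    // Assumption: the classic "C" locale provides this facet.
    const Codec& codec = std::use_facet<Codec>(std::locale::classic());

    // Every code point needs at least one input byte, so this is large enough.
    std::u32string out(utf8.size(), U'\0');
    std::mbstate_t state = std::mbstate_t();

    const char* from = utf8.data();
    const char* from_next = from;
    char32_t* to = &out[0];
    char32_t* to_next = to;

    std::codecvt_base::result r =
        codec.in(state, from, from + utf8.size(), from_next,
                 to, to + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("invalid or truncated UTF-8 input");
    }

    // Shrink to the number of code points actually written.
    out.resize(static_cast<std::size_t>(to_next - to));
    return out;
}
```

The output buffer is sized from the byte count, so the function makes a single call to `in` and then trims the result instead of looping on `partial`.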

The thing about wchar_t is that it gives you no guarantees about the encoding used. It is just a type that can hold a wide character. Period. If you are going to write software now, you have to live with this compromise. C++0x-compliant compilers are still a long way off. You can always give the VC2010 CTP and g++ a try, for what it is worth. Moreover, wchar_t has different sizes on different platforms, which is another thing to watch out for (2 bytes on VS/Windows, 4 bytes on GCC/Mac, and so on). Options like GCC's -fshort-wchar complicate the issue further.
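
To make the size issue concrete, here is a hedged sketch of converting a std::wstring to UTF-8 by hand. It assumes, and the standard does not guarantee this, that a 2-byte wchar_t holds UTF-16 and a wider wchar_t holds UTF-32. The names `append_utf8` and `wide_to_utf8` are hypothetical.

```cpp
#include <cstddef>    // std::size_t
#include <cstdint>    // std::uint32_t
#include <stdexcept>  // std::invalid_argument
#include <string>

// Append one Unicode code point to `out` as 1 to 4 UTF-8 bytes.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Assumption: wchar_t is UTF-16 when it is 2 bytes wide, UTF-32 otherwise.
std::string wide_to_utf8(const std::wstring& wide) {
    const bool utf16 = (sizeof(wchar_t) == 2);
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        if (utf16) {
            cp &= 0xFFFFu;  // guard against sign extension of a 16-bit unit
            // Combine a high/low surrogate pair into a single code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size()) {
                std::uint32_t lo =
                    static_cast<std::uint32_t>(wide[i + 1]) & 0xFFFFu;
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;  // the low surrogate has been consumed
                }
            }
        }
        // Unpaired surrogates and out-of-range values are not encodable.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            throw std::invalid_argument("not a valid Unicode code point");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The surrogate-pair branch runs only where wchar_t is 2 bytes, which is the platform difference the paragraph above warns about.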

The best solution, therefore, is to use an existing library. Chasing Unicode bugs around isn't the best possible use of effort/time. I'd suggest you take a look at:

More on C++0x Unicode string literals here

早乙女 2024-07-22 05:51:42

Thank you dirkgently. I'm not yet registered, so I can't upvote or respond directly as a comment.

I've learned something about codecvt. I knew about the libraries you suggest, and the following resource may also be useful: http://www.unicode.org/Public/PROGRAMS/CVTUTF/

The project is a library that should be open source, so I would prefer to minimize dependencies on external libraries. I already depend on libgc and boost, though for the latter I only use threads. I would really prefer to stick to the C++ standard, and I'm a bit disappointed that GC support has somehow been dropped.

Apparently VC++ Express 2008, as well as icc, is said to support most of the C++0x standard. Since I currently develop with VC++ and it will still take some time until the library is released, I'd like to try using codecvt and char32_t strings.

Does anyone know how to do this? Should I post another question?
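
One possible way to try this, sketched under the assumption that the standard library implements the codecvt<char32_t, char, mbstate_t> specialization quoted from the draft: fetch the facet from the classic locale and call its `out` member to encode a std::u32string as UTF-8. The helper name `utf32_to_utf8` is hypothetical, and compiler support at the time of the thread may well differ.

```cpp
#include <cstddef>    // std::size_t
#include <cwchar>     // std::mbstate_t
#include <locale>     // std::codecvt, std::use_facet, std::locale
#include <stdexcept>  // std::runtime_error
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32) {
    typedef std::codecvt<char32_t, char, std::mbstate_t> Codec;
    // Assumption: the classic "C" locale provides this facet.
    const Codec& codec = std::use_facet<Codec>(std::locale::classic());

    // UTF-8 needs at most 4 bytes per code point.
    std::string out(utf32.size() * 4, '\0');
    std::mbstate_t state = std::mbstate_t();

    const char32_t* from = utf32.data();
    const char32_t* from_next = from;
    char* to = &out[0];
    char* to_next = to;

    std::codecvt_base::result r =
        codec.out(state, from, from + utf32.size(), from_next,
                  to, to + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("code point cannot be encoded as UTF-8");
    }

    // Shrink to the number of bytes actually written.
    out.resize(static_cast<std::size_t>(to_next - to));
    return out;
}
```

The resulting std::string can then be written to a file or socket as raw bytes, which keeps UTF-8 at the I/O boundary and char32_t strings inside the program.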
