Cross-platform strings (and Unicode) in C++
So I've finally gotten back to my main task - porting a rather large C++ project from Windows to the Mac.
Straight away I've been hit by the problem that wchar_t is 16 bits on Windows but 32 bits on the Mac. This is a problem because all of the strings are represented by wchar_t, and string data will be going back and forth between Windows and Mac machines (both as on-disk data and as network data). Because of the way the project works, it wouldn't be totally straightforward to convert the strings into some common format before sending and receiving the data.
We've also really started to support a lot more languages recently and so we're starting to deal with a lot of Unicode data (as well as dealing with right-to-left languages).
Now, I could be conflating multiple ideas here and causing more problems for myself than needed, which is why I'm asking this question. We're thinking that storing all of our in-memory string data as UTF-8 makes a lot of sense. It solves the problem of wchar_t being different sizes, it means we can easily support multiple languages, and it also dramatically reduces our memory footprint (we have a LOT of - mostly English - strings loaded) - but it doesn't seem like many people are doing this. Is there something we're missing? There's the obvious problem you have to deal with, where the length of a string in characters can be less than the number of bytes used to store it.
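To make the length-versus-bytes point concrete, here is a minimal sketch that counts code points in a UTF-8 std::string. The function name utf8_code_points is made up for illustration, and the sketch assumes the input is already valid UTF-8. It also counts code points, not user-perceived characters (combining marks are counted separately).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx), so it is
// enough to count those. Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string such as "na\xC3\xAFve" (the word "naive" with a diaeresis on the i), size() reports 6 bytes while this function reports 5 code points, so any code that treats size() as a character count needs reviewing.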
Or is using UTF-16 a better idea? Or should we stick to wchar_t and write code to convert between wchar_t and, say, Unicode in places where we read/write to the disk or the network?
I realize this is dangerously close to asking for opinions - but we're nervous that we're overlooking something obvious, because there don't seem to be many Unicode string classes (for example), yet there's plenty of code for converting to and from Unicode encodings, as in boost::locale, iconv, utf-cpp and ICU.
Comments (4)
Always use a protocol defined to the byte when a file or network connection is involved. Do not rely on how a C++ compiler stores anything in memory. For Unicode text, this means choosing both an encoding and a byte order (okay, UTF-8 doesn't care about byte order). Even if the platforms you currently want to support have similar architectures, another popular platform with different behavior or even a new OS for one of your existing platforms will likely come along, and you'll be glad you wrote portable code.
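As a rough illustration of "defined to the byte", here is a sketch of one possible format: a 4-byte big-endian length followed by that many UTF-8 bytes. The format itself and the names write_string and read_string are assumptions made up for this example, not something taken from the question. The point is only that every byte is written explicitly, so neither sizeof(wchar_t) nor the host byte order leaks into the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wire format: 4-byte big-endian length, then UTF-8 bytes.
void write_string(std::vector<std::uint8_t>& out, const std::string& utf8) {
    if (utf8.size() > 0xFFFFFFFFu) {
        throw std::length_error("string too long for 32-bit length prefix");
    }
    const std::uint32_t n = static_cast<std::uint32_t>(utf8.size());
    // Emit the length one byte at a time, most significant byte first.
    out.push_back(static_cast<std::uint8_t>(n >> 24));
    out.push_back(static_cast<std::uint8_t>(n >> 16));
    out.push_back(static_cast<std::uint8_t>(n >> 8));
    out.push_back(static_cast<std::uint8_t>(n));
    out.insert(out.end(), utf8.begin(), utf8.end());
}

// Reads one string starting at pos and advances pos past it.
std::string read_string(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    if (pos > in.size() || in.size() - pos < 4) {
        throw std::runtime_error("truncated length prefix");
    }
    const std::uint32_t n = (static_cast<std::uint32_t>(in[pos]) << 24) |
                            (static_cast<std::uint32_t>(in[pos + 1]) << 16) |
                            (static_cast<std::uint32_t>(in[pos + 2]) << 8) |
                            static_cast<std::uint32_t>(in[pos + 3]);
    pos += 4;
    if (in.size() - pos < n) {
        throw std::runtime_error("truncated string payload");
    }
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    std::string s(first, first + static_cast<std::ptrdiff_t>(n));
    pos += n;
    return s;
}
```

The same bytes come out on Windows and on the Mac regardless of how either platform stores strings in memory. If the in-memory strings are wchar_t, they would need converting to UTF-8 before write_string and back after read_string.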
I tend to use UTF-8 as the internal representation. You only lose string length checking, which isn't really useful anyway. For Windows API conversion, I use Win32 conversion functions that I devised myself. Mac and Linux are, for the most part, UTF-8-aware as standard, so there is no need to convert anything there. A free bonus you get: std::string.
As a rule of thumb: UTF-16 for processing, UTF-8 for communication & storage.
Sure, any rule can be broken and this one is not carved in stone.
But you have to know when it is ok to break it.
For instance it might be a good idea to use something else if the environment you are using wants something else. But Mac OS X APIs use UTF-16, same as Windows. So UTF-16 makes more sense.
It is more straightforward to convert before you put/get things on the net (because you probably do it in 2-3 routines) than doing all the conversions to call OS APIs.
The type of application you develop also matters.
If it is something with very little text processing and very few calls to the system (something like an email server that mostly moves things around without changing them), then UTF-8 might be a good choice.
So, as much as you might hate this answer, "it depends".
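For a sense of how small the "2-3 routines" at the I/O boundary can be, here is a sketch of one of them: converting UTF-16 held in a std::u16string to UTF-8. The function name utf16_to_utf8 is made up for this example, and replacing unpaired surrogates with U+FFFD is one possible policy rather than the only reasonable one. A library such as ICU or utf-cpp would normally be preferable to hand-rolled code.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-16 to UTF-8. Unpaired surrogates are replaced with
// U+FFFD (REPLACEMENT CHARACTER); that policy is an assumption of
// this sketch.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }
        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Using char16_t rather than wchar_t here sidesteps the 16-bit versus 32-bit difference from the question, since char16_t holds UTF-16 code units on both platforms.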
ICU has a C++ string class, UnicodeString