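Polish characters in std::string
I have a problem. I'm writing an application for Linux in Polish (with Polish characters, of course), and I get 80 warnings when compiling. They are all either "warning: multi-character character constant" or "warning: case label value exceeds maximum value for type". I'm using std::string.
How do I replace the std::string class?
Please help. Thanks in advance. Regards.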

std::string does not define a particular encoding, so you can store any sequence of bytes in it. There are subtleties to be aware of:

1. .c_str() returns a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
2. char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does not necessarily hold a full character either, depending on UTF-16 normalization (with UTF-16, a character outside the Basic Multilingual Plane takes two code units).
3. .size() and .length() return the number of bytes, not the number of characters.

[edit] The warnings about case labels are related to issue (2): you are using a switch statement on multi-byte characters with type char, which cannot hold more than one byte. [/edit]

Therefore, you can use std::string in your application, provided that you respect these three rules. There are subtleties involving the STL, including std::find(), that are consequences of this. Because of normalization forms, you need more clever string matching algorithms to properly support Unicode.

However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.

These two issues are not addressed by any particular string class. You just need to convert every external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend placing your string literals in a message file to forget about issue (1) and only deal with issue (2).

I don't suggest using std::wstring on Linux because essentially all native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).

If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.

Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about that trick in the Python reference implementation, but the article was quite old. I'm not sure if that is still true.

On Linux you should use the multibyte string class provided by the framework you use. I'd recommend Glib::ustring, from the glibmm framework, which stores strings in UTF-8 encoding.

If your source files are in UTF-8, then using a multibyte string literal in code is as easy as:

But you cannot build a switch/case statement on multibyte characters using char. I'd recommend using a series of ifs. You can use Glibmm's gunichar, but it's not very readable (you can get the correct Unicode values for the characters from the table in the Wikipedia article on the Polish alphabet):

You can compile this using:

std::string is for ASCII strings. Since your Polish strings don't fit in, you should use std::wstring.