
Polish characters in std::string

Posted on 2024-10-03 14:56:36

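I have a problem. I'm writing an application in Polish (with Polish characters, of course) for Linux, and I get 80 warnings when compiling. They are all either "warning: multi-character character constant" or "warning: case label value exceeds maximum value for type". I'm using std::string.

How do I replace the std::string class?

Please help. Thanks in advance. Regards.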


月依秋水 2024-10-10 14:56:36


std::string does not define a particular encoding. You can thus store any sequence of bytes in it. There are subtleties to be aware of:

1. .c_str() will return a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
2. A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does not necessarily hold a full character either: where wchar_t is 16 bits wide and holds UTF-16 (as on Windows), characters outside the Basic Multilingual Plane take two code units, and a base letter followed by combining marks spans several code units on any platform.
3. .size() and .length() will return the number of bytes, not the number of characters.
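The first and third points above can be seen in a few lines. This is only a minimal sketch, assuming the string holds valid UTF-8; the helper name count_code_points is made up for illustration, and the Polish text is written as hex escapes so the result does not depend on the source-file encoding.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to contain valid UTF-8. Continuation bytes look like 10xxxxxx,
// so every byte that does not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// "żółw" (turtle) as UTF-8 bytes: ż, ó and ł take two bytes each.
const std::string turtle = "\xC5\xBC\xC3\xB3\xC5\x82w";

// A string with an embedded null byte; the explicit length keeps all 3 bytes.
const std::string with_null("a\0b", 3);
```

Here turtle.size() is 7 while count_code_points(turtle) is 4, and std::strlen(with_null.c_str()) sees only 1 of the 3 bytes stored in with_null.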

[edit] The warnings about case labels are related to issue (2). You are using a switch statement on a value of type char with multi-byte characters as case labels, and a char cannot hold more than one byte.[/edit]
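In a UTF-8 source file, a literal such as 'ą' is two bytes, so the compiler treats it as a multi-character constant whose value does not fit in a char. One way around this, sketched below under the assumption that the input is valid UTF-8 and that C++11 is available, is to decode the bytes into code points and switch on those instead. Both function names are hypothetical, and the decoder does no error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point that starts at utf8[pos] and
// advances pos past it. Assumes valid UTF-8; malformed input is not detected.
char32_t next_code_point(const std::string& utf8, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(utf8[pos++]);
    int extra = 0;        // number of continuation bytes that follow
    char32_t cp = lead;   // plain ASCII is returned unchanged
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    while (extra-- > 0 && pos < utf8.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(utf8[pos++]) & 0x3F);
    }
    return cp;
}

// Counts lowercase Polish letters that carry a diacritic.
std::size_t count_polish_diacritics(const std::string& utf8) {
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        // The switch operates on a full code point, not on a single byte.
        switch (next_code_point(utf8, pos)) {
            case 0x0105: // ą
            case 0x0107: // ć
            case 0x0119: // ę
            case 0x0142: // ł
            case 0x0144: // ń
            case 0x00F3: // ó
            case 0x015B: // ś
            case 0x017A: // ź
            case 0x017C: // ż
                ++count;
                break;
            default:
                break;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "żółw" ("\xC5\xBC\xC3\xB3\xC5\x82w"), count_polish_diacritics returns 3.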

Therefore, you can use std::string in your application, provided that you respect these three rules. As a consequence, there are subtleties involving the STL, including std::find(), which compares bytes rather than characters. Because of Unicode normalization forms, the same text can be encoded as different byte sequences, so you need a smarter string matching algorithm to support Unicode properly.
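The normalization problem can be illustrated with a short sketch. The letter "ó" can be stored precomposed (U+00F3) or as "o" followed by a combining acute accent (U+0301); both render the same, but a byte-wise search for one does not find the other. The helper name contains_bytes and the sample word are chosen only for illustration.

```cpp
#include <string>

// True if needle occurs in haystack as an exact byte sequence.
// This is all that std::string::find can check.
bool contains_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// "ó" precomposed (NFC): U+00F3, UTF-8 bytes C3 B3.
const std::string o_acute_nfc = "\xC3\xB3";

// "góra" (mountain) with a precomposed "ó" ...
const std::string gora_nfc = "g\xC3\xB3ra";
// ... and with "o" + U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes CC 81).
const std::string gora_nfd = "go\xCC\x81ra";
```

contains_bytes(gora_nfc, o_acute_nfc) is true, but contains_bytes(gora_nfd, o_acute_nfc) is false even though both strings display as "góra". Normalizing both sides to the same form with a Unicode library (ICU, for example) before searching avoids this.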

However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, treat anything outside [0, 128) as non-ASCII), you need to be aware of encodings in different sources of textual data.

1. The source-file encoding might not be specified, and might be subject to change using compiler options. Any string literal will be subject to this rule. I guess this is why you are getting warnings.
2. You will get a variety of character encodings from external sources (files, user input, etc.). When that source specifies the encoding, or you can get it from some external source (e.g. by asking the user who imports the data), this is easier. A lot of (newer) internet protocols impose ASCII or UTF-8 unless otherwise specified.

These two issues are not addressed by any particular string class. You just need to convert every external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend placing your string literals in a message file so that you can forget about issue (1) and only deal with issue (2).
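Converting an external source to UTF-8 could look roughly like the following sketch, which uses the POSIX iconv interface that glibc provides on Linux. The wrapper name to_utf8 is hypothetical, the source encoding has to be known by the caller, and the code assumes a stateless source encoding such as ISO-8859-2 (a stateful one would need an extra flush call).

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: converts input from from_encoding to UTF-8.
// Throws if the encoding is unknown or the input is not valid in it.
std::string to_utf8(const std::string& input, const char* from_encoding) {
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("unsupported source encoding");
    }

    std::string output;
    char buffer[256];
    // iconv takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        const int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the scratch buffer filled up: loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("input is not valid in the source encoding");
        }
    }

    iconv_close(cd);
    return output;
}
```

For example, to_utf8("\xBF\xF3\xB3w", "ISO-8859-2") turns the four ISO-8859-2 bytes of "żółw" into the seven UTF-8 bytes "\xC5\xBC\xC3\xB3\xC5\x82w". After this single conversion at the boundary, the rest of the program can treat every std::string as UTF-8.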

I don't suggest using std::wstring on Linux because 100% of native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).

If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.

Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about that trick in the Python reference implementation, but the article was quite old. I'm not sure if that is true anymore.


幻梦 2024-10-10 14:56:36


On Linux you should use the multibyte string class provided by the framework you use.

I'd recommend Glib::ustring, from the glibmm framework, which stores strings in UTF-8 encoding.
If your source files are in UTF-8, then using a multibyte string literal in code is as easy as:

Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");

But you cannot build a switch/case statement on multibyte characters using char. I'd recommend using a series of ifs. You can use Glibmm's gunichar, but it's not very readable (you can get the correct Unicode values for the characters from the table in the Wikipedia article on the Polish alphabet):

#include <glibmm.h>
#include <iostream>

int main()
{
        // The source file must be saved as UTF-8 for this literal to be valid.
        const Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
        int small_polish_vowels_with_diacritics_count = 0;
        // Iterating yields one gunichar (Unicode code point) per character,
        // so the case labels below compare whole characters, not bytes.
        for (gunichar ch : alphabet) {
                switch (ch) {
                        case 0x0105: // ą
                        case 0x0119: // ę
                        case 0x00f3: // ó
                                small_polish_vowels_with_diacritics_count++;
                                break;
                        default:
                                break;
                }
        }
        std::cout << "There are " << small_polish_vowels_with_diacritics_count
                << " small Polish vowels with diacritics in this string.\n";
        return 0;
}

You can compile this using (the libraries are listed after the source file so that the linker resolves them in order):

g++ progname.cc -o progname `pkg-config --cflags --libs glibmm-2.4`


好听的两个字的网名 2024-10-10 14:56:36

std::string is for ASCII strings. Since your Polish strings don't fit, you should use std::wstring.
