How to print a wstring on Linux/OS X?
How can I print a string like this: €áa¢cée£ on the console/screen? I tried this:
#include <iostream>
#include <string>
using namespace std;
wstring wStr = L"€áa¢cée£";
int main (void)
{
    wcout << wStr << " : " << wStr.length() << endl;
    return 0;
}
This does not work. Even more confusingly, if I remove € from the string, the output looks like this: ?a?c?e? : 7
But with € in the string, nothing gets printed after the € character.
If I write the same code in Python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
wStr = u"€áa¢cée£"
print u"%s" % wStr
it prints the string correctly on the very same console. What am I missing in C++ (well, I'm just a noob)? Cheers!!
Update 1: based on n.m.'s suggestion
#include <iostream>
#include <string>
using namespace std;
string wStr = "€áa¢cée£";
char *pStr = 0;
int main (void)
{
    cout << wStr << " : " << wStr.length() << endl;
    pStr = &wStr[0];
    for (unsigned int i = 0; i < wStr.length(); i++) {
        cout << "char " << i+1 << " # " << *pStr << " => " << pStr << endl;
        pStr++;
    }
    return 0;
}
First of all, it reports 14 as the length of the string: €áa¢cée£ : 14
Is it because it's counting 2 bytes per character?
And all I get is this:
char 1 # ? => €áa¢cée£
char 2 # ? => ??áa¢cée£
char 3 # ? => ?áa¢cée£
char 4 # ? => áa¢cée£
char 5 # ? => ?a¢cée£
char 6 # a => a¢cée£
char 7 # ? => ¢cée£
char 8 # ? => ?cée£
char 9 # c => cée£
char 10 # ? => ée£
char 11 # ? => ?e£
char 12 # e => e£
char 13 # ? => £
char 14 # ? => ?
as the output of the cout loop. So the actual problem still remains, I believe. Cheers!!
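A note on the length, assuming the source file is saved as UTF-8 (which the output above suggests): std::string::length() counts bytes, not characters, and UTF-8 does not use a fixed 2 bytes per character. Here € takes 3 bytes, each of á, ¢, é and £ takes 2, and a, c and e take 1 each, which adds up to 14. Printing *pStr emits a single byte, so a byte that is only part of a multi-byte sequence shows up as ?. The sketch below counts code points instead of bytes; count_code_points is a hypothetical helper name, and it assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumption: the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // lead byte or plain ASCII byte
            ++count;
        }
    }
    return count;
}
```

For the 14-byte UTF-8 encoding of €áa¢cée£ this returns 8, while length() on the same string returns 14.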
Update 2: based on n.m.'s second suggestion
#include <iostream>
#include <string>
using namespace std;
wchar_t wStr[] = L"€áa¢cée£";
int iStr = sizeof(wStr) / sizeof(wStr[0]); // length of the string
wchar_t *pStr = 0;
int main (void)
{
    setlocale (LC_ALL,"");
    wcout << wStr << " : " << iStr << endl;
    pStr = &wStr[0];
    for (int i = 0; i < iStr; i++) {
        wcout << *pStr << " => " << static_cast<void*>(pStr) << " => " << pStr << endl;
        pStr++;
    }
    return 0;
}
And this is what I get as my result:
€áa¢cée£ : 9
€ => 0x1000010e8 => €áa¢cée£
á => 0x1000010ec => áa¢cée£
a => 0x1000010f0 => a¢cée£
¢ => 0x1000010f4 => ¢cée£
c => 0x1000010f8 => cée£
é => 0x1000010fc => ée£
e => 0x100001100 => e£
£ => 0x100001104 => £
=> 0x100001108 =>
Why is the length reported as 9 rather than 8? Or is this what I should expect? Cheers!!
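On the 9 versus 8 question: this is expected. sizeof(wStr) / sizeof(wStr[0]) gives the number of elements in the array, and an array initialised from a string literal also holds the terminating L'\0'. That is 8 characters plus 1 terminator, and the loop visits the terminator too, which explains the empty last line of output. A minimal sketch of the difference, using two hypothetical helper names (array_extent and text_length):

```cpp
#include <cstddef>
#include <cwchar>

// Number of elements in a wchar_t array. For an array initialised from a
// string literal this includes the terminating L'\0'.
template <std::size_t N>
std::size_t array_extent(const wchar_t (&)[N]) {
    return N;
}

// Number of wide characters before the terminating L'\0'.
std::size_t text_length(const wchar_t* text) {
    return std::wcslen(text);
}
```

For an array such as wStr above, array_extent reports one more than text_length; subtracting 1 from the sizeof-based count, or using wcslen, gives the character count.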
Comments (1)
Drop the L before the string literal. Use std::string, not std::wstring.
UPD: There's a better (correct) solution. Keep wchar_t, wstring and the L, and call setlocale(LC_ALL,"") at the beginning of your program.
You should call setlocale(LC_ALL,"") at the beginning of your program anyway. This instructs your program to work with your environment's locale instead of the default "C" locale. Your environment has a UTF-8 one, so everything should work.
Without calling setlocale(LC_ALL,""), the program works with UTF-8 sequences without "realizing" that they are UTF-8. If a correct UTF-8 sequence is printed on the terminal, it will be interpreted as UTF-8 and everything will look fine. That's what happens if you use string and char: gcc uses UTF-8 as a default encoding for strings, and the ostream happily prints them without applying any conversion. It thinks it has a sequence of ASCII characters.
But when you use wchar_t, everything breaks: gcc uses UTF-32, the correct re-encoding is not applied (because the locale is "C") and the output is garbage.
When you call setlocale(LC_ALL,""), the program knows it should recode UTF-32 to UTF-8, and everything is fine and dandy again.
This all assumes that we only ever want to work with UTF-8. Using arbitrary locales and encodings is beyond the scope of this answer.
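A minimal sketch of the recommended setup, assuming a UTF-8 locale in the environment (for example LANG=en_US.UTF-8) and a source file saved as UTF-8. The helper names use_environment_locale and print_wide are made up for illustration; the only calls taken from the answer are setlocale(LC_ALL, "") and output through wcout.

```cpp
#include <clocale>
#include <iostream>
#include <string>

// Hypothetical helper: switch from the default "C" locale to the one named
// by the environment (LANG, LC_ALL, ...). Call it once, before the first
// output on std::wcout.
// Returns false if the environment names a locale the system cannot honour.
bool use_environment_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: print a wide string and its length in wchar_t units.
// Returns false if the stream is in a failed state afterwards, for example
// because a character could not be converted to the locale's encoding.
bool print_wide(const std::wstring& text) {
    std::wcout << text << L" : " << text.length() << std::endl;
    return std::wcout.good();
}
```

Call use_environment_locale() at the start of main, before anything is written to wcout, and then print_wide(L"€áa¢cée£"). Under the assumptions above this is expected to print the string followed by : 8. If the locale is still "C", the conversion may fail, in which case print_wide returns false.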