C++: std::string problem
I have this simple code:

    #include <fstream>
    #include <iostream>
    #include <string>
    using namespace std;

    int main()
    {
        ifstream in("file.txt");
        string line;
        while (getline(in, line))
        {
            if (line.empty())
                continue; // at(0) throws std::out_of_range on an empty line
            cout << line << " starts with char: " << line.at(0) << " " << (int) line.at(0) << endl;
        }
        in.close();
        return 0;
    }

which prints:
0.000000 0.000000 0.010909 0.200000 starts with char: 32
A 0.023636 0.000000 0.014545 0.200000 starts with char: A 65
B 0.050909 0.000000 0.014545 0.200000 starts with char: B 66
C 0.078182 0.000000 0.014545 0.200000 starts with char: C 67
...
, 0.152727 0.400000 0.003636 0.200000 starts with char: , 44
< 0.169091 0.400000 0.005455 0.200000 starts with char: < 60
. 0.187273 0.400000 0.003636 0.200000 starts with char: . 46
> 0.203636 0.400000 0.005455 0.200000 starts with char: > 62
/ 0.221818 0.400000 0.010909 0.200000 starts with char: / 47
? 0.245455 0.400000 0.009091 0.200000 starts with char: ? 63
¡ 0.267273 0.400000 0.005455 0.200000 starts with char: � -62
£ 0.285455 0.400000 0.012727 0.200000 starts with char: � -62
¥ 0.310909 0.400000 0.012727 0.200000 starts with char: � -62
§ 0.336364 0.400000 0.009091 0.200000 starts with char: � -62
© 0.358182 0.400000 0.016364 0.200000 starts with char: � -62
® 0.387273 0.400000 0.018182 0.200000 starts with char: � -62
¿ 0.418182 0.400000 0.009091 0.200000 starts with char: � -62
À 0.440000 0.400000 0.012727 0.200000 starts with char: � -61
Á 0.465455 0.400000 0.014545 0.200000 starts with char: � -61
Strange... How can I really get the first character of the string?
Thanks in advance!
Comments (4)
You are getting the first char in the string.
But it looks like the string is UTF-8 encoded (or possibly in some other multibyte character format).
This means each symbol (glyph) that is printed is made up of one or more chars.
If it is UTF-8, then any character outside the ASCII range (0-127) is actually made up of two or more chars, and the string printing code interprets this correctly. But the code that prints a single char cannot correctly decode one byte of a multibyte sequence. Every such byte is greater than 127, which is also why the cast to int shows negative values such as -62 on a platform where char is signed.
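To make the -62 less mysterious, here is a minimal sketch that dumps the raw bytes of a string as numbers. The helper name `byte_values` is made up for this illustration; the only assumption is that the input is the UTF-8 text from the question.

```cpp
#include <string>
#include <vector>

// Returns the numeric value (0-255) of every byte in s.
// Casting through unsigned char avoids the negative values that a signed
// char yields for bytes above 127 (for example 0xC2 showing up as -62).
std::vector<int> byte_values(const std::string& s)
{
    std::vector<int> out;
    for (char c : s) {
        out.push_back(static_cast<unsigned char>(c));
    }
    return out;
}
```

For the two-byte UTF-8 encoding of ¡ (bytes 0xC2 0xA1) this gives 194 and 161, and 194 - 256 = -62 is exactly the value printed in the question.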
Personally, I think variable-width character formats are not a good idea inside a program (they are fine for transport and storage) because they make string manipulation much more complex. I would recommend converting the string to a fixed-width format for internal processing and converting it back to UTF-8 for storage.
I would use UTF-16 (or UTF-32, depending on what wchar_t is) internally. Yes, I know that technically UTF-16 is not fixed width, but in all reasonable teaching circumstances it behaves as if it were; once you need characters outside the Basic Multilingual Plane you may need UTF-32. You just need to imbue the input/output stream with the appropriate codecvt facet for the automatic translation. Internally, the text can then be manipulated as single characters of type wchar_t.
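As a rough illustration of the "convert to fixed width for internal processing" idea, the sketch below decodes UTF-8 into a `std::u32string` by hand. Note that this is not the codecvt facet approach described above (the `<codecvt>` helpers are deprecated since C++17); it is a swapped-in manual decoder, and the function name `utf8_to_utf32` is an assumption of this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes a UTF-8 byte string into UTF-32, one char32_t per code point.
// Throws std::runtime_error on truncated or malformed sequences.
// Simplification: overlong encodings and surrogate values are not rejected.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After the conversion, element 0 of the result is the whole first code point (U+00A1 for a line starting with ¡), so indexing works the way the question expected `line.at(0)` to work.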
The file is UTF-8 encoded. Use a Unicode library such as ICU to get access to the code points.
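If pulling in ICU is not an option, a minimal standard-library sketch can at least return the bytes of the first code point, so that it prints correctly. This is not ICU code; the names `utf8_sequence_length` and `first_utf8_char` are hypothetical, and the sketch assumes the input is valid UTF-8. A code point is also not always a full user-perceived character (combining marks are separate code points), which is one reason a real Unicode library is the more robust choice.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence announced by the lead byte.
// Falls back to 1 for bytes that cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Returns the bytes of the first code point of line, or "" if it is empty.
std::string first_utf8_char(const std::string& line)
{
    if (line.empty()) return std::string();
    const std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(line[0]));
    return line.substr(0, len); // substr clamps if the line is shorter
}
```

In the loop from the question, printing `first_utf8_char(line)` instead of `line.at(0)` should then show ¡, £, À and so on, provided the terminal itself expects UTF-8.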
I think the last characters belong to the extended ASCII table, which C++ does not support.
ASCII Table
Edit 1: No, from a quick look the characters at the bottom do not appear to be in extended ASCII either. Maybe check what Martin York said.
string is a container for char, which is only one byte. It should only be used for ASCII strings or binary data.
Anything else should use Unicode via wstring, a container for wchar_t.
But the problem of how your Unicode text is encoded still exists; for that, see the answers above.
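To illustrate the "container for char" point: `size()` on a std::string holding UTF-8 counts bytes, not characters. A small sketch, with the helper name `utf8_code_point_count` invented for this example and valid UTF-8 input assumed:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
std::size_t utf8_code_point_count(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a string holding ¡ followed by A, `size()` reports 3 while this function reports 2.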