C++: std::string problem

Posted on 2024-09-14 16:47:46

I have this simple code:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
        cout << line << "    starts with char: " << line.at(0) << " " << (int) line.at(0) << endl;
    }
    in.close();
    return 0;
}

which prints:

  0.000000 0.000000 0.010909 0.200000    starts with char:   32
A 0.023636 0.000000 0.014545 0.200000    starts with char: A 65
B 0.050909 0.000000 0.014545 0.200000    starts with char: B 66
C 0.078182 0.000000 0.014545 0.200000    starts with char: C 67

...

, 0.152727 0.400000 0.003636 0.200000    starts with char: , 44
< 0.169091 0.400000 0.005455 0.200000    starts with char: < 60
. 0.187273 0.400000 0.003636 0.200000    starts with char: . 46
> 0.203636 0.400000 0.005455 0.200000    starts with char: > 62
/ 0.221818 0.400000 0.010909 0.200000    starts with char: / 47
? 0.245455 0.400000 0.009091 0.200000    starts with char: ? 63
¡ 0.267273 0.400000 0.005455 0.200000    starts with char: � -62
£ 0.285455 0.400000 0.012727 0.200000    starts with char: � -62
¥ 0.310909 0.400000 0.012727 0.200000    starts with char: � -62
§ 0.336364 0.400000 0.009091 0.200000    starts with char: � -62
© 0.358182 0.400000 0.016364 0.200000    starts with char: � -62
® 0.387273 0.400000 0.018182 0.200000    starts with char: � -62
¿ 0.418182 0.400000 0.009091 0.200000    starts with char: � -62
À 0.440000 0.400000 0.012727 0.200000    starts with char: � -61
Á 0.465455 0.400000 0.014545 0.200000    starts with char: � -61

Strange... How can I actually get the first character of the string?

Thanks in advance!

陌生 2024-09-21 16:47:46

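You are getting the first character in the string.

But the string looks like it is UTF-8 encoded (or possibly some other multi-byte character format).

This means each symbol (glyph) the OS prints is made up of one or more char values (bytes).
In UTF-8, any character outside the ASCII range (0-127) is encoded as two or more bytes, and the string-printing code interprets that sequence correctly. The character-printing code, however, is handed a single byte with a value above 127, which it cannot decode on its own. Where char is signed, that byte also becomes a negative number when cast to int, hence the -62 and -61 in your output.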
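To make the byte layout concrete, here is a minimal sketch that uses only the standard library and assumes the input really is UTF-8. The helper names `utf8_sequence_length` and `first_utf8_char` are made up for this illustration. The lead byte alone says how many bytes the character occupies, so the first character can be cut out as a substring. The sketch does not validate the continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;
}

// Returns the bytes that make up the first character of `line`.
// Falls back to a single byte when the lead byte is invalid or the
// line is too short; continuation bytes are NOT checked here.
std::string first_utf8_char(const std::string& line) {
    if (line.empty()) return std::string();
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(line[0]));
    if (n == 0 || n > line.size()) n = 1;
    assert(n >= 1 && n <= line.size());
    return line.substr(0, n);
}
```

In the question's loop, printing `first_utf8_char(line)` instead of `line.at(0)` would emit both bytes of a character such as ¡ (0xC2 0xA1), and casting the byte to `unsigned char` before the `int` cast would print 194 instead of -62.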

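Personally, I think variable-width character formats are a bad idea inside a program (they are fine for transport and storage) because they make string manipulation much more complex. I recommend converting the string to a fixed-width format for internal processing and converting it back to UTF-8 for storage.

Personally, I would use UTF-16 (or UTF-32, depending on what wchar_t is) internally. Yes, I know UTF-16 is technically not fixed-width, but in all reasonable teaching situations it can be treated as fixed-width; once characters outside the Basic Multilingual Plane are involved, UTF-32 may be needed. You only need to imbue the input/output streams with the appropriate codecvt facet to get automatic translation. Internally, the code can then manipulate single characters using the wchar_t type.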
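As a rough illustration of the "fixed width inside, UTF-8 outside" idea, the sketch below converts by hand to `std::u32string` (one `char32_t` per code point) and back. This is a different technique from the codecvt facet mentioned above, chosen because it avoids locale setup. The function names `utf8_to_utf32` and `utf32_to_utf8` are invented for this example. The decoder is simplified: malformed bytes become U+FFFD, and overlong forms and surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into one char32_t per code point.
// Simplified: malformed input yields U+FFFD instead of an error.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes still expected
        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or missing continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// Encode code points back to UTF-8 for storage or output.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // must lie inside the Unicode range
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

After `utf8_to_utf32(line)`, element `[0]` is the whole first code point (for example 0xC0 for À), so indexing behaves the way the question expected `line.at(0)` to behave.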


夜唯美灬不弃 2024-09-21 16:47:46

The file is UTF-8 encoded. Use a Unicode library such as ICU to get access to the code points:

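#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>

#include "unicode/utf.h"

using namespace std;

// Decodes the first UTF-8 code point of str.
// Returns (code point, number of bytes consumed); (0, 0) for an empty string.
pair<UChar32, int32_t>
getFirstUTF8CodePoint(const string& str) {
  if (str.empty()) return make_pair(UChar32(0), int32_t(0));
  const uint8_t* ptr = reinterpret_cast<const uint8_t*>(str.data());
  const int32_t length = static_cast<int32_t>(str.length());
  int32_t offset = 0;
  UChar32 cp = 0;
  // U8_NEXT advances offset past the sequence; cp is negative on malformed input.
  U8_NEXT(ptr, offset, length, cp);
  return make_pair(cp, offset);
}

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
      const pair<UChar32, int32_t> cp = getFirstUTF8CodePoint(line);
      // substr(0, cp.second) yields every byte of the first character.
      cout << line << "    starts with char: " << line.substr(0, cp.second) << " " << static_cast<unsigned long>(cp.first) << endl;
    }
    in.close();
    return 0;
}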
