How to convert (char *) from ISO-8859-1 to UTF-8 in C++, multiplatform?

Posted on 2024-10-30 18:30:09

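I'm changing a piece of C++ software that processes text in ISO Latin 1 format so that it stores its data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8.

I'd like a way to convert ISO Latin 1 characters to UTF-8 characters before storing them in the database. I need it to work on Windows and Mac.

I've heard that ICU would do this, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these two character sets.

How would I go about doing this?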

潜移默化 2024-11-06 18:30:09

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

For each char (append() stands for whatever adds one byte to your output):

uint8_t ch = code_point; /* code points above 0xff are impossible, since Latin-1 is 8-bit */

if (ch < 0x80) {
    append(ch);                 /* ASCII range: one byte, unchanged */
} else {
    append(0xc0 | (ch >> 6));   /* first byte: 0xc2 or 0xc3, since the input is only 8 bits wide */
    append(0x80 | (ch & 0x3f)); /* continuation byte: the low six bits */
}

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
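The question also asks for the way back. Because Latin-1 occupies exactly the first 256 code points, only UTF-8 lead bytes 0xC2 and 0xC3 can decode to a Latin-1 character, so the reverse direction can be sketched as below. This is a minimal sketch, not a full UTF-8 validator: the function name utf8_to_iso_8859_1 is made up for illustration, and throwing on input that does not fit is an assumption (substituting '?' would be another reasonable choice).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 back to ISO-8859-1.
// Throws std::invalid_argument if the input is malformed or contains a
// code point above U+00FF, which Latin-1 cannot represent.
std::string utf8_to_iso_8859_1(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        if (b0 < 0x80) {
            // ASCII: one byte, copied unchanged.
            out.push_back(static_cast<char>(b0));
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            // Two-byte sequence covering U+0080..U+00FF.
            const std::uint8_t b1 = static_cast<std::uint8_t>(in[++i]);
            if ((b1 & 0xC0) != 0x80)
                throw std::invalid_argument("malformed UTF-8 continuation byte");
            out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
        } else {
            throw std::invalid_argument(
                "input is malformed or not representable in ISO-8859-1");
        }
    }
    return out;
}
```

For example, the UTF-8 bytes 0xC3 0xA9 ("é") come back as the single Latin-1 byte 0xE9, while a three-byte sequence such as the euro sign is rejected.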

就此别过 2024-11-06 18:30:09

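For C++ I use this:

#include <cstdint>
#include <string>

// Converts an ISO-8859-1 string to UTF-8; the input is not modified.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    strOut.reserve(str.size());  // at least one output byte per input byte
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        // Go through an unsigned type so bytes >= 0x80 are not seen as negative.
        std::uint8_t ch = static_cast<std::uint8_t>(*it);
        if (ch < 0x80) {
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}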

猫瑾少女 2024-11-06 18:30:09

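If a general-purpose charset framework (like iconv) is too bloated for you, roll your own.

Write a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or something else), it will look somewhat different, but the idea is this: scroll through the source string and replace each character with a code above 127 with its UTF-8 counterpart string. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: the first pass determines the necessary target string size, and the second performs the translation.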
