How do I convert a (char *) from ISO-8859-1 to UTF-8 in C++, cross-platform?
I'm changing a piece of C++ software that processes text in ISO Latin 1 format so that it stores its data in an SQLite database.
The problem is that SQLite works in UTF-8, and the Java modules that use the same database also work in UTF-8.
I want a way to convert the ISO Latin 1 characters to UTF-8 before storing them in the database. It needs to work on Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion routine (preferably in both directions) for these two charsets.
How would I do that?
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode, so the conversion is pretty simple.
For each char, treated as an unsigned byte: a value below 0x80 is copied unchanged; any other value becomes two bytes, 0xC0 | (ch >> 6) followed by 0x80 | (ch & 0x3F).
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
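As a minimal sketch of the per-character step described above, assuming a NUL-terminated input and a caller-supplied output buffer of at least 2 * strlen(src) + 1 bytes. The function name latin1_to_utf8 and this buffer convention are illustrative assumptions, not part of the answer:

```cpp
#include <cassert>
#include <cstddef>

// Converts a NUL-terminated ISO-8859-1 string to UTF-8.
// dst must have room for 2 * strlen(src) + 1 bytes (worst case: every
// input byte is >= 0x80 and expands to two bytes).
// Returns the number of bytes written, not counting the terminating NUL.
std::size_t latin1_to_utf8(const char* src, char* dst) {
    assert(src != nullptr && dst != nullptr);
    char* out = dst;
    for (; *src != '\0'; ++src) {
        // Go through unsigned char so bytes >= 0x80 are not seen as negative.
        const unsigned char ch = static_cast<unsigned char>(*src);
        if (ch < 0x80) {
            *out++ = static_cast<char>(ch);                  // ASCII: copy as-is
        } else {
            *out++ = static_cast<char>(0xC0 | (ch >> 6));    // lead byte: 0xC2 or 0xC3
            *out++ = static_cast<char>(0x80 | (ch & 0x3F));  // continuation byte: 10xxxxxx
        }
    }
    *out = '\0';
    return static_cast<std::size_t>(out - dst);
}
```

The cast to unsigned char matters: on platforms where plain char is signed, bytes from 0x80 upward would otherwise compare as negative and wrongly take the ASCII branch.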
For C++ I use this:
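The snippet this answer refers to is not shown here. As a stand-in, a std::string version might look like the sketch below, which also covers the way back since the question asks for both directions. The function names and the choice of '?' for anything that does not fit in ISO Latin 1 are assumptions made for illustration, not the answerer's code:

```cpp
#include <cstddef>
#include <string>

// ISO-8859-1 -> UTF-8: every input byte is a code point in U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte doubles
    for (unsigned char ch : in) {
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else {
            out.push_back(static_cast<char>(0xC0 | (ch >> 6)));
            out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
        }
    }
    return out;
}

// UTF-8 -> ISO-8859-1. Code points above U+00FF and malformed input
// are replaced by '?' (an assumption; pick whatever policy suits you).
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(in[i]);
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else if ((ch == 0xC2 || ch == 0xC3) && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequence encoding U+0080..U+00FF.
            const unsigned char next = static_cast<unsigned char>(in[++i]);
            out.push_back(static_cast<char>(((ch & 0x03) << 6) | (next & 0x3F)));
        } else {
            out.push_back('?');
            // Skip any continuation bytes that follow this byte.
            while (i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

Converting to UTF-8 right before the SQLite insert and back right after a read keeps the rest of the program in ISO Latin 1.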
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and build your own translation around it. Depending on what you use for string storage (char buffers, std::string, or something else) it will look somewhat different, but the idea is the same: walk through the source string and replace each character whose code is over 127 with its UTF-8 counterpart, which for ISO Latin 1 is always a two-byte sequence. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.
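The table-plus-two-passes idea could be sketched as follows. The table is filled once at first use instead of being typed out by hand, and the names Utf8Seq, table and latin1_to_utf8_two_pass are hypothetical, chosen only for this example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

namespace {

// One table entry: the UTF-8 bytes for a single ISO-8859-1 byte.
struct Utf8Seq {
    char bytes[2];
    unsigned char len;  // 1 for ASCII, 2 for 0x80..0xFF
};

// Fills the 256-entry table; it could equally be written out as a literal.
std::array<Utf8Seq, 256> make_table() {
    std::array<Utf8Seq, 256> t{};
    for (unsigned c = 0; c < 256; ++c) {
        if (c < 0x80) {
            t[c].bytes[0] = static_cast<char>(c);
            t[c].len = 1;
        } else {
            t[c].bytes[0] = static_cast<char>(0xC0 | (c >> 6));
            t[c].bytes[1] = static_cast<char>(0x80 | (c & 0x3F));
            t[c].len = 2;
        }
    }
    return t;
}

// Built on first call and reused afterwards.
const std::array<Utf8Seq, 256>& table() {
    static const std::array<Utf8Seq, 256> t = make_table();
    return t;
}

}  // namespace

std::string latin1_to_utf8_two_pass(const char* src, std::size_t n) {
    const std::array<Utf8Seq, 256>& t = table();

    // Pass one: compute the exact size of the result.
    std::size_t needed = 0;
    for (std::size_t i = 0; i < n; ++i) {
        needed += t[static_cast<unsigned char>(src[i])].len;
    }

    // Pass two: translate into a buffer of exactly that size.
    std::string out(needed, '\0');
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const Utf8Seq& s = t[static_cast<unsigned char>(src[i])];
        out[pos++] = s.bytes[0];
        if (s.len == 2) out[pos++] = s.bytes[1];
    }
    assert(pos == needed);
    return out;
}
```

For ISO Latin 1 the table is arguably overkill, since the two output bytes are cheap to compute, but the same shape carries over to charsets such as Windows-1252 where some entries need three bytes and cannot be derived from the byte value.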
If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function), but you may get a warning.
Of course, it needs to be real ISO Latin 1, not Windows CP 1252.
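A rough sketch of the widen-then-convert route follows. With UTF8-CPP the second step would be a call to utf8::utf16to8 with an input iterator range and an output iterator; to keep the example free of external dependencies, a small hand-written encoder stands in for it here. Both function names are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Step 1: widen. Every ISO-8859-1 byte equals its Unicode code point,
// so zero-extending each byte to 16 bits yields valid UTF-16.
std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char ch : in) {  // unsigned: no sign extension
        out.push_back(static_cast<char16_t>(ch));
    }
    return out;
}

// Step 2: stand-in for a library UTF-16 -> UTF-8 routine.
// Handles surrogate pairs; an unpaired surrogate becomes U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Widening through unsigned char is the important detail. Feeding plain signed chars straight into a UTF-16 routine would sign-extend bytes from 0x80 upward into values like 0xFFE9, which is one reason to prefer the explicit extra copy over the direct shortcut.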