Converting UTF-8 to ANSI in C++
I can't find an answer to this question anywhere.
How can I convert a string from UTF-8 to ANSI (extended ASCII) in C++?
Comments (3)
Generally, one uses libiconv, which is portable and runs on most platforms. As KerrekSB mentioned, you will get into deep trouble if you think of a character set as "extended ASCII": I'm sure there are at least a hundred character sets that could be called "extended ASCII", including UTF-8.
Also, make sure you know which encoding you want: ISO-8859-1 or CP1252. The Windows code page, CP1252, replaces most of the C1 control code range (0x80-0x9F) with additional printable characters, such as the euro sign and curly quotes.
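As a rough illustration of the libiconv approach, here is a minimal sketch built on the POSIX iconv interface (iconv_open, iconv, iconv_close). The helper name convert_encoding is made up for this example. It assumes the glibc-style prototype in which the input pointer is a char**, and it assumes the platform's iconv knows the encoding names passed in. How unrepresentable characters are reported varies between implementations: glibc and GNU libiconv fail with EILSEQ, while others may substitute a placeholder character instead.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion is unavailable or iconv
// reports an error other than "output buffer full".
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open: conversion not supported");
    }

    std::string output;
    char buffer[256];
    // iconv does not modify the input bytes, but its prototype takes char**.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save before anything else can overwrite it
        output.append(buffer, sizeof buffer - out_left);
        if (rc == (std::size_t)-1 && err != E2BIG) {
            // EILSEQ: invalid or unconvertible sequence; EINVAL: truncated input
            iconv_close(cd);
            throw std::runtime_error("iconv: cannot convert input");
        }
        // On E2BIG the buffer was full: loop again with a fresh buffer.
    }
    // Stateless targets such as ISO-8859-1 or CP1252 need no final flush call.
    iconv_close(cd);
    return output;
}
```

A call such as convert_encoding(text, "UTF-8", "ISO-8859-1") would then return the Latin-1 bytes; passing "CP1252" as the target selects the Windows code page instead, provided the local iconv supports that name.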
Windows only:
Assuming that by "ANSI" you really mean one of the ISO 8859 variants, we should start with a couple of points.
The first is that not every string can be converted from UTF-8 (or Unicode in general, regardless of the transformation used) into ISO 8859. Unicode has a unique code point for virtually every character in every language on earth.
ISO 8859 supports far fewer languages, and has a separate character set for each language it does support; the same codes represent different characters in different languages.
This means it's quite easy for a UTF-8 input string to contain characters that can't be represented in any ISO 8859 variant at all, and it's also easy for it to contain characters that require different ISO 8859 variants to represent.
The second is that even at best, the transformation may be quite non-trivial. If at all possible, you almost certainly want to use a library (e.g., libiconv) for this task. Just for example, Unicode has a...feature called "combining diacritical marks", which lets you encode something like an "A with acute accent" as either a single code point or two separate code points (one for the "A" and the other for the accent). To encode that in ISO 8859, you'll have to convert those all to one form (normally the pre-combined form).
Before you do any significant work with Unicode, you also normally want to convert the UTF-8 to UCS-4 (UTF-32), so that each code point occupies one fixed-size unit.
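So, the sequence would be something like:
1. Convert the input from UTF-8 to UCS-4.
2. Convert decomposed forms to their pre-combined forms.
3. Check for characters that can't be represented in the target character set.
4. Convert the remaining characters to the target ISO 8859 variant.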
Depending on the way you prefer to do things, you might combine 3 and 4 into a single step, converting characters as you go and, for example, throwing an exception if you encounter a character that can't be represented in the target character set.
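To make the sequence concrete, here is a minimal hand-rolled sketch that decodes UTF-8 to code points, composes a few accented letters, and then checks and encodes in one pass, throwing on anything the target cannot hold. All function names are invented for this example, and the target is assumed to be ISO-8859-1, whose byte values equal code points U+0000 through U+00FF. The composition step covers only the combining acute accent after four letters, purely for illustration; real normalization needs a full Unicode library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Step 1: decode UTF-8 into code points (UCS-4 / UTF-32).
// Sketch only: rejects malformed sequences, but does not check for
// overlong encodings or surrogate code points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= s.size()) throw std::runtime_error("truncated UTF-8");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Step 2: fold "base letter + combining mark" into a pre-combined character.
// Illustrative only: handles U+0301 (combining acute) after A, E, a, e.
std::u32string compose_acute(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x0301 && !out.empty()) {
            switch (out.back()) {
                case U'A': out.back() = 0x00C1; continue;
                case U'E': out.back() = 0x00C9; continue;
                case U'a': out.back() = 0x00E1; continue;
                case U'e': out.back() = 0x00E9; continue;
                default: break;  // unknown pair: keep the mark as-is
            }
        }
        out.push_back(cp);
    }
    return out;
}

// Steps 3 and 4 combined: check each character and encode it as we go.
std::string encode_latin1(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0xFF) {
            throw std::range_error("character not representable in ISO-8859-1");
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}

std::string utf8_to_latin1(const std::string& s) {
    return encode_latin1(compose_acute(decode_utf8(s)));
}
```

With this sketch, both the pre-combined and the decomposed spelling of "café" yield the same four Latin-1 bytes, while a string containing the euro sign (U+20AC) raises std::range_error, because that character exists in CP1252 but not in ISO-8859-1.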