How do ASCII, ISO 8859-1, and Unicode work in C?
I'm really confused about how C works with encodings. First, I have a C file, test1.c, saved with ISO 8859-1 encoding, with the content shown below. When I run the program, the character ÿ is not displayed correctly on the Linux console. I know that the console uses UTF-8 by default, but if UTF-8 uses the same 256 characters as ISO 8859-1, why doesn't the program display the 'ÿ' character correctly?

Another question: why does test2 display the 'ÿ' character correctly, when test2.c is a UTF-8 file and file.txt is also UTF-8? In other words, shouldn't the compiler complain that the character is multi-byte?

test1.c

    // ISO 8859-1
    #include <stdio.h>

    int main(void)
    {
        unsigned char c = 'ÿ';
        putchar(c);
        return 0;
    }

    $ gcc -o test1 test1.c
    $ ./test1
    $ ▒

test2.c

    // ASCII
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("file.txt", "r+");
        int c;
        while ((c = fgetc(fp)) != EOF)
            putchar(c);
        return 0;
    }

file.txt: UTF-8

    abcdefÿghi

    $ gcc -o test2 test2.c
    $ ./test2
    $ abcdefÿghi

Well, that's it. If you can help me with details about this I would be very grateful. :)
Character encodings can be confusing for many reasons. Here are some explanations:

In the ISO 8859-1 encoding, the character y with a diaeresis, ÿ (originally a ligature of i and j), is encoded as the byte value 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses at least 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.

When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.

Adding to the confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multi-character constant. Multi-character constants are very confusing and non-portable (the value is implementation-defined: it can be 0xC3BF or 0xBFC3 depending on the system). Using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).

Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.

On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX or the special negative value of EOF. So the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF or 255, which compares different from 'ÿ' if char is signed, and also different from 'ÿ' if the source code is in UTF-8.

As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents, and configure the compiler to make char unsigned by default (-funsigned-char).

If you deal with foreign languages, using UTF-8 is highly recommended for all textual contents, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding, it is quite simple and elegant, and use libraries to handle textual transformations such as uppercasing.
The issue here is that unsigned char represents an unsigned integer of size 8 bits (from 0 to 255). C represents characters as small integers, and on virtually every current system the basic characters have their ASCII values. An ASCII character is simply an integer from 0 to 127. For example, A is 65.

When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character; it is an extended character (with a value of 255 in ISO 8859-1). Technically, that value fits inside an unsigned char, but the C standard only guarantees the meaning of the '' syntax for characters of the basic character set; for anything else the value is implementation-defined. That is why the first example cannot be relied on.

Now for the second one. In UTF-8, a non-ASCII character does not fit into a single char. The way you handle characters outside the limited ASCII set is by using several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. If you are using the UTF-8 representation, this means that your file contains the two 8-bit numbers 0xC3 and 0xBF.

When you read your file in the while loop of test2.c, at some point c will take the value 0xC3 and then 0xBF on the next iteration. These two values are passed to putchar, and when displayed, the two values together are interpreted as ÿ.

When putchar finally writes the characters, they are eventually read by your terminal application. If it supports the UTF-8 encoding, it understands the meaning of 0xC3 followed by 0xBF and displays a ÿ.

So the reason why you didn't see ÿ in the first example is that the value of c in your code is a single byte (0xFF if the source file really is saved as ISO 8859-1), which on its own does not represent any character in UTF-8.

A more concrete example:

This will display ÿ, but as you can see, it takes 2 chars to do that.
"if utf-8 uses the same 256 characters as ISO 8859-1"

No, there is a confusion here. In ISO-8859-1 (aka Latin-1), the 256 characters do indeed have the code point values of the corresponding Unicode characters. But UTF-8 has a special encoding for all characters above 0x7f, and all characters with a code point between 0x80 and 0xff are represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xe9 in ISO-8859-1, but as the 2 bytes 0xc3 0xa9 in UTF-8.

More references on the Wikipedia page.
It's hard to reproduce on macOS with clang:

clang on macOS uses UTF-8 by default.

Encoded in UTF-8:

Debugging all warnings and errors, we get a solution with the correct string literal and an array of bytes:

Decimal 195 is hexadecimal C3, which is exactly the first byte of the UTF-8 byte sequence of the character ÿ:

Now we know that we must output 2 bytes and code:

Program test2.c just reads bytes and outputs them. If the input is UTF-8, then the output is also UTF-8. This just keeps the encoding.

To convert Latin-1 to UTF-8 we need to pack it in a special way. For two bytes of UTF-8 we need a leading byte 110x xxxx (the number of leading 1 bits is the length of the sequence in bytes) and a continuation byte 10xx xxxx.

We can code now:

This works only for ISO-8859-1 ("true" Latin-1). Many files called "Latin-1" are encoded in Windows/Microsoft CP1252.