How do ASCII, ISO 8859-1, and Unicode work in C?

Posted on 2025-01-11 20:55:40 · Word count 846 · Views 4 · Comments 0

I'm confused about how C works with encodings. First, I have a C file, test1.c, saved with ISO 8859-1 encoding (contents below). When I run the program, the character ÿ is not displayed correctly on the Linux console. I know the console uses UTF-8 by default, but if UTF-8 uses the same 256 characters as ISO 8859-1, why doesn't the program display the 'ÿ' character correctly? Another question: why does test2 display the 'ÿ' character correctly, when test2.c is UTF-8 and file.txt is also UTF-8? In other words, shouldn't the compiler complain about the character being multi-byte?

test1.c

  // ISO 8859-1
  #include <stdio.h>

  int main(void)
  {
    unsigned char c = 'ÿ';
    putchar(c);
    return 0;
  }

  $ gcc -o test1 test1.c
  $ ./test1
  $ ▒

test2.c

  // ASCII
  #include <stdio.h>

  int main(void) 
  {

     FILE *fp = fopen("file.txt", "r+");
     int c;

     while((c = fgetc(fp)) != EOF)
        putchar(c);
     return 0;
 }

file.txt: UTF-8
abcdefÿghi

  $ gcc -o test2 test2.c
  $ ./test2
  $ abcdefÿghi

Well, that's it. If you can help me with details about this, I would be very grateful. :)

自由范儿 2025-01-18 20:55:40

Character encodings can be confusing for many reasons. Here are some explanations:

In the ISO 8859-1 encoding, the character y with a diaeresis, ÿ (originally a ligature of i and j), is encoded as the byte value 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points from 128 to 2047, so ÿ is encoded in UTF-8 as 0xC3 0xBF.

When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.

Adding to the confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multi-character character constant. Multi-character constants are very confusing and non-portable (the value is implementation-defined and can be 0xC3BF or 0xBFC3 depending on the system). Using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).

Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.

On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX or the special negative value of EOF. The byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is therefore returned as the value 0xFF or 255, which compares different from 'ÿ' if char is signed, and also different from 'ÿ' if the source code is in UTF-8.

As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents and configure the compiler to make char unsigned by default (-funsigned-char).

If you deal with foreign languages, using UTF-8 is highly recommended for all textual contents, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding, it is quite simple and elegant, and use libraries to handle textual transformations such as uppercasing.

如梦初醒的夏天 2025-01-18 20:55:40

The issue here is that unsigned char represents an unsigned integer of 8 bits (from 0 to 255) on virtually all current systems. C does not mandate a character set, but most implementations use one that is compatible with ASCII, where a character is simply an integer from 0 to 127. For example, A is 65.

When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character; it belongs to the 8-bit extensions of ASCII (with a value of 255 in ISO 8859-1, or 152 in the old IBM code page 437). It fits inside an unsigned char, but the C standard only guarantees the characters of the basic character set inside ''. The value of any other character is implementation-defined and depends on the encoding of the source file and on how the compiler interprets it.

So that's why the first example is fragile.

Now for the second one. In UTF-8, a non-ASCII character cannot fit into a single char. The way you can handle characters outside the limited ASCII set is by using several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. If you are using the UTF-8 representation, this means that in your file you have two 8-bit numbers, 0xC3 and 0xBF.

When you read your file in the while loop of test2.c, at some point c will take the value 0xC3 and then 0xBF on the next iteration. These two values will be given to putchar. And then, when displayed, the two values together will be interpreted as ÿ.

When putchar finally writes the bytes, they are eventually read by your terminal application. If it supports UTF-8 encoding, it can understand the meaning of 0xC3 followed by 0xBF and display a ÿ.

So the reason why, in the first example, you didn't see ÿ is that the value of c in your code is actually 0xFF (the ISO 8859-1 byte taken from the source file), which on its own is not a valid UTF-8 sequence, so the terminal cannot display it as a character.

A more concrete example:

#include <stdio.h>

int main(void)
{
    /* the two bytes of the UTF-8 encoding of ÿ; the string literal adds the '\0' */
    const char y[] = "\xC3\xBF";
    printf("%s\n", y);
    return 0;
}

This will display ÿ but as you can see, it takes 2 chars to do that.

浴红衣 2025-01-18 20:55:40

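"If utf-8 uses the same 256 characters as ISO 8859-1": no, there is a confusion here. In ISO-8859-1 (a.k.a. Latin1), the 256 characters do have the code point values of the corresponding Unicode characters. But UTF-8 has a special encoding for all characters above 0x7f, and all characters with a code point between 0x80 and 0xff are represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xe9 in ISO-8859-1, but as the 2 bytes 0xc3 0xa9 in UTF-8.

See the Wikipedia page for more references.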

季末如歌 2025-01-18 20:55:40

This is hard to reproduce on macOS, where gcc is an alias for clang:

$ gcc -o test1 test1.c
test1.c:6:23: warning: illegal character encoding in character literal [-Winvalid-source-encoding]
    unsigned char c = '<FF>';
                      ^
1 warning generated.

$ ./test1
?

$ gcc -finput-charset=iso-8859-1 -o test1 test1.c
clang: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'

clang on macOS uses UTF-8 by default.

With the source file encoded in UTF-8:

$ gcc -o test1 test1.c
test1.c:6:23: error: character too large for enclosing character literal type
    unsigned char c = 'ÿ';
                      ^
1 error generated.

After working through all the warnings and errors, we get a version that uses a proper string literal and treats it as an array of bytes:

// UTF-8
  #include <stdio.h>

// needed for strlen
  #include <string.h>

  int main(void)
  {
    char c[] = "ÿ";
    size_t len = strlen(c);   // strlen returns size_t, printed with %zu
    printf("len: %zu c[0]: %u \n", len, (unsigned)(unsigned char)c[0]);

    putchar(c[0]);            // only the first of the two bytes
    return 0;
  }

$ ./test1
len: 2 c[0]: 195
?

Decimal 195 is hexadecimal C3, which is exactly the first byte of the UTF-8 byte sequence of the character ÿ:

$ uni identify ÿ
     cpoint  dec    utf-8       html       name
'ÿ'  U+00FF  255    c3 bf       &yuml;     LATIN SMALL LETTER Y WITH DIAERESIS (Lowercase_Letter)
                    ^^ <-- HERE

Now we know that we must output 2 bytes, so the code becomes:

    char c[] = "ÿ";
    size_t len = strlen(c);

    // write every byte of the UTF-8 sequence, not just the first
    for (size_t i = 0; i < len; i++) {
        putchar(c[i]);
    }
    printf("\n");

$ ./test1 
ÿ

Program test2.c just reads bytes and outputs them. If the input is UTF-8 then the output is also UTF-8. This just keeps the encoding.

To convert Latin-1 to UTF-8 we need to pack the bits in a special way. A two-byte UTF-8 sequence consists of a start byte 110x xxxx (the number of leading 1 bits is the length of the sequence in bytes) and a continuation byte 10xx xxxx.

We can now write the code:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
    uint8_t latin1 = 255; // code point of 'ÿ'  U+00FF  255

    // start byte: 110 followed by the top 2 bits of the code point
    uint8_t byte1 = 0xC0 | (latin1 >> 6);
    // continuation byte: 10 followed by the low 6 bits
    uint8_t byte2 = 0x80 | (latin1 & 0x3F);

    putchar(byte1);
    putchar(byte2);

    printf("\n");

    return 0;
  }

$ ./test1
ÿ

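This works only for ISO-8859-1 ("true" Latin-1), and only for values from 0x80 to 0xFF; values below 0x80 are plain ASCII and stay a single byte in UTF-8. Many files called "Latin-1" are actually encoded in Windows/Microsoft CP1252.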
