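Good resources for learning about different character encodings and converting between them

Posted 2024-08-04 01:13:42

One thing I have never truly understood is the concept of character encoding. The way encodings are handled in memory and in code often baffles me, because I just copy an example from the internet without really understanding what it does. I feel this is an important and often overlooked subject that more people (myself included) should take the time to get right.

I am looking for good, to-the-point resources for learning about the different types of character encodings and how to convert between them (preferably in C#). Books and online resources are both welcome.

Thanks.

Edit 1:

Thanks for the responses so far. I am specifically looking for more information on how .NET handles encodings. I know this may seem vague, but I don't really know what to ask. I suppose I'm curious how encodings are represented in the C# string class, and whether that class can manage different encoding types itself or whether there are separate classes for that.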

童话 2024-08-11 01:13:42

I'd start with this question: what is a character?

The logical identity: a codepoint. Unicode assigns a number to each character that isn't necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
The thing you see on the screen: a grapheme. A grapheme is built from one or more codepoints. This is the stuff at the presentation end of things.

This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.

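using System;
using System.IO;
using System.Text;

public class Enc {
  public static void Main(string[] args) {
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    Encoding win1252 = Encoding.GetEncoding(1252);
    Encoding utf8 = Encoding.UTF8;
    // The reader decodes windows-1252 bytes into UTF-16 chars.
    using (StreamReader reader = new StreamReader("in.txt", win1252))
    // The writer encodes UTF-16 chars as UTF-8 bytes (false = overwrite out.txt).
    using (StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
      char[] buffer = new char[1024];
      int r;
      // Read returns 0 only at end of stream, so loop on its result.
      while ((r = reader.Read(buffer, 0, buffer.Length)) > 0) {
        writer.Write(buffer, 0, r);
      }
    }
  }
}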

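Two transformations happen here. First, the StreamReader decodes the bytes from windows-1252 into UTF-16 code units in the char buffer (in memory these use the platform's native byte order, which is little endian on most machines). Then the StreamWriter encodes the buffer as UTF-8.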
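The same two steps can be made explicit on byte arrays, without any files. This is only a sketch: the class and method names (TwoStepDemo, Transcode) are made up for illustration, and ISO-8859-1 stands in for windows-1252 because it is available without registering a code page provider (the two agree on the bytes used here).

```csharp
using System;
using System.Text;

public static class TwoStepDemo {
  // Hypothetical helper: decode bytes into a UTF-16 string, then encode that string.
  public static byte[] Transcode(byte[] input, Encoding source, Encoding target) {
    string text = source.GetString(input);  // step 1: bytes -> UTF-16 string
    return target.GetBytes(text);           // step 2: UTF-16 string -> bytes
  }

  public static void Main() {
    // Assumption: ISO-8859-1 is used in place of windows-1252.
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    byte[] input = { 0x41, 0xA3 };          // "A" and the pound sign
    byte[] output = Transcode(input, latin1, Encoding.UTF8);
    Console.WriteLine(BitConverter.ToString(output)); // 41-C2-A3
  }
}
```

The framework also offers Encoding.Convert(source, target, bytes), which performs the same pair of steps in one call.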

Codepoints

Some example code points:

U+0041 is LATIN CAPITAL LETTER A (A)
U+00A3 is POUND SIGN (£)
U+042F is CYRILLIC CAPITAL LETTER YA (Я)
U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
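To see these code points from C#, one option is to walk a string and skip over surrogate pairs. The sketch below assumes a well-formed string (char.ConvertToUtf32 throws on an unpaired surrogate); CodePointDemo and CodePoints are illustrative names, not framework APIs.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo {
  // Hypothetical helper: list the Unicode code points in a well-formed string.
  public static List<int> CodePoints(string s) {
    var result = new List<int>();
    for (int i = 0; i < s.Length; i++) {
      result.Add(char.ConvertToUtf32(s, i));
      // A code point above U+FFFF occupies two chars; skip the second one.
      if (char.IsSurrogatePair(s, i)) i++;
    }
    return result;
  }

  public static void Main() {
    // The four example code points; U+1D50A does not fit in one char.
    string s = "A\u00A3\u042F" + char.ConvertFromUtf32(0x1D50A);
    Console.WriteLine(s.Length);            // 5 chars
    foreach (int cp in CodePoints(s)) {
      Console.WriteLine("U+" + cp.ToString("X4"));
    }
    // Prints U+0041, U+00A3, U+042F, U+1D50A (four code points)
  }
}
```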

Encodings

Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.

You can think of an encoding as a tabular mapping between codepoints and byte representations.

CODEPOINT       UTF-16BE        UTF-8     WINDOWS-1252
U+0041 (A)         00 41           41               41
U+00A3 (£)         00 A3        C2 A3               A3
U+042F (Ya)        04 2F        D0 AF                -
U+1D50A      D8 35 DD 0A  F0 9D 94 8A                -

The System.Text.Encoding class exposes types/methods to perform the transformations.
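As a rough check, the UTF-16BE and UTF-8 columns of the table can be reproduced with Encoding.GetBytes. This is a sketch with made-up names (EncodingTableDemo, Hex); the windows-1252 column is left out because that code page may need a provider to be registered first on newer runtimes.

```csharp
using System;
using System.Text;

public static class EncodingTableDemo {
  // Hypothetical helper: hex bytes of one code point in the given encoding.
  public static string Hex(Encoding enc, int codePoint) {
    string s = char.ConvertFromUtf32(codePoint);
    return BitConverter.ToString(enc.GetBytes(s)).Replace("-", " ");
  }

  public static void Main() {
    int[] codePoints = { 0x41, 0xA3, 0x42F, 0x1D50A };
    foreach (int cp in codePoints) {
      Console.WriteLine("U+{0:X4}  {1,-12} {2}",
          cp,
          Hex(Encoding.BigEndianUnicode, cp),  // UTF-16BE
          Hex(Encoding.UTF8, cp));             // UTF-8
    }
  }
}
```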

Graphemes

The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
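Switching between the two forms might look like the following in C#, using string.Normalize. It is a minimal sketch that assumes the runtime's normal globalization support is available; NormalizationDemo, Compose and Decompose are illustrative names.

```csharp
using System;
using System.Text;

public static class NormalizationDemo {
  // Hypothetical helpers wrapping string.Normalize.
  public static string Compose(string s) {
    return s.Normalize(NormalizationForm.FormC);   // combine where possible
  }

  public static string Decompose(string s) {
    return s.Normalize(NormalizationForm.FormD);   // split into base + combining marks
  }

  public static void Main() {
    string twoCodePoints = "e\u0301";              // e + COMBINING ACUTE ACCENT
    string oneCodePoint = Compose(twoCodePoints);

    Console.WriteLine(twoCodePoints.Length);                     // 2
    Console.WriteLine(oneCodePoint.Length);                      // 1
    Console.WriteLine(oneCodePoint == "\u00E9");                 // True
    Console.WriteLine(Decompose(oneCodePoint) == twoCodePoints); // True
  }
}
```

Note that == on strings compares char by char, so the two forms are not equal to each other until one of them is normalized.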

Conclusions

When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
Encoding can be a lossy transformation - most non-Unicode encodings can only encode a subset of existing characters.
Since not all codepoints can fit into a single C# char, the number of chars in a string may be more than the number of codepoints, and the number of codepoints may be greater than the number of rendered graphemes.
The "length" of a string is context-sensitive, so you need to know what meaning you're applying and use the appropriate algorithm. How this is handled is defined by the programming language you're using.
Giving Latin-1 characters identical values in many encodings gives some people delusions of ASCII.
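A couple of these points can be checked with a short program: the three different "lengths" of one string, and the loss that happens when a target encoding has no byte for a character. This is a sketch under the assumption that StringInfo's text elements are an acceptable stand-in for rendered graphemes; LengthDemo and its methods are made-up names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LengthDemo {
  // Hypothetical helper: count code points, treating a surrogate pair as one.
  public static int CountCodePoints(string s) {
    int count = 0;
    for (int i = 0; i < s.Length; i++) {
      if (char.IsSurrogatePair(s, i)) i++;
      count++;
    }
    return count;
  }

  // Hypothetical helper: count text elements (an approximation of graphemes).
  public static int CountTextElements(string s) {
    return new StringInfo(s).LengthInTextElements;
  }

  public static void Main() {
    // e + combining acute accent, followed by U+1D50A
    string s = "e\u0301" + char.ConvertFromUtf32(0x1D50A);
    Console.WriteLine(s.Length);              // 4 chars
    Console.WriteLine(CountCodePoints(s));    // 3 code points
    Console.WriteLine(CountTextElements(s));  // 2 text elements

    // Lossy by default: ASCII has no pound sign, so it becomes '?' (0x3F).
    Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("\u00A3"))); // 3F

    // With exception fallbacks the loss is reported instead of hidden.
    Encoding strictAscii = Encoding.GetEncoding(
        "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    try {
      strictAscii.GetBytes("\u00A3");
    } catch (EncoderFallbackException) {
      Console.WriteLine("U+00A3 cannot be encoded as ASCII");
    }
  }
}
```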

(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)
