Good resources for learning the different types of character encoding and converting between them

Posted on 2024-08-04 01:13:42


One thing I have never truly understood is the concept of character encoding. The way encoding is handled in memory and code often baffles me in that I just copy an example from the internet without truly understanding what it does. I feel it's a really important and much overlooked subject that more people should take the time to get right (including myself).

I am looking for some good, to-the-point resources for learning the different types of character encoding and converting between them (preferably in C#). Both books and online resources are welcome.

Thanks.


Edit 1:

Thanks for the responses so far. I am especially looking for more information on how .NET handles encoding. I know this may seem vague, but I don't really know what to ask for. I guess I am curious how encoding is represented in, say, the C# string class, and whether the class itself can manage different encodings or whether there are separate classes for this.

Comments (3)

童话 2024-08-11 01:13:42

I'd start with this question: what is a character?

  • The logical identity: a codepoint. Unicode assigns a number to each character that isn't necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
  • The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
  • The thing you see on the screen: a grapheme. A grapheme is made up of one or more codepoints. This is the stuff at the presentation end of things.

This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.

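using System;
using System.IO;
using System.Text;
public class Enc {
  public static void Main(String[] args) {
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    Encoding win1252 = Encoding.GetEncoding(1252);
    Encoding utf8 = Encoding.UTF8; // StreamWriter writes a byte order mark with this instance
    using(StreamReader reader = new StreamReader("in.txt", win1252)) {
      using(StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
        char[] buffer = new char[1024];
        int r;
        // Read returns 0 at end of stream; Peek() > 0 would also stop early at a U+0000 char.
        while((r = reader.Read(buffer, 0, buffer.Length)) > 0) {
          writer.Write(buffer, 0, r);
        }
      }
    }
  }
}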

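Two transformations happen here. First, the bytes are decoded from windows-1252 into UTF-16 code units in the char buffer (in memory these use the platform's native byte order, which is little endian on most hardware .NET runs on). Then the contents of the buffer are encoded as UTF-8 as they are written out.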
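The same two steps can be done in memory without any files. This is only a minimal sketch: the class and method names (TwoStepDemo, Latin1ToUtf8) are made up for illustration, and it uses ISO-8859-1 rather than windows-1252 as the source encoding, on the assumption that it should run on every .NET flavour without registering an extra encoding provider.

```csharp
using System;
using System.Text;

public static class TwoStepDemo {
  // Hypothetical helper: decode ISO-8859-1 bytes to a string (UTF-16 in memory),
  // then encode that string as UTF-8 bytes.
  public static byte[] Latin1ToUtf8(byte[] latin1Bytes) {
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string text = latin1.GetString(latin1Bytes); // step 1: bytes -> UTF-16 chars
    return Encoding.UTF8.GetBytes(text);         // step 2: UTF-16 chars -> UTF-8 bytes
  }

  public static void Main() {
    // 0xA3 is POUND SIGN in ISO-8859-1 (and in windows-1252).
    byte[] utf8 = Latin1ToUtf8(new byte[] { 0xA3 });
    Console.WriteLine(BitConverter.ToString(utf8)); // C2-A3
  }
}
```

Encoding.Convert(source, destination, bytes) should do the same round trip in a single call, but spelling out the two steps makes the intermediate UTF-16 string visible.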

Codepoints

Some example code points:

  • U+0041 is LATIN CAPITAL LETTER A (A)
  • U+00A3 is POUND SIGN (£)
  • U+042F is CYRILLIC CAPITAL LETTER YA (Я)
  • U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
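To poke at code points from C#, something like the following sketch may help. The names CodePointDemo, FromCodePoint and FirstCodePoint are invented for this example; char.ConvertFromUtf32 and char.ConvertToUtf32 are the framework methods doing the work. Note that U+1D50A does not fit in a single 16-bit char, so it comes back as two chars (a surrogate pair).

```csharp
using System;

public static class CodePointDemo {
  // Hypothetical helper: build a string from a single Unicode code point.
  public static string FromCodePoint(int codePoint) {
    return char.ConvertFromUtf32(codePoint);
  }

  // Hypothetical helper: read the code point that starts at index 0.
  public static int FirstCodePoint(string s) {
    return char.ConvertToUtf32(s, 0);
  }

  public static void Main() {
    string a = FromCodePoint(0x0041);
    string g = FromCodePoint(0x1D50A);
    Console.WriteLine(a.Length);                        // 1
    Console.WriteLine(g.Length);                        // 2 (surrogate pair)
    Console.WriteLine(FirstCodePoint(g).ToString("X")); // 1D50A
    Console.WriteLine(((int)g[0]).ToString("X") + " " + ((int)g[1]).ToString("X")); // D835 DD0A
  }
}
```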

Encodings

Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.

You can think of an encoding as a tabular mapping between codepoints and byte representations.

CODEPOINT       UTF-16BE        UTF-8     WINDOWS-1252
U+0041 (A)         00 41           41               41
U+00A3 (£)         00 A3        C2 A3               A3
U+042F (Ya)        04 2F        D0 AF                -
U+1D50A      D8 35 DD 0A  F0 9D 94 8A                -

The System.Text.Encoding class exposes types/methods to perform the transformations.
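As a rough illustration, the UTF-16BE and UTF-8 columns of the table can be reproduced with Encoding.BigEndianUnicode and Encoding.UTF8. The class name EncodingTableDemo and the Hex helper are made up for this sketch; the windows-1252 column is left out on the assumption that the code should run without registering a code page provider.

```csharp
using System;
using System.Text;

public static class EncodingTableDemo {
  // Hypothetical helper: encode a string and format the bytes as "C2 A3".
  public static string Hex(Encoding enc, string s) {
    return BitConverter.ToString(enc.GetBytes(s)).Replace("-", " ");
  }

  public static void Main() {
    // A, POUND SIGN, CYRILLIC YA, MATHEMATICAL FRAKTUR CAPITAL G
    string[] samples = { "A", "\u00A3", "\u042F", "\U0001D50A" };
    foreach (string s in samples) {
      Console.WriteLine("U+{0:X4}  UTF-16BE: {1}  UTF-8: {2}",
        char.ConvertToUtf32(s, 0),
        Hex(Encoding.BigEndianUnicode, s),
        Hex(Encoding.UTF8, s));
    }
  }
}
```

GetBytes does not prepend a byte order mark, so the output contains only the bytes for the characters themselves.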

Graphemes

The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
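A small sketch of switching between the two forms with string.Normalize. The class and helper names (NormalizationDemo, Compose, Decompose) are invented for illustration, and the expected results assume a runtime with normal Unicode normalization support.

```csharp
using System;
using System.Text;

public static class NormalizationDemo {
  // Hypothetical helper: combine base letters and combining marks where possible (NFC).
  public static string Compose(string s) {
    return s.Normalize(NormalizationForm.FormC);
  }

  // Hypothetical helper: split precomposed characters into base + combining marks (NFD).
  public static string Decompose(string s) {
    return s.Normalize(NormalizationForm.FormD);
  }

  public static void Main() {
    string decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT
    string composed = "\u00E9";    // LATIN SMALL LETTER E WITH ACUTE

    // == on strings compares UTF-16 code units, so the two forms differ.
    Console.WriteLine(decomposed == composed);            // False
    Console.WriteLine(Compose(decomposed) == composed);   // True
    Console.WriteLine(Decompose(composed) == decomposed); // True
  }
}
```

If text from different sources has to be compared or used as a lookup key, normalizing both sides to the same form first is usually the safer choice.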

Conclusions

  • When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
  • Encoding can be a lossy transformation - most non-Unicode encodings can only encode a subset of existing characters.
  • Since not all codepoints fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may in turn be greater than the number of rendered graphemes.
  • The "length" of a string is context-sensitive, so you need to know what meaning you're applying and use the appropriate algorithm. How this is handled is defined by the programming language you're using.
  • Giving Latin-1 characters identical values in many encodings gives some people delusions of ASCII.
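The points about lossy encoding and the several meanings of "length" can be sketched like this. The names LengthDemo, CountCodePoints, CountTextElements and RoundTripAscii are hypothetical; StringInfo counts "text elements", which is used here as an approximation of graphemes.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LengthDemo {
  // Hypothetical helper: count code points by treating each surrogate pair as one.
  public static int CountCodePoints(string s) {
    int count = 0;
    for (int i = 0; i < s.Length; i++) {
      count++;
      if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1])) {
        i++; // skip the low surrogate of the pair
      }
    }
    return count;
  }

  // Hypothetical helper: count text elements (roughly, what a user sees as characters).
  public static int CountTextElements(string s) {
    return new StringInfo(s).LengthInTextElements;
  }

  // Hypothetical helper: encode to ASCII and decode again to show what is lost.
  public static string RoundTripAscii(string s) {
    return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
  }

  public static void Main() {
    string s = "\U0001D50Ae\u0301"; // Fraktur G, then e + combining acute
    Console.WriteLine(s.Length);                 // 4 chars (UTF-16 code units)
    Console.WriteLine(CountCodePoints(s));       // 3 code points
    Console.WriteLine(CountTextElements(s));     // 2 text elements
    Console.WriteLine(RoundTripAscii("\u00A3")); // ? (POUND SIGN is not in ASCII)
  }
}
```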

(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)

吃素的狼 2024-08-11 01:13:42

Wikipedia has a pretty good explanation of character encoding in general: http://en.wikipedia.org/wiki/Character_encoding.

If you are looking for details of UTF-8, which is one of the most popular character encodings, you should read the UTF-8 and Unicode FAQ.

And, as was already pointed out, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a very good beginner's tutorial.

人间☆小暴躁 2024-08-11 01:13:42

There's the famous Joel article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html

Edit: Although that's more about text formats. On re-reading, I guess you're more interested in things like HTML encoding and URL encoding? Those are for escaping special characters that have significant meanings within HTML or URLs (e.g. < and > in HTML, or ? and = in URLs).
