Good resources for learning the different types of character encoding and converting between them
One thing I have never truly understood is the concept of character encoding. The way encoding is handled in memory and code often baffles me in that I just copy an example from the internet without truly understanding what it does. I feel it's a really important and much overlooked subject that more people should take the time to get right (including myself).
I am looking for some good, to-the-point resources for learning the different types of character encoding and converting between them (preferably in C#). Both books and online resources are welcome.
Thanks.
Edit 1:
Thanks for the responses so far. I am especially looking for more info on how .NET handles encoding. I know this may seem vague, but I don't really know what to ask for. I guess I am curious how encoding is represented in, say, the C# string class, and whether the class itself can manage different encoding types or whether there are separate classes for this.
Comments (3)
I'd start with this question: what is a character?
This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.
Two transformations happen here. First, the bytes are decoded from windows-1252 to UTF-16 (little endian, I think) into the char buffer. Then the buffer is transformed into UTF-8.
Codepoints
Some example code points:
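As a rough illustration (not the original list from this answer), here is a small C# sketch that prints the code points of a string. The class and method names (CodePointDemo, CodePoints) are made up for this example; char.ConvertToUtf32 and char.IsHighSurrogate are the standard .NET calls. The sample string holds LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN and MUSICAL SYMBOL G CLEF.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Returns the Unicode code points of a string, pairing surrogates as needed.
    // Assumes the input is well-formed UTF-16 (no lone surrogates).
    public static List<int> CodePoints(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            // ConvertToUtf32 reads one full code point starting at index i.
            result.Add(char.ConvertToUtf32(s, i));
            // A code point above U+FFFF occupies two chars; skip the low surrogate.
            if (char.IsHighSurrogate(s[i])) i++;
        }
        return result;
    }

    public static void Main()
    {
        // Prints U+0041, U+00E9, U+20AC, U+1D11E (one per line).
        foreach (int cp in CodePoints("A\u00E9\u20AC\U0001D11E"))
            Console.WriteLine("U+" + cp.ToString("X4"));
    }
}
```

Note that the last code point needs two char values in the string, which is why the string's Length is 5 even though it holds only four code points.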
Encodings
Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.
You can think of an encoding as a tabular mapping between codepoints and byte representations.
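To make the "table" idea concrete, the sketch below prints the bytes that several encodings produce for the same code point, U+00E9. The helper name Hex and the class name are invented for this example; Encoding.GetBytes and BitConverter.ToString are standard .NET calls.

```csharp
using System;
using System.Text;

public static class EncodingTableDemo
{
    // Formats the bytes an encoding produces for a string, e.g. "C3 A9".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace("-", " ");
    }

    public static void Main()
    {
        string eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        Console.WriteLine("UTF-8:     " + Hex(Encoding.UTF8, eAcute));             // C3 A9
        Console.WriteLine("UTF-16 LE: " + Hex(Encoding.Unicode, eAcute));          // E9 00
        Console.WriteLine("UTF-16 BE: " + Hex(Encoding.BigEndianUnicode, eAcute)); // 00 E9
        Console.WriteLine("UTF-32 LE: " + Hex(Encoding.UTF32, eAcute));            // E9 00 00 00
    }
}
```

The code point is the same in every row; only its byte representation changes with the encoding.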
The System.Text.Encoding class exposes types/methods to perform the transformations.
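A minimal sketch of the windows-1252 to UTF-8 file conversion described earlier, using those classes. This is an illustration rather than the listing the answer originally showed: the Transcode method and its parameters are hypothetical names, and the RegisterProvider call is an assumption that only matters on .NET Core and .NET 5+, where windows-1252 is not available until the code pages provider is registered.

```csharp
using System;
using System.IO;
using System.Text;

public static class TranscodeDemo
{
    // Decodes inPath using the source encoding into a UTF-16 char buffer,
    // then encodes that buffer to outPath using the target encoding.
    public static void Transcode(string inPath, Encoding source,
                                 string outPath, Encoding target)
    {
        char[] buffer = new char[1024];
        using (var reader = new StreamReader(inPath, source))
        using (var writer = new StreamWriter(outPath, false, target))
        {
            int read;
            // Read returns 0 at end of file.
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered first. On .NET Framework this line is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding("windows-1252");
        // new UTF8Encoding(false) writes no byte order mark;
        // Encoding.UTF8 would prepend one to the file.
        Transcode("in.txt", windows1252, "out.txt", new UTF8Encoding(false));
    }
}
```

Passing the encodings as parameters keeps the two steps visible: the reader owns the decoding to char, the writer owns the encoding to bytes.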
Graphemes
The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.
('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
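The two representations of é can be compared in C# with a short sketch like the one below. The class and method names are invented for the example; string.Normalize, NormalizationForm and StringInfo.LengthInTextElements are standard .NET APIs.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class GraphemeDemo
{
    // Composes combining sequences where a single-codepoint form exists (NFC).
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }

    // Splits composed characters back into base + combining marks (NFD).
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormD);
    }

    // Counts text elements (roughly, graphemes) rather than UTF-16 code units.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string decomposed = "e\u0301"; // U+0065 followed by U+0301
        string composed = Compose(decomposed);

        Console.WriteLine(decomposed.Length);                  // 2
        Console.WriteLine(composed.Length);                    // 1
        Console.WriteLine(composed == "\u00E9");               // True
        Console.WriteLine(TextElementCount(decomposed));       // 1
        Console.WriteLine(Decompose(composed) == decomposed);  // True
    }
}
```

Because == on strings compares char by char, the composed and decomposed forms are not equal until one of them is normalized.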
Conclusions
(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)
Wikipedia has a pretty good explanation of character encoding in general: http://en.wikipedia.org/wiki/Character_encoding.
If you are looking for details of UTF-8, which is one of the most popular character encodings, you should read the UTF-8 and Unicode FAQ.
And, as was already pointed out, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a very good beginner's tutorial.
There's the famous Joel article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html
Edit: Although that's more about text formats. On re-reading, I guess you're more interested in things like HTML encoding and URL encoding? Those are for escaping special characters that have significant meanings within HTML or URLs (e.g. < and > in HTML, or ? and = in URLs).
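If that is the kind of encoding you mean, a minimal C# sketch of both might look like this. The class and method names are made up for the example; WebUtility.HtmlEncode and Uri.EscapeDataString are standard .NET calls.

```csharp
using System;
using System.Net;

public static class EscapingDemo
{
    // Escapes characters that have special meaning in HTML markup.
    public static string ForHtml(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Percent-encodes a value so it can be used inside a URL query string.
    public static string ForUrl(string value)
    {
        return Uri.EscapeDataString(value);
    }

    public static void Main()
    {
        Console.WriteLine(ForHtml("<b>1 < 2</b>")); // &lt;b&gt;1 &lt; 2&lt;/b&gt;
        Console.WriteLine(ForUrl("a=b?c"));         // a%3Db%3Fc
        Console.WriteLine(ForUrl("\u00E9"));        // %C3%A9
    }
}
```

The last line shows where the two topics meet: URL escaping percent-encodes the UTF-8 bytes of a non-ASCII character.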