Why does the .NET Framework StreamReader / Writer use UTF8 encoding by default?
I was just looking at the constructors for StreamReader / Writer, and I noticed that they use UTF8 by default. Does anyone know why? I would have presumed that defaulting to Unicode would be a safer bet.
Comments (4)
UTF-8 will work with any ASCII document, and is typically more compact than UTF-16 - but it still covers the whole of Unicode. I'd say that UTF-8 is far more common than UTF-16. It's also the default for XML (when there's no BOM and no explicit encoding specified).
Why do you think it would be better to default to UTF-16? (That's what Encoding.Unicode is.)
EDIT: I suspect you're confused about exactly what UTF-8 can handle. This page describes it pretty clearly, including how any particular Unicode character is encoded. It's a variable-width encoding, but it covers the whole of Unicode.
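To make the two claims above concrete (ASCII compatibility and full Unicode coverage), here is a minimal sketch. It uses Python only to inspect the bytes; the sample strings are my own, and nothing here is specific to StreamReader.

```python
# ASCII text produces exactly the same bytes in ASCII and in UTF-8,
# which is why a UTF-8 reader can open plain ASCII files unchanged.
ascii_text = "Hello, StreamReader"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# UTF-8 also round-trips characters far outside ASCII, including
# U+1F600, which lies beyond the first 64K code points.
sample = "A \u00e9 \u4e2d \U0001F600"
round_tripped = sample.encode("utf-8").decode("utf-8")

print(same_bytes)
print(round_tripped == sample)
```

Both lines should print True if the claims hold.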
UTF8 is Unicode; more specifically, it is one of the Unicode encodings.
More importantly, it's backwards compatible with ASCII, and it's the standard default for XML and HTML.
"Unicode" is the name of a standard, so there's no such encoding as "Unicode". Rather, there are two mapping methods: UTF and UCS.
As for the "why" part, UTF-8 has maximum compatibility with ASCII.
As the others have already said, UTF-8 is one of the encodings defined by the Unicode standard. It uses a variable number of bytes (one to four) to encode every Unicode character.
All ASCII characters are represented as is, so ASCII files can be read without further ado. A byte with its highest bit set (a value above 127) is part of a multi-byte sequence: the lead byte indicates how many bytes the sequence has, and each following continuation byte has the form 10xxxxxx. The whole sequence is then treated as one character.
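As a rough sketch of how a reader can group UTF-8 bytes into characters, the lead byte alone tells it how long each sequence is. The helper names `utf8_sequence_length` and `split_utf8` are made up for this illustration, and the code assumes well-formed input; it is not how StreamReader is actually implemented.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its lead byte."""
    if lead < 0x80:             # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#04x}")


def split_utf8(data: bytes) -> list[bytes]:
    """Split UTF-8 data into one byte string per encoded code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead must be a continuation byte (10xxxxxx).
        if len(chunk) != n or any((b & 0xC0) != 0x80 for b in chunk[1:]):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


# 'a', U+00E9, U+4E2D and U+1F600 take 1, 2, 3 and 4 bytes respectively.
chunks = split_utf8("a\u00e9\u4e2d\U0001F600".encode("utf-8"))
print([c.hex() for c in chunks])  # ['61', 'c3a9', 'e4b8ad', 'f09f9880']
```

Note that the continuation bytes are all 128 or above, so a reader cannot simply "combine bytes until one is below 128"; it has to rely on the lead byte.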
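Characters in Latin-1 (often called ANSI on Windows) that lie outside ASCII are encoded using two bytes: for example, é (U+00E9) is encoded as the bytes C3 A9. The encoded length of 'é' is therefore 2 bytes, although it is still a single character.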
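A quick way to check the é example is to look at the bytes directly. This is only a sketch in Python; the variable names are mine. It also shows the decomposed form (e followed by a combining accent), which I assume is where the idea of "e plus an accent" comes from, but that is a separate Unicode normalization matter and not what UTF-8 itself does.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
# NFD normalization splits it into 'e' followed by U+0301 (combining acute).
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed))               # 1 character
print(precomposed.encode("latin-1"))  # b'\xe9'     (1 byte in Latin-1)
print(precomposed.encode("utf-8"))    # b'\xc3\xa9' (2 bytes in UTF-8)
print(len(decomposed))                # 2 code points
print(decomposed.encode("utf-8"))     # b'e\xcc\x81'
```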
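Windows uses UTF-16 internally. A single 16-bit code unit can only represent 64K characters, which is by no means all Unicode characters, so UTF-16 encodes the rest as surrogate pairs of two code units. UTF-32 can represent every code point in a single fixed-width unit, although the standard caps the code space at U+10FFFF. Neither is byte-compatible with ASCII, because both pad ASCII characters with zero bytes: 'A' is the single byte 41 in ASCII and UTF-8, but 41 00 in little-endian UTF-16.
Both also come in little-endian and big-endian variants, which differ in byte order.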
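To illustrate the zero padding and the two byte orders, here is a small sketch that dumps the bytes of the letter 'A' in each encoding, plus one character outside the first 64K code points to show a UTF-16 surrogate pair. The choice of sample characters is mine.

```python
letter = "A"  # U+0041

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
dump = {enc: letter.encode(enc).hex(" ") for enc in encodings}

for enc, hex_bytes in dump.items():
    print(f"{enc:10} {hex_bytes}")
# utf-8      41
# utf-16-le  41 00
# utf-16-be  00 41
# utf-32-le  41 00 00 00
# utf-32-be  00 00 00 41

# U+1F600 does not fit in one 16-bit unit, so UTF-16 uses the
# surrogate pair D83D DE00 (shown here in little-endian byte order).
surrogates = "\U0001F600".encode("utf-16-le").hex(" ")
print(surrogates)  # 3d d8 00 de
```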
Imagine using UTF-16 or UTF-32 to save your files. Text files containing only ASCII characters would double or quadruple in size compared to ASCII or UTF-8. UTF-8 not only covers all characters in the Unicode standard, including future additions, but also stores them space-efficiently.
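The size difference is easy to measure. The sketch below compares encoded lengths for an ASCII-only sentence and, as a caveat, for a short CJK string; both sample strings are arbitrary choices of mine.

```python
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# 43 ASCII characters: one byte each in UTF-8, two in UTF-16, four in UTF-32.
ascii_text = "The quick brown fox jumps over the lazy dog"
ascii_sizes = {enc: len(ascii_text.encode(enc)) for enc in encodings}
print(ascii_sizes)  # {'utf-8': 43, 'utf-16-le': 86, 'utf-32-le': 172}

# Two CJK characters: three bytes each in UTF-8, but only two in UTF-16.
cjk_text = "\u4e2d\u6587"
cjk_sizes = {enc: len(cjk_text.encode(enc)) for enc in encodings}
print(cjk_sizes)    # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

So the space saving holds for ASCII-heavy text; for text that is mostly CJK, UTF-16 can come out smaller than UTF-8.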
Usually the first few bytes of a file, the BOM or byte order mark, tell you which encoding is used. If it is omitted, XML and StreamReader assume UTF-8, as you found out. This again makes sense, as ASCII files do not have a BOM and are therefore read correctly. This might not be true for files that use the full Latin-1 range, since bytes above 127 in such files are generally not valid UTF-8.
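The BOM-then-fallback idea can be sketched as follows. This is an illustration under my own assumptions, not StreamReader's actual code: `decode_with_bom` is a hypothetical helper, and it raises an error on invalid bytes, which a real reader may handle differently.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if one is present, otherwise fall back to `default`."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)


# A BOM-less ASCII file decodes correctly under the UTF-8 fallback.
print(decode_with_bom(b"plain ascii"))

# A UTF-16 file with a BOM is detected from its first two bytes.
print(decode_with_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))

# A BOM-less Latin-1 file is not valid UTF-8 once it uses bytes above 127.
try:
    decode_with_bom("caf\u00e9".encode("latin-1"))
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

Note that the BOM is three bytes for UTF-8, two for UTF-16 and four for UTF-32, so checking only the first two bytes is not enough in general.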