Why do the .NET Framework StreamReader / StreamWriter default to UTF-8 encoding?

Posted on 2024-07-19 02:19:24

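I was just looking at the constructors for StreamReader / StreamWriter and noticed that they use UTF-8 by default. Does anyone know why? I would have presumed that defaulting to Unicode would be the safer bet.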

生寂 2024-07-26 02:19:24

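UTF-8 works with any ASCII document and is typically more compact than UTF-16, yet it still covers the whole of Unicode. I'd say UTF-8 is far more common than UTF-16. It is also the default for XML (when there is no BOM and no explicit encoding is specified).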

Why do you think it would be better to default to UTF-16? (That is what Encoding.Unicode is.)

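EDIT: I suspect you're confused about exactly what UTF-8 can handle. This page describes it pretty clearly, including how any particular Unicode character is encoded. It is a variable-width encoding, but it covers the whole of Unicode.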
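To illustrate the variable-width point, here is a small sketch in Python (not .NET, chosen only for brevity). The sample characters and the helper name `utf8_width` are my own picks, not anything from the framework:

```python
# Sample characters chosen for illustration: ASCII, Latin-1, CJK, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


widths = {ch: utf8_width(ch) for ch in samples}
for ch, n in widths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

Every code point gets an encoding; only the number of bytes varies, from one to four.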

德意的啸 2024-07-26 02:19:24

UTF-8 is Unicode; more specifically, it is one of the Unicode encodings.

More importantly, it is backward compatible with ASCII, and it is the standard default for XML and HTML.

无声无音无过去 2024-07-26 02:19:24

"Unicode" is the name of a standard, so there is no such encoding as "Unicode". Rather, there are two families of mapping methods: UTF and UCS.

As for the "why" part, UTF-8 has maximum compatibility with ASCII.

绻影浮沉 2024-07-26 02:19:24

As the others have already said, UTF-8 is one of the encodings defined by the Unicode standard. It uses a variable number of bytes (one to four) to encode every Unicode character.

All ASCII characters are represented as-is, so ASCII files can be read without further ado. A byte with its 8th (highest) bit set, i.e. a value above 127, is part of a multi-byte sequence: the lead byte tells the reader how many continuation bytes follow (one, two, or three), and each continuation byte has the bit pattern 10xxxxxx. The whole sequence is then decoded as one character.
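To be precise about how the bytes are grouped: in UTF-8 the high bits of the lead byte give the length of the sequence, and every continuation byte lies in the range h80 to hBF. A minimal Python sketch of that rule follows. The helper names `utf8_sequence_length` and `split_utf8` are mine, and the code assumes well-formed input rather than validating it:

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total length of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def split_utf8(data: bytes) -> list:
    """Split well-formed UTF-8 bytes into one chunk per character."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# "A", e-acute (U+00E9) and the euro sign (U+20AC): 1, 2 and 3 bytes.
chunks = split_utf8("A\u00e9\u20ac".encode("utf-8"))
print(chunks)
```

Because continuation bytes can never be mistaken for lead bytes or for ASCII, a reader can also resynchronise after a corrupted byte.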

Characters in the upper half of LATIN-1 (often loosely called ANSI) are encoded in UTF-8 using two bytes: for example, é (U+00E9) is encoded as hC3 hA9. The byte length of 'é' is therefore 2, even though it is a single character.

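Windows uses UTF-16 internally. A single 16-bit code unit can represent only 64K characters, which is by no means all Unicode characters; UTF-16 covers the rest with surrogate pairs (two 16-bit units), so it is variable-width as well. UTF-32 stores every character in four bytes, although the Unicode code space itself is capped at U+10FFFF. Neither is byte-compatible with ASCII, as both pad ASCII characters with zero bytes: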

A = ASCII h41 = UTF-8 h41 = UTF-16 h0041 = UTF-32 h00000041
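On the 64K point, it may help to note that the limit applies to a single 16-bit code unit; UTF-16 reaches code points beyond U+FFFF with surrogate pairs. A small Python check (the helper name `utf16_units` and the sample characters are my own):

```python
import struct


def utf16_units(ch: str) -> list:
    """Return the 16-bit code units UTF-16 uses for one character."""
    raw = ch.encode("utf-16-be")  # explicit byte order, so no BOM is written
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


bmp = utf16_units("\u00e9")         # inside the 64K Basic Multilingual Plane
astral = utf16_units("\U0001F600")  # beyond 64K: needs a surrogate pair
print([hex(u) for u in bmp], [hex(u) for u in astral])
```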

There are also little and big endian encodings:

A = UTF-16 big endian h0041 = UTF-16 little endian h4100
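The byte patterns above are easy to reproduce. This Python snippet is only a sanity check, using the codec names from Python's standard library:

```python
# Explicit -be / -le codec names are used so that no BOM is prepended.
encodings = ["ascii", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
hex_forms = {name: "A".encode(name).hex() for name in encodings}

for name, value in hex_forms.items():
    print(f"{name:10} h{value}")
```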

Imagine using UTF-16 or UTF-32 to save your files. Text files containing only ASCII characters would double or quadruple in size compared with ASCII or UTF-8. UTF-8 not only covers every character in the Unicode standard, including future additions, but is also space-efficient for such text.
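A rough way to see the size difference, sketched in Python with made-up ASCII-only content standing in for a text file:

```python
# Hypothetical ASCII-only file content, repeated to get a non-trivial size.
text = "Hello, StreamReader!\n" * 1000

sizes = {
    name: len(text.encode(name))
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
for name, size in sizes.items():
    print(f"{name:10} {size} bytes")
```

The ratio only holds for ASCII-heavy text; text dominated by CJK characters, for instance, takes three bytes per character in UTF-8 but two in UTF-16.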

Usually the first few bytes of a file, the BOM or byte order mark, tell you which encoding is used: two bytes for UTF-16, three for UTF-8, four for UTF-32. If it is omitted, XML and StreamReader assume UTF-8, as you found out. This again makes sense, as ASCII files do not have a BOM and are therefore read correctly. This is not true for files that use the upper half of LATIN-1, since bytes above 127 in such files are generally not valid UTF-8.
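The "check for a BOM, otherwise assume UTF-8" idea can be sketched as follows. This is a Python illustration of the general technique under my own assumptions, not StreamReader's actual implementation; the function name `sniff_and_decode` is hypothetical, and the BOM constants come from Python's `codecs` module:

```python
import codecs

# Longest BOMs first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_and_decode(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if one is present, else fall back to `default`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name)
    return data.decode(default)


print(sniff_and_decode(b"plain ascii"))
print(sniff_and_decode(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```

In this sketch a BOM-less LATIN-1 file with accented characters raises `UnicodeDecodeError`, because Python's decoder is strict by default; other readers may substitute a replacement character instead.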
