Character encoding

Posted 2024-09-04 16:08:24

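My text editor lets me save files in several different character encodings: Ansi, UTF-8, UTF-8 (no BOM), UTF-16LE, and UTF-16BE.

What is the difference between them?

Which one is generally considered the best (I'm using Python, if that makes a difference)?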

酒几许 2024-09-11 16:08:24

"Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It only supports a small set of characters (256 different characters at most).
UTF-8 is a variable-length, ASCII-compatible encoding that can store every Unicode character. It's a good choice for western text that must still support all of Unicode, and a very viable choice in the general case.
"UTF-8 (no BOM)" is the name Windows gives to using UTF-8 without writing a byte order mark (BOM). Since UTF-8 doesn't need a BOM, it shouldn't be written, which makes this the correct choice (pretty much everyone else calls this version simply "UTF-8"!).
UTF-16LE and UTF-16BE are the little-endian and big-endian variants of the UTF-16 encoding. Like UTF-8, UTF-16 can represent any Unicode character; however, it is not ASCII-compatible.
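
To make the differences concrete, here is a small sketch that encodes the same one-character string with each of the options above. It assumes that "Ansi" means Windows-1252 (`cp1252` in Python), which is only true on some systems; the sample character is arbitrary.

```python
import codecs

text = "é"  # U+00E9, a single non-ASCII character

encoded = {
    "cp1252": text.encode("cp1252"),        # assumed "Ansi" code page
    "utf-8": text.encode("utf-8"),          # UTF-8 without a BOM
    "utf-8-sig": text.encode("utf-8-sig"),  # UTF-8 with a leading BOM
    "utf-16-le": text.encode("utf-16-le"),  # little-endian, no BOM
    "utf-16-be": text.encode("utf-16-be"),  # big-endian, no BOM
}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} byte(s): {data.hex()}")

# ASCII compatibility: plain ASCII text is byte-for-byte identical in UTF-8,
# but not in UTF-16, where every ASCII character gets an extra zero byte.
ascii_same_in_utf8 = "A".encode("utf-8") == b"A"
ascii_in_utf16 = "A".encode("utf-16-le")
```

Note that the `utf-8-sig` result is just the `utf-8` result with the three BOM bytes (`codecs.BOM_UTF8`) in front.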

Generally speaking, UTF-8 is a great overall choice with wide compatibility (just make sure not to write the BOM, because most other software expects UTF-8 without one).
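
In Python, writing UTF-8 without a BOM mostly comes down to passing `encoding="utf-8"` explicitly. The following is a minimal sketch; the file name `example.txt` and the sample text are made up for illustration.

```python
import codecs
import os
import tempfile

text = "naïve café\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name

    # encoding="utf-8" writes no BOM; encoding="utf-8-sig" would prepend one.
    # newline="" keeps "\n" as-is so the round trip is exact on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(path, "rb") as f:
        raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)

# When reading files from other tools, "utf-8-sig" strips a leading BOM
# if one is present and behaves like plain UTF-8 if not.
round_trip = raw.decode("utf-8-sig")

print(has_bom, round_trip == text)
```

Reading with `utf-8-sig` is a reasonable defensive choice when you don't control how the file was written.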

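UTF-16 can take less space if most of your text consists of characters that need three bytes in UTF-8 but only two in UTF-16 (Chinese, Japanese, or Korean text, for example). For text that is mostly ASCII, UTF-8 is the smaller of the two.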
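
A quick way to check the size trade-off for your own text is to encode it both ways and compare the lengths. The sample strings below are arbitrary illustrations, not a benchmark.

```python
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Chinese": "你好，世界",
}

sizes = {}
for label, sample in samples.items():
    utf8_len = len(sample.encode("utf-8"))
    utf16_len = len(sample.encode("utf-16-le"))  # no BOM, 2 bytes per BMP character
    sizes[label] = (utf8_len, utf16_len)
    print(f"{label:8} chars={len(sample):2} utf-8={utf8_len:2} utf-16={utf16_len:2}")
```

For these samples, UTF-8 is clearly smaller for the English string, slightly smaller for the Russian one (because of the ASCII comma and space), and larger for the Chinese one.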

"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.

An important thing about any encoding is that it is metadata that needs to be communicated in addition to the data. This means you must know the encoding of a byte stream to interpret it correctly as text. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
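
Here is a short sketch of what goes wrong when the encoding is not communicated, and of how a self-describing format avoids the problem. The sample strings are invented; the XML snippet declares ISO-8859-1 purely for illustration.

```python
import xml.etree.ElementTree as ET

data = "café".encode("utf-8")

right = data.decode("utf-8")   # the encoding that was actually used
wrong = data.decode("cp1252")  # a wrong guess: no error, just garbled text

print(right)
print(wrong)

# XML can carry its own encoding in the declaration, so the parser
# does not need to be told separately how to decode the bytes.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><word>caf\xe9</word>'
xml_text = ET.fromstring(xml_bytes).text
print(xml_text)
```

The wrong guess does not raise an exception here, which is what makes this kind of bug easy to miss: the bytes are valid in both encodings, they just mean different things.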

For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.

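For Python files specifically, you can declare the encoding of a source file with a special comment on its first or second line (see PEP 263). Python 3 assumes UTF-8 when no declaration is present.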
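
As a rough illustration, the sketch below builds a tiny source file in memory that is deliberately not UTF-8 and relies on a `coding` comment so that Python decodes it correctly. The variable name `name` and the choice of `cp1252` are assumptions made only for this example; for a UTF-8 file you would write `# -*- coding: utf-8 -*-` or, in Python 3, simply omit the comment.

```python
# A source file stored as Windows-1252 bytes, with the encoding declared
# in a comment on its first line.
source = '# -*- coding: cp1252 -*-\nname = "café"\n'.encode("cp1252")

# compile() accepts bytes and honours the coding declaration when decoding them.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)

print(namespace["name"])
```

Without the declaration, the byte `0xE9` would not be valid UTF-8 and Python 3 would reject the file with a `SyntaxError`.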
