Character encoding

Posted on 2024-09-04 16:08:24

My text editor allows me to encode files in several different character formats: Ansi, UTF-8, UTF-8 (No BOM), UTF-16LE, and UTF-16BE.

What is the difference between them?

What is commonly regarded as the best format (I'm using Python, if that makes a difference)?

Comments (3)

酒几许 2024-09-11 16:08:24
  • "Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It supports only a small set of characters (at most 256).
  • UTF-8 is a variable-length, ASCII-compatible encoding capable of storing every Unicode character. It's a very good choice for western text that also needs to support the full Unicode range, and a perfectly viable choice in the general case.
  • "UTF-8 (no BOM)" is the name Windows gives to UTF-8 written without a byte order mark (BOM). Since UTF-8 does not need a BOM, one shouldn't be written, which makes this the correct choice (pretty much everyone else simply calls this version "UTF-8"!).
  • UTF-16LE and UTF-16BE are the little-endian and big-endian versions of the UTF-16 encoding. Like UTF-8, UTF-16 can represent any Unicode character, but it is not ASCII-compatible.
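
To make the list above concrete, here is a minimal sketch that encodes the same two-character string with each option, assuming Python 3. The codec names are Python's own (`cp1252` stands in for "Ansi" on a western Windows install, `utf-8-sig` is UTF-8 with a BOM); the sample string is just an illustration.

```python
# Encode one sample string with each of the encodings discussed above.
text = "\u00e9\u20ac"  # "é" and "€", both of which exist in Windows-1252

encodings = ["cp1252", "utf-8", "utf-8-sig", "utf-16-le", "utf-16-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # Show the byte count and the raw bytes as hex for comparison.
    print(f"{name:10} {len(data)} bytes  {data.hex()}")

# ASCII compatibility: plain ASCII text is byte-for-byte identical in UTF-8,
# but every character is padded with a zero byte in UTF-16.
ascii_utf8 = "A".encode("utf-8")
ascii_utf16 = "A".encode("utf-16-le")
```

The same two characters take 2 bytes in cp1252, 5 in UTF-8, 8 in UTF-8 with a BOM, and 4 in either UTF-16 variant, where only the byte order differs.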

Generally speaking, UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because most other software expects UTF-8 without one).
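
As a rough illustration of why the BOM matters, the sketch below (assuming Python 3) shows what happens when BOM-prefixed bytes are decoded as plain UTF-8, and how Python's `utf-8-sig` codec handles both cases. The variable names are made up for the example.

```python
import codecs

text = "hello"

without_bom = text.encode("utf-8")      # plain UTF-8, no BOM
with_bom = text.encode("utf-8-sig")     # same bytes prefixed with EF BB BF

# The BOM is just the three bytes exposed as codecs.BOM_UTF8.
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding BOM-prefixed bytes as plain UTF-8 leaves U+FEFF at the start
# of the string, which is what trips up software that does not expect it.
leaked = with_bom.decode("utf-8")

# The utf-8-sig codec strips a leading BOM if present and also accepts
# input that has none, so it is a tolerant choice when reading.
cleaned = with_bom.decode("utf-8-sig")
also_ok = without_bom.decode("utf-8-sig")
```

Writing with `utf-8` and reading with `utf-8-sig` is one way to avoid producing a BOM while still tolerating files that contain one.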

UTF-16 could take less space if the majority of your text is composed of characters outside the basic Latin alphabet, such as Chinese, Japanese, or Korean text, where UTF-8 needs three bytes per character and UTF-16 only two.
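
The size trade-off can be checked directly. This is a small sketch assuming Python 3; the two sample strings are arbitrary, and `utf-16-le` is used so that no BOM is counted.

```python
samples = {
    "english": "character encoding",
    "chinese": "\u5b57\u7b26\u7f16\u7801",  # "字符编码"
}

# Byte length of each sample in UTF-8 and in UTF-16 (little-endian, no BOM).
sizes = {
    label: {
        "utf-8": len(s.encode("utf-8")),
        "utf-16-le": len(s.encode("utf-16-le")),
    }
    for label, s in samples.items()
}

for label, by_encoding in sizes.items():
    print(label, by_encoding)
```

The English sample is 18 bytes in UTF-8 and 36 in UTF-16, while the Chinese sample is 12 bytes in UTF-8 and 8 in UTF-16.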

"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.

An important thing about any encoding is that it is metadata that needs to be communicated in addition to the data. This means that you must know the encoding of a byte stream to interpret it correctly as text. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
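
Here is a short sketch of what goes wrong when that metadata is lost, assuming Python 3. The same bytes are decoded once with the right encoding and once with Windows-1252; the file name `sample.txt` and the sample text are placeholders.

```python
import os
import tempfile

text = "na\u00efve caf\u00e9"  # "naïve café"
data = text.encode("utf-8")

right = data.decode("utf-8")    # the original text
wrong = data.decode("cp1252")   # no error is raised, but the result is garbled

# When reading and writing files, pass the encoding explicitly instead of
# relying on the platform default, which differs between systems.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
```

Note that the wrong decode succeeds silently, so nothing flags the mistake until someone looks at the text.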

For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.

For Python files specifically, there's a way to specify the encoding of your source files.
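
The mechanism meant here is presumably the encoding declaration from PEP 263: a specially formatted comment on the first or second line of the file. A minimal sketch follows; the variable name and the string are only for illustration. In Python 3 the default source encoding is already UTF-8, so the declaration only matters for files saved in another encoding or for Python 2.

```python
# -*- coding: utf-8 -*-
# The comment above is the encoding declaration. It must appear on the
# first or second line of the source file to take effect.

# Non-ASCII literals are read according to the declared source encoding.
greeting = "Grüße, 世界"
print(greeting)
```

The declaration has to match the encoding the editor actually saves the file in; it describes the file and does not convert it.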

谜兔 2024-09-11 16:08:24

Here. Note that "ANSI" is usually CP1252.

秋叶绚丽 2024-09-11 16:08:24

You'll probably get the greatest utility with UTF-8 (no BOM). Forget that ANSI and ASCII exist; they are deprecated dinosaurs.
