Character encoding
My text editor lets me save files in several different character encodings: ANSI, UTF-8, UTF-8 (no BOM), UTF-16LE, and UTF-16BE.
What is the difference between them?
Which is commonly regarded as the best format (I'm using Python, if that makes a difference)?
Comments (3)
Generally speaking, UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because most other software expects UTF-8 without one).
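To make the BOM point concrete, here is a minimal Python sketch using only the standard library. The sample string and variable names are just illustrative. Python's "utf-8-sig" codec writes the three-byte BOM, while plain "utf-8" does not.

```python
import codecs

text = "hello"

# Plain UTF-8: no byte order mark is written.
without_bom = text.encode("utf-8")      # b'hello'

# "utf-8-sig" prepends the UTF-8 BOM (bytes EF BB BF) when encoding.
with_bom = text.encode("utf-8-sig")     # b'\xef\xbb\xbfhello'

print(without_bom)
print(with_bom)
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Decoding BOM-prefixed bytes as plain UTF-8 leaves a stray U+FEFF
# character at the start of the string, which is the kind of thing
# that trips up software that does not expect a BOM.
leaked = with_bom.decode("utf-8")        # '\ufeffhello'

# "utf-8-sig" strips the BOM on decoding if one is present.
stripped = with_bom.decode("utf-8-sig")  # 'hello'
```

If you have to read files that may or may not start with a BOM, decoding with "utf-8-sig" handles both cases.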
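UTF-16 could take less space if the majority of your text consists of characters that need three bytes in UTF-8 but only two in UTF-16, such as Chinese, Japanese, or Korean text. For text that is mostly ASCII (the basic Latin alphabet), UTF-8 is about half the size of UTF-16.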
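A quick way to check the size trade-off for your own text is to encode a sample both ways and compare byte counts. The sketch below is only an illustration: the helper name `encoded_sizes` and the two sample strings are made up for this example.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string.

    "utf-16-le" is used so that no BOM is added to the UTF-16 count.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, "
          f"UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

The English sample comes out smaller in UTF-8 (one byte per character versus two), while the Japanese sample comes out smaller in UTF-16 (two bytes per character versus three).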
"ANSI" should only be used when you specifically need to interact with a legacy application that doesn't support Unicode.
An important thing about any encoding is that it is metadata that needs to be communicated along with the data. You must know the encoding of a byte stream to interpret it correctly as text. So you should either use a format that records the encoding actually used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
For example, if you start a software project, you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.
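Here is a small sketch of what goes wrong when the encoding is not communicated: the same bytes decode to different text depending on the encoding the reader assumes. The sample word is arbitrary, and CP1252 stands in for a typical "ANSI" code page.

```python
# The writer encodes the text as UTF-8.
data = "café".encode("utf-8")    # b'caf\xc3\xa9'

# A reader who knows the encoding recovers the original text.
right = data.decode("utf-8")     # 'café'

# A reader who wrongly assumes CP1252 gets garbage and no error:
# each of the two UTF-8 bytes for "é" is shown as its own character.
wrong = data.decode("cp1252")    # 'cafÃ©'

# The reverse mistake (CP1252 bytes read as UTF-8) fails loudly instead,
# because the lone byte 0xE9 is not a valid UTF-8 sequence.
legacy = "café".encode("cp1252")  # b'caf\xe9'
try:
    legacy.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Nothing in the bytes themselves says which reading is correct, which is why the encoding has to be recorded or agreed on in advance.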
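For Python files specifically, there's a way to specify the encoding of your source files: an encoding declaration in a comment on the first or second line of the file (described in PEP 263).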
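As a sketch, a Python source file with an explicit encoding declaration in the PEP 263 style looks like this. The variable name and string are only an illustration, and the file is assumed to be saved as UTF-8 without a BOM.

```python
# -*- coding: utf-8 -*-
# The declaration above must appear on the first or second line of the file.
# Python 3 already assumes UTF-8 when no declaration is present, so the line
# mainly matters for Python 2 or for files saved in some other encoding.

greeting = "Grüße, 世界"

# Show the UTF-8 bytes behind the non-ASCII literal.
print(greeting.encode("utf-8"))
```

The declared encoding has to match the encoding the editor actually uses when saving the file; otherwise non-ASCII literals will be misread or the file will fail to compile.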
Here. Note that "ANSI" is usually CP1252 (Windows-1252) on Western-locale Windows systems; other locales use other code pages.
You'll probably get the greatest utility with UTF-8 without a BOM. Forget that ANSI and ASCII exist; they are deprecated dinosaurs.