What is the difference between UTF-8 and UTF-8 with BOM?

Posted 2024-08-20 15:33:56 · 111 words · 9 views · 0 comments

What's different between UTF-8 and UTF-8 with BOM?

Comments (22)

感情旳空白 2024-08-27 15:33:56

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
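
To make the three signature bytes concrete, here is a minimal Python sketch. The helper name `strip_utf8_bom` is made up for this illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Sample input: a BOM followed by UTF-8 encoded text.
with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# Stripping by hand and decoding with "utf-8-sig" yield the same text.
text_manual = strip_utf8_bom(with_bom).decode("utf-8")
text_codec = with_bom.decode("utf-8-sig")
```

The `utf-8-sig` codec accepts input with or without the signature, so it is usually the simpler choice when reading files of unknown origin.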

暖阳 2024-08-27 15:33:56

The other excellent answers already answered that:

  • There is no official difference between UTF-8 and BOM-ed UTF-8
  • A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF
  • Those bytes, if present, must be ignored when extracting the string from the file/stream.

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the data [EF BB BF 41 42 43] could either be:

  • The legitimate ISO-8859-1 string "ï»¿ABC"
  • The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.

Encodings should be known, not divined.
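
The ambiguity is easy to reproduce. The following Python sketch decodes the same six bytes three ways; the variable names are only for illustration.

```python
# The bytes from the example: EF BB BF 41 42 43
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

as_latin1 = data.decode("iso-8859-1")   # every byte maps to exactly one character
as_utf8 = data.decode("utf-8")          # the BOM decodes to U+FEFF and stays in the string
as_utf8_sig = data.decode("utf-8-sig")  # the BOM is treated as a signature and dropped
```

All three calls succeed, which is exactly why the leading bytes alone cannot prove which encoding was intended.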

晚风撩人 2024-08-27 15:33:56

There are at least three problems with putting a BOM in UTF-8 encoded files.

  1. Files that hold no text are no longer empty because they always contain the BOM.
  2. Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
  3. It is not possible to concatenate several files together because each file now has a BOM at the beginning.

And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:

  • It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
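
The three problems can be reproduced in a few lines of Python. This is only a sketch: `save_with_bom` is a hypothetical stand-in for an editor that always writes the signature, and the byte concatenation stands in for something like `cat a b`.

```python
def save_with_bom(text: str) -> bytes:
    """Encode text the way a BOM-writing editor would (hypothetical helper)."""
    return text.encode("utf-8-sig")

empty = save_with_bom("")            # problem 1: an "empty" file is 3 bytes long
ascii_only = save_with_bom("hello")  # problem 2: no longer pure ASCII
combined = save_with_bom("first\n") + save_with_bom("second\n")  # problem 3

# Only the leading BOM is removed on decode; the second one
# survives as a stray U+FEFF in the middle of the text.
decoded = combined.decode("utf-8-sig")
```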

三人与歌 2024-08-27 15:33:56

Here are examples of BOM usage that cause real problems, yet many people don't know about them.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter all start with a shebang line, which looks like one of these:

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.

See Wikipedia, article: Shebang, section: Magic number:

The shebang characters are represented by the same two bytes in
extended ASCII encodings, including UTF-8, which is commonly used for
scripts and other text files on current Unix-like systems. However,
UTF-8 files may begin with the optional byte order mark (BOM); if the
"exec" function specifically detects the bytes 0x23 and 0x21, then the
presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent
the script interpreter from being executed.
Some authorities recommend
against using the byte order mark in POSIX (Unix-like) scripts,[14]
for this reason and for wider interoperability and philosophical
concerns. Additionally, a byte order mark is not necessary in UTF-8,
as that encoding does not have endianness issues; it serves only to
identify the encoding as UTF-8. [emphasis added]
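
As a rough illustration of the magic-number point, the following Python sketch classifies the first bytes of a script. The function name and the three labels are invented for this example; it only mimics the kind of byte comparison the quote describes and is not how any particular kernel implements `exec`.

```python
import codecs

def shebang_status(data: bytes) -> str:
    """Classify the first bytes of a script file."""
    if data.startswith(b"#!"):
        return "ok"                  # magic number 0x23 0x21 sits at offset 0
    if data.startswith(codecs.BOM_UTF8 + b"#!"):
        return "bom-before-shebang"  # EF BB BF pushes "#!" to offset 3
    return "no-shebang"
```

A check like this could run as a pre-commit step to catch scripts that an editor silently saved with a signature.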

BOM is illegal in JSON

See RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
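
Python's standard `json` module shows how a strict parser can treat this. In the sketch below, `parses` is a small helper written for the example; the behaviour shown is that of `json.loads` when it is handed a `str` that still begins with U+FEFF.

```python
import json

payload = '{"name": "value"}'

def parses(text: str) -> bool:
    """Return True if json.loads accepts the given string."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Producing JSON bytes without a BOM: encode with plain "utf-8".
encoded = json.dumps({"name": "value"}).encode("utf-8")
```

If incoming text may carry a signature, decoding the raw bytes with `utf-8-sig` before parsing removes it.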

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

Determining the encoding and endianness of JSON, examining the first four bytes for the NUL byte:

00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8

Now, if the file starts with BOM it will look like this:

00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8

Note that:

  1. UTF-32BE doesn't start with three NULs, so it won't be recognized
  2. In UTF-32LE the first byte is not followed by three NULs, so it won't be recognized
  3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
  4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized

Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally, if the implementation tests for valid JSON as I recommend, it will reject even input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character (code point below 128), as it should according to the RFC.
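
The NUL-pattern tables above can be sketched as a small Python function. The name `detect_json_encoding` and the `"unknown"` label are assumptions made for this example; the logic follows only the five patterns listed and is not taken from any real parser.

```python
def detect_json_encoding(first4: bytes) -> str:
    """Guess the encoding from the NUL pattern of the first four bytes."""
    if len(first4) < 4:
        return "utf-8"  # too short for the pattern test; this sketch assumes UTF-8
    nul = tuple(b == 0 for b in first4[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
        (False, False, False, False): "utf-8",    # xx xx xx xx
    }
    # Any other arrangement of NULs (as produced by a leading BOM) matches nothing.
    return patterns.get(nul, "unknown")
```

Feeding it the BOM-prefixed byte patterns from the second table returns `"unknown"` for the UTF-16 and UTF-32 cases, while `EF BB BF xx` is still reported as UTF-8 even though its first byte is not ASCII.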

Other data formats

A BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need to. Just don't call it JSON then.

For data formats other than JSON, take a look at what the data actually looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs, even as an optional feature, would only make it more complicated and error-prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.

画尸师 2024-08-27 15:33:56

What's different between UTF-8 and UTF-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
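
A short Python sketch of the byte-order mechanics described here, using only built-in codecs (the variable names are illustrative):

```python
BOM = "\ufeff"  # the character U+FEFF

utf8_bom = BOM.encode("utf-8")       # the three-byte UTF-8 signature
utf16_be = BOM.encode("utf-16-be")   # big-endian serialization
utf16_le = BOM.encode("utf-16-le")   # little-endian serialization

# Reading little-endian bytes with the wrong (big-endian) order yields U+FFFE,
# the permanently unassigned code point that signals swapped bytes.
swapped = utf16_le.decode("utf-16-be")
```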

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
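
A validity check of this kind is a one-liner in most languages. Here is a minimal Python sketch; the function name `looks_like_utf8` is made up for the example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Pure ASCII input also passes, since ASCII is a subset of UTF-8; the check becomes informative once bytes above 0x7F appear.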

π浅易 2024-08-27 15:33:56

UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.

Prepending the BOM character to Unicode text files is recommended by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003
at https://www.rfc-editor.org/rfc/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
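
For the CSV case specifically, a sketch of how the signature could be produced from Python. The helper name `csv_bytes_for_excel` and the sample rows are invented for the example; `utf-8-sig` is the standard codec that prepends EF BB BF.

```python
import csv
import io

def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows as CSV and encode them as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")

data = csv_bytes_for_excel([["name", "city"], ["Zoë", "Kraków"]])
```

When writing straight to disk, `open(path, "w", newline="", encoding="utf-8-sig")` should have the same effect.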

如梦亦如幻 2024-08-27 15:33:56

Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require
or recommend its use. Byte order has no meaning in UTF-8, so its
only use in UTF-8 is to signal at the start that the text stream is
encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility
with software that is not Unicode-aware... Another motivation for not
using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is
required to determine what character encoding a file is using.
Historically such analysis, to distinguish various 8-bit encodings, is
complicated, error-prone, and sometimes slow. A number of libraries
are available to ease the task, such as Mozilla Universal Charset
Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally
difficult (it is not because of the vast majority of byte sequences
are invalid UTF-8, while the encodings these libraries are trying to
distinguish allow all possible byte sequences). Therefore not all
Unicode-aware programs perform such an analysis and instead rely on
the BOM.

In particular, Microsoft compilers and interpreters, and many
pieces of software on Microsoft Windows such as Notepad will not
correctly read UTF-8 text unless it has only ASCII characters or it
starts with the BOM, and will add a BOM to the start when saving text
as UTF-8. Google Docs will add a BOM when a Microsoft Word document is
downloaded as a plain text file.

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8,
or (b) has some other way to indicate what encoding is being used,
then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.


† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.

夜司空 2024-08-27 15:33:56

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
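
One way to sketch this guess-then-verify approach in Python is shown below. The function name `decode_text`, the returned labels and the `cp1252` default are all assumptions made for the example; the right legacy code page depends on where the data comes from.

```python
import codecs
from typing import Tuple

def decode_text(data: bytes, legacy: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes, returning (text, label of the encoding that was used)."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        # Validate even when a BOM was found, in case the guess was wrong.
        return body.decode("utf-8"), ("utf-8-sig" if has_bom else "utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat all bytes as legacy text (an assumed code page).
        return data.decode(legacy, errors="replace"), legacy
```

Metadata such as a declared charset, when available, should take precedence over any of these guesses.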

Why is a BOM not recommended?

  1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
  2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.

想你的星星会说话 2024-08-27 15:33:56

BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, when it doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, an HTML file, JSON response, RSS, etc.) and causes embarrassments like the recent encoding issue experienced during Obama's talk on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.

永不分离 2024-08-27 15:33:56

UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

南城旧梦 2024-08-27 15:33:56

It should be noted that some files must not have a BOM even on Windows. Examples are SQL*plus or VBScript files. If such files contain a BOM, you get an error when you try to execute them.

走过海棠暮 2024-08-27 15:33:56

Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

晚风撩人 2024-08-27 15:33:56

UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it may break older applications that would otherwise have interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.

Don't make anything expect a BOM for UTF-8.

瞎闹 2024-08-27 15:33:56

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.

初心 2024-08-27 15:33:56

When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file in Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file

$file="\xEF\xBB\xBF".$string;

I was not able to save the French letters in an XML file.

埋葬我深情 2024-08-27 15:33:56

One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (which, in this editor's terminology, means with a BOM), you will get the response:

#!/bin/bash: No such file or directory

in response to the shebang line specifying which shell you wish to use:

#!/bin/bash

If you save as UTF-8, no BOM (say in BBEdit) all will be well.
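
If re-saving from the editor is not an option, the signature can also be stripped afterwards. The following Python sketch does that on a throwaway file; `remove_utf8_bom` and the script name `hello.sh` are made up for the example.

```python
import codecs
import tempfile
from pathlib import Path

def remove_utf8_bom(path) -> bool:
    """Rewrite the file without its leading UTF-8 BOM; return True if one was removed."""
    file = Path(path)
    data = file.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    file.write_bytes(data[len(codecs.BOM_UTF8):])
    return True

# Demo on a temporary file so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "hello.sh"
    script.write_bytes(codecs.BOM_UTF8 + b"#!/bin/bash\necho hello\n")
    removed_first = remove_utf8_bom(script)   # a BOM is present on the first pass
    cleaned = script.read_bytes()
    removed_second = remove_utf8_bom(script)  # nothing left to remove
```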

那一片橙海, 2024-08-27 15:33:56

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer, destroying the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns

爱你是孤单的心事 2024-08-27 15:33:56

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as
    files. When you need to conform to such a protocol, use a BOM.

  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM,
      the encoding could be anything.

    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there
      is no BOM, the text should be interpreted as big-endian.

  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the
    BOM as encoding form signature should be avoided.

  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In
    particular, whenever a data stream is declared to be UTF-16BE,
    UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
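
Python's built-in codecs happen to mirror guideline 4, which makes for a compact illustration (a sketch only; the variable names are invented here):

```python
import codecs

text = "A"

explicit_be = text.encode("utf-16-be")  # byte order declared: no BOM is written
explicit_le = text.encode("utf-16-le")  # byte order declared: no BOM is written
generic = text.encode("utf-16")         # byte order undeclared: a BOM is prepended

# The generic codec reads the BOM back to pick the byte order.
round_trip = generic.decode("utf-16")
```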

纵性 2024-08-27 15:33:56

From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode
character used to signal the
endianness (byte order) of a text file
or stream. Its code point is U+FEFF.
BOM use is optional, and, if used,
should appear at the start of the text
stream. Beyond its specific use as a
byte-order indicator, the BOM
character may also indicate which of
the several Unicode representations
the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
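
The byte counts can be checked with a few lines of Python. This sketch assumes the "ANSI" code page in question is the Greek one, `cp1253`; on another system the legacy code page would differ.

```python
greek = "αβγ"

utf8_bytes = greek.encode("utf-8")   # two bytes per Greek letter
ansi_bytes = greek.encode("cp1253")  # one byte per letter in the assumed code page

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The single-byte form is not valid UTF-8, which is what trips up tools further down the chain.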

痞味浪人 2024-08-27 15:33:56

Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:

It turns out that a file saved as UTF-8 with a signature (BOM) shows a red dot character in each file when reviewing a pull request (it can be quite annoying).

If you hover over it, it shows a character like "ufeff". It turns out Sourcetree does not show these byte marks, so they will most likely end up in your pull requests. That should be OK, because that is how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore this or display it another way. More info here:

Red dot marker BitBucket diff view

皇甫轩 2024-08-27 15:33:56

When I save an AutoHotkey file as UTF-8, the Chinese characters become garbled.

With utf-8 BOM, works fine.

AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.

https://www.autohotkey.com/docs/FAQ.htm#nonascii

稍尽春風 2024-08-27 15:33:56

UTF with a BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.

That is my opinion (30 years of computing and IT industry).
