The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess that a file is encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
The other excellent answers already answered that:
There is no official difference between UTF-8 and BOM-ed UTF-8
A BOM-ed UTF-8 string will start with the following three bytes: EF BB BF
Those bytes, if present, must be ignored when extracting the string from the file/stream.
But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
The legitimate ISO-8859-1 string "ï»¿ABC"
The legitimate UTF-8 string "ABC", preceded by a BOM
So while it can be cool to recognize the encoding of file content by looking at the first bytes, you should not rely on this, as shown by the example above.
Encodings should be known, not divined.
There are at least three problems with putting a BOM in UTF-8 encoded files.
Files that hold no text are no longer empty because they always contain the BOM.
Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
It is not possible to concatenate several files together because each file now has a BOM at the beginning.
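A quick illustration of the concatenation problem, as a minimal Python sketch (the strings stand in for the contents of two BOM-ed files):

a = "\ufeffHello, ".encode("utf-8")  # contents of a BOM-ed UTF-8 file
b = "\ufeffworld".encode("utf-8")    # contents of another BOM-ed UTF-8 file
merged = (a + b).decode("utf-8")
print(repr(merged))  # '\ufeffHello, \ufeffworld' - a stray U+FEFF stranded mid-text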
And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:
It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
Here are examples of BOM usage that actually cause real problems, and yet many people don't know about them.
BOM breaks scripts
Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter - all start with a shebang line, which looks like one of these:
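#!/bin/sh
#!/usr/bin/perl
#!/usr/bin/env python3
#!/usr/bin/env node
(representative examples; the exact interpreter path varies by system)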
It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it has a different magic number, and that can lead to problems.
See Wikipedia, article: Shebang, section: Magic number:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]
BOM is illegal in JSON
See RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
BOM is redundant in JSON
Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and endianness used in any JSON stream (see this answer for details).
BOM breaks JSON parsers
Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:
Determining the encoding and endianness of JSON by examining the first four bytes for NUL bytes:
00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8
Now, if the file starts with BOM it will look like this:
00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8
Note that:
UTF-32BE doesn't start with three NULs, so it won't be recognized
In UTF-32LE, the first byte is not followed by three NULs, so it won't be recognized
UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
UTF-16LE has only one NUL in the first four bytes, so it won't be recognized
Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
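To make the failure concrete, here is a minimal Python sketch of the RFC 4627 heuristic described above (the function name is mine, not from any library):

def detect_json_encoding(data: bytes) -> str:
    # RFC 4627: the first two characters of JSON are always ASCII,
    # so NUL-byte patterns in the first four bytes reveal the encoding.
    b = data[:4]
    if b[0:3] == b"\x00\x00\x00":
        return "UTF-32BE"
    if b[0:1] == b"\x00" and b[2:3] == b"\x00":
        return "UTF-16BE"
    if b[1:4] == b"\x00\x00\x00":
        return "UTF-32LE"
    if b[1:2] == b"\x00" and b[3:4] == b"\x00":
        return "UTF-16LE"
    return "UTF-8"

print(detect_json_encoding('{}'.encode("utf-16-be")))        # UTF-16BE (correct)
print(detect_json_encoding('\ufeff{}'.encode("utf-16-be")))  # UTF-8 (wrong: the BOM broke detection)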
Other data formats
BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course, anyone is free to use things like BOMs or anything else if they need it - just don't call it JSON then.
For data formats other than JSON, look at how the format is actually defined. If the only allowed encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make the format more complicated and error-prone.
Other uses of BOM
As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.
What's different between UTF-8 and UTF-8 without BOM?
Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.
Long answer:
Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
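You can see the byte-order dependence, and UTF-8's lack of it, directly (a quick Python check):

print("\ufeff".encode("utf-16-be").hex())  # feff   - big-endian order
print("\ufeff".encode("utf-16-le").hex())  # fffe   - little-endian order
print("\ufeff".encode("utf-8").hex())      # efbbbf - one form, no ordering choice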
Which is better?
Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
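In Python, for example, such a validity check is a one-line decode attempt (a sketch, not a full charset detector):

def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")  # strict decoding rejects invalid byte sequences
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False (a lone 0xEF is invalid UTF-8)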
UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
Prepending the BOM character to Unicode text files is recommended by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003, at https://www.rfc-editor.org/rfc/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
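If you generate such CSVs programmatically, Python's built-in "utf-8-sig" codec writes exactly those three bytes for you (a minimal sketch; the file name and fields are made up):

import csv

with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)                 # utf-8-sig prepends EF BB BF on write
    writer.writerow(["name", "city"])      # header row
    writer.writerow(["José", "München"])   # non-ASCII cells Excel should render correctly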
Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?
Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.
On the meaning of the BOM and UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.
Argument for NOT using a BOM:
The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.
Argument FOR using a BOM:
The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.
Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not, because the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries try to distinguish allow all possible byte sequences). Therefore, not all Unicode-aware programs perform such an analysis and instead rely on the BOM.
In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
On which is better, WITH or WITHOUT the BOM:
The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”
My Conclusion:
Use the BOM only if compatibility with a software application is absolutely essential.
Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.
† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.
This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.
As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.
If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume those BOM-like bytes are not a UTF-8 BOM (i.e., that the file is latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding, as the sketch below shows.
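A minimal Python sketch of that decision order (codecs.BOM_UTF8 is a real constant; the latin-1 fallback is an assumption about the source):

import codecs

def read_text(path: str) -> str:
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # BOM present: high probability of UTF-8; strip the signature.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return raw.decode("utf-8")    # no BOM: validate against the encoding
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # assumed fallback for this source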
Why is a BOM not recommended?
Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file-system metadata), and the programs being used expect BOMs, you should encode with a BOM. This is especially true on Windows, where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.
BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters  at the start of the document (for example, an HTML file, a JSON response, RSS, etc.) and causes the kind of embarrassment seen in the recent encoding issue experienced during the talk of Obama on Twitter.
It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.
The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.
Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.
It should be noted that for some files you must not have the BOM, even on Windows. Examples are SQL*Plus or VBScript files. If such files contain a BOM, you get an error when you try to execute them.
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.
I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.
Don't make anything expect a BOM for UTF-8.
I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.
I have been using multiple languages (even Cyrillic) on my pages for a long time, and when files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters get corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.
When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.
But this is not the case when we have text, CSV and XML files, either on Windows or Linux.
For example, a text file in Windows or Linux, one of the easiest things imaginable, it is not (usually) UTF-8.
Save it as XML and declare it as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
It will not display (it will not be read) correctly, even if it's declared as UTF-8.
I had a string of data containing French letters that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in the IDE and "Create New File") or adding the BOM at the beginning of the file,
$file = "\xEF\xBB\xBF" . $string; // prepend the three UTF-8 BOM bytes
I was not able to save the French letters in an XML file.
One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (i.e., with a BOM), you will get an error in response to the shebang line specifying which shell you wish to use. If you save as UTF-8, no BOM (say, in BBEdit), all will be well.
As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.
Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. The files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout, again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOM-fed HTML file. Never again.
Also, I just found this in Wikipedia:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases,
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
From http://en.wikipedia.org/wiki/Byte-order_mark:
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.
My real problem with the absence of BOM is the following. Suppose we've got a file which contains:
abc
Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:
abg-αβγ
Oops... Now the file is still in ANSI and, guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8, and this causes other problems later on in the development chain.
Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:
So it turns out that the BOM signature will show up as a red-dot character on each file when reviewing a pull request (it can be quite annoying).
If you hover over it, it shows a character like "ufeff", but it turns out Sourcetree does not show these types of byte marks, so it will most likely end up in your pull requests. That should be OK, because that's how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore this or show it in another way. More info here:
Red dot marker BitBucket diff view
I saved an AutoHotkey file with UTF-8, and the Chinese characters became strange.
With UTF-8 BOM, it works fine.
https://www.autohotkey.com/docs/FAQ.htm#nonascii
UTF with a BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.
That is my opinion (30 years of computing and IT industry).