Why would I use a Unicode signature byte-order mark (BOM)?

Posted 2024-07-25 10:11:22

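Are these obsolete? They seem like the worst idea ever: embedding something in the contents of a file that no one can see, but that affects how the file works. I don't understand why I would want one.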


我不咬妳我踢妳 2024-08-01 10:11:22

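They are necessary in some cases, yes, because UTF-16 has both little-endian and big-endian implementations.

When reading an unknown UTF-16 file, how can you tell which of the two is in use?
The only solution is to place some easily identifiable marker in the file that can never be mistaken for anything else, regardless of the byte order used.

That is what the BOM does.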
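
As a rough illustration of that idea, here is a minimal Python sketch that inspects the first two bytes of UTF-16 data and reports the byte order. The function name `utf16_byte_order` is made up for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs

def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return "big"
    return "unknown"  # no BOM: the reader has to be told, or guess

text = "hi"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(utf16_byte_order(le))  # little
print(utf16_byte_order(be))  # big
```

The mark is hard to misread because its byte-swapped form, U+FFFE, is a noncharacter that does not occur in valid text.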

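Do you need one? Only if 1) you use a UTF encoding where endianness is an issue (it matters for UTF-16, but UTF-8 always looks the same regardless of endianness), and 2) the file will be shared with external applications.

If your own application is the only one that will read and write the file, you can omit the BOM and simply decide once and for all which byte order to use. But if another application has to read the file, it won't know the byte order in advance, so adding a BOM may be a good idea.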
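
On the writing side, the two cases might look like the following Python sketch. The variable names `private` and `shared` are only illustrative, and little-endian is an arbitrary choice for the example.

```python
import codecs

text = "hi"

# Private file, byte order fixed by convention: no BOM is written.
private = text.encode("utf-16-le")

# Shared file: prepend the BOM that matches the chosen byte order.
shared = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(private.hex())  # 68006900
print(shared.hex())   # fffe68006900

# A reader using Python's generic "utf-16" codec picks the byte order
# from the BOM and strips it from the decoded text.
print(shared.decode("utf-16"))  # hi
```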

合久必婚 2024-08-01 10:11:22

Some excerpts from the UTF and BOM FAQ from the Unicode Consortium may be helpful.

Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. (Emphasis mine.)

I wouldn't exactly say the byte-order mark is embedded in the data. Rather, it prefixes the data. The character is only a byte-order mark when it's the first thing in the data stream. Anywhere else, and it's the zero-width non-breaking space. Unicode-aware programs that don't honor the byte-order mark aren't really harmed by its presence anyway since the character is invisible, and a word-joiner at the start of a block of text just joins the next character to nothing, so it has no effect.

Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format; it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used.

So, you'd want a BOM when your program is capable of handling multiple encodings of Unicode. How else will your program know which encoding to use when interpreting its input?

Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.

That's probably the case where the BOM is used most frequently today. It distinguishes UTF-8-encoded text from any other encodings; it's not really marking the order of the bytes since UTF-8 only has one order.
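
To make the "signature" idea concrete, here is a small Python sketch that derives each signature by encoding U+FEFF and then sniffs a byte string against them. The names `ENCODINGS`, `SIGNATURES`, and `sniff_encoding` are invented for this example, and UTF-7 is left out for brevity.

```python
from typing import Optional

BOM = "\ufeff"
ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# The signature for each form is just U+FEFF run through that encoding.
for name in ENCODINGS:
    print(name, BOM.encode(name).hex(" "))

# Check the longest signatures first: the UTF-32-LE BOM (ff fe 00 00)
# begins with the UTF-16-LE BOM (ff fe).
SIGNATURES = sorted(
    ((BOM.encode(name), name) for name in ENCODINGS),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

def sniff_encoding(data: bytes) -> Optional[str]:
    """Return the encoding whose BOM prefixes `data`, or None if there is none."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None

print(sniff_encoding(b"\xef\xbb\xbfabc"))  # utf-8
print(sniff_encoding(b"abc"))              # None
```

One caveat with this approach: UTF-16-LE text that starts with a BOM followed by U+0000 has the same first four bytes as the UTF-32-LE BOM, so a sniffer like this would label it UTF-32-LE.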

If you're designing your own protocol or data format, you're not required to use a BOM. Another question from the FAQ touches on that:

Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.

It mentions the concept of tagging your data's format. That means specifying the format out-of-band from the data itself. That's great if such a facility is available to you, but it's often not, especially when older systems are being retrofitted for Unicode.
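
Python's codec names happen to follow the same labeling convention, so the difference between the tags can be sketched as below. The sample data is made up, and this shows how Python's decoders behave, not something the FAQ prescribes for every library.

```python
# Little-endian UTF-16 bytes that begin with a BOM: ff fe 68 00 69 00
data_with_bom = "\ufeffhi".encode("utf-16-le")

# Tagged simply "UTF-16": the decoder reads the BOM to pick the byte
# order and removes it from the result.
print(repr(data_with_bom.decode("utf-16")))     # 'hi'

# Tagged "UTF-16LE": the byte order is already known, so a leading
# U+FEFF is kept as an ordinary character.
print(repr(data_with_bom.decode("utf-16-le")))  # '\ufeffhi'

# Data tagged "UTF-16BE" needs no BOM at all.
data_be = "hi".encode("utf-16-be")
print(repr(data_be.decode("utf-16-be")))        # 'hi'
```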

陌生 2024-08-01 10:11:22

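The BOM indicates which Unicode encoding the file is in. Without that distinction, a Unicode reader would not know how to read the file.

However, UTF-8 does not require a BOM.

See the Wikipedia article.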

酒中人 2024-08-01 10:11:22

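Since you tagged this with UTF-8, I'd say you don't need a BOM. The byte-order mark is only useful for UTF-16 and UTF-32, because it tells the computer whether the file is big-endian or little-endian. Some text editors may use the byte-order mark to decide which encoding a document uses, but this is not part of the Unicode standard.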

巾帼英雄 2024-08-01 10:11:22

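The "BOM" is a holdover from the early days of Unicode, when it was assumed that using Unicode meant using 16-bit characters. It is completely pointless for an encoding like UTF-8, which has only one byte order. For UTF-32, the choice of U+FEFF is also suboptimal, because it cannot distinguish all the possible middle-endian byte orders (doing that would require a BOM encoded as 4 different bytes).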
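
The UTF-32 point can be checked by counting. The Python sketch below enumerates every possible ordering of four bytes and counts how many distinct byte patterns U+FEFF produces. The 4-distinct-byte marker `01 02 03 04` is purely hypothetical and is there only for comparison.

```python
from itertools import permutations

bom_bytes = "\ufeff".encode("utf-32-be")       # 00 00 fe ff
orders = list(permutations(range(4)))          # every possible byte order

def distinct_patterns(marker: bytes) -> int:
    """Count the distinct byte patterns a 4-byte marker yields across all orders."""
    return len({bytes(marker[i] for i in order) for order in orders})

print(len(orders))                             # 24
print(distinct_patterns(bom_bytes))            # 12
print(distinct_patterns(b"\x01\x02\x03\x04"))  # 24 (hypothetical marker)
```

Because two of the four bytes are both 0x00, swapping them changes nothing, so pairs of byte orders look identical.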

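The only reason to use one is when sending UTF-16 or UTF-32 data between platforms with different byte orders, but (1) most people use UTF-8 anyway, and (2) the MIME charset parameter provides a better mechanism.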
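
As a sketch of the MIME route, the Python snippet below reads the charset from a Content-Type header with the standard `email` package and uses it to decode a body that carries no BOM. The header value and body are invented for the example.

```python
from email.message import Message

# A hypothetical header as it might arrive with the data.
msg = Message()
msg["Content-Type"] = 'text/plain; charset="UTF-16LE"'

charset = msg.get_content_charset()   # lower-cased charset name
body = "hi".encode("utf-16-le")       # payload without a BOM

print(charset)               # utf-16le
print(body.decode(charset))  # hi
```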

怎会甘心 2024-08-01 10:11:22

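Just as the UTF-16 and UTF-32 BOMs tell you whether the content is in big-endian or little-endian format, and that it is Unicode at all, the UTF-8 BOM identifies a file as UTF-8 encoded. Without a UTF-8 BOM, how would you know whether it is an ANSI file or a UTF-8 encoded file? The UTF-8 BOM of course says nothing about byte order, since UTF-8 is always a byte stream, but it does tell you whether the content is UTF-8 encoded Unicode or ANSI. You could of course scan for valid UTF-8 sequences, but in my opinion it is easier to check the first three bytes of the file.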
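
Both approaches can be sketched in a few lines of Python. The helper names `has_utf8_bom` and `is_valid_utf8` are made up for this example, and Windows-1252 (`cp1252`) stands in for "ANSI" as an assumption.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Cheap check: do the first three bytes equal ef bb bf?"""
    return data.startswith(codecs.BOM_UTF8)

def is_valid_utf8(data: bytes) -> bool:
    """Costlier check: does the whole buffer decode as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_with_bom = codecs.BOM_UTF8 + "café".encode("utf-8")
utf8_no_bom = "café".encode("utf-8")
legacy = "café".encode("cp1252")      # b'caf\xe9'

print(has_utf8_bom(utf8_with_bom))    # True
print(has_utf8_bom(utf8_no_bom))      # False
print(is_valid_utf8(utf8_no_bom))     # True
print(is_valid_utf8(legacy))          # False

# The "utf-8-sig" codec strips a leading BOM if one is present.
print(utf8_with_bom.decode("utf-8-sig"))  # café
```

Note that the validity scan is only a heuristic in the other direction: pure ASCII text passes it whether it was meant as UTF-8 or as a legacy encoding.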
