Why would I use a Unicode signature byte-order mark (BOM)?
Are these obsolete? They seem like the worst idea ever -- embed something in the contents of your file that no one can see, but impacts the file's functionality. I don't understand why I would want one.
They're necessary in some cases, yes, because there are both little-endian and big-endian implementations of UTF-16.
When reading an unknown UTF-16 file, how can you tell which of the two is used?
The only solution is to place some kind of easily identifiable marker in the file, which can never be mistaken for anything else, regardless of the endian-ness used.
That's what the BOM does.
And do you need one? Only if 1) you're using a UTF encoding where endianness is an issue (it matters for UTF-16, but UTF-8 always looks the same regardless of endianness), and 2) the file is going to be shared with external applications.
If your own app is the only one that's going to read and write the file, you can omit the BOM, and simply decide once and for all which endianness you're going to use. But if another application has to read the file, it won't know the endianness in advance, so adding the BOM might be a good idea.
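As a rough illustration of what a reader of an unknown UTF-16 file does with the BOM, here is a minimal Python sketch. The function name `sniff_utf16_byte_order` is made up for this example, and it assumes the data is already known to be UTF-16 (a UTF-32 little-endian BOM begins with the same two bytes, so a general-purpose sniffer would have to check for that first).

```python
import codecs
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'little', 'big', or None when no UTF-16 BOM is present.

    Hypothetical helper: assumes the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    return None


# U+FEFF serialized big-endian is FE FF; little-endian is FF FE.
sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_utf16_byte_order(sample))  # big

# Python's generic "utf-16" codec reads the BOM and strips it while decoding.
print(sample.decode("utf-16"))  # hi
```

If your app is the only reader and writer, you could skip all of this and always use, say, `"utf-16-le"` explicitly.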
Some excerpts from the UTF and BOM FAQ from the Unicode Consortium may be helpful.
I wouldn't exactly say the byte-order mark is embedded in the data. Rather, it prefixes the data. The character is only a byte-order mark when it's the first thing in the data stream. Anywhere else, and it's the zero-width non-breaking space. Unicode-aware programs that don't honor the byte-order mark aren't really harmed by its presence anyway since the character is invisible, and a word-joiner at the start of a block of text just joins the next character to nothing, so it has no effect.
So, you'd want a BOM when your program is capable of handling multiple encodings of Unicode. How else will your program know which encoding to use when interpreting its input?
That's probably the case where the BOM is used most frequently today. It distinguishes UTF-8-encoded text from any other encodings; it's not really marking the order of the bytes since UTF-8 only has one order.
If you're designing your own protocol or data format, you're not required to use a BOM. Another question from the FAQ touches on that:
It mentions the concept of tagging your data's format. That means specifying the format out-of-band from the data itself. That's great if such a facility is available to you, but it's often not, especially when older systems are being retrofitted for Unicode.
The BOM signifies which encoding of Unicode the file is in. Without this distinction, a Unicode reader would not know how to read the file.
However, UTF-8 doesn't require a BOM.
Check out the Wikipedia article.
As you tagged this with UTF-8, I'm going to say you don't need a BOM. Byte order marks are only useful for UTF-16 and UTF-32, where they tell the reader whether the file is big-endian or little-endian. Some text editors may use the byte order mark to decide what encoding the document uses, but this is not part of the Unicode standard.
The "BOM" is a holdover from the early days of Unicode when it was assumed that using Unicode would mean using 16-bit characters. It is completely pointless in an encoding like UTF-8 which has only one byte order. The choice of U+FEFF is also suboptimal for UTF-32, because it cannot distinguish between all possible middle-endian byte orders (to do so would require a BOM encoded with 4 different bytes).
The only reason you'd use one is when sending UTF-16 or UTF-32 data between platforms with different byte orders, but (1) most people use UTF-8 anyway, and (2) the MIME charset parameter provides a better mechanism.
Just as the UTF-16 and UTF-32 BOMs tell you whether the content is in big-endian or little-endian format, and also that the content is Unicode, the UTF-8 BOM classifies the file as UTF-8 encoded. Without the UTF-8 BOM, how can you know whether it is an ANSI file or a UTF-8 encoded file? The UTF-8 BOM doesn't tell you the endianness, of course, because UTF-8 is always a byte stream, but it does tell you whether the content is UTF-8 encoded Unicode or ANSI. Of course you can scan for valid UTF-8 sequences, but in my opinion it is easier to check the first three bytes of the file.
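To make the two options concrete (check the first three bytes, or scan for valid UTF-8), here is a small Python sketch. `guess_encoding` is a name invented for this example, and treating "ANSI" as Windows code page 1252 is my assumption; substitute whatever legacy code page you actually expect.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess between UTF-8 and a legacy 'ANSI' code page.

    Hypothetical helper; the cp1252 fallback is an assumption.
    """
    # Cheap check: the UTF-8 BOM is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec also strips the BOM when decoding
    # Slower check: see whether the whole buffer is valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


print(guess_encoding(codecs.BOM_UTF8 + b"abc"))  # utf-8-sig
print(guess_encoding("é".encode("cp1252")))      # cp1252
```

Note that pure-ASCII data with no BOM passes the validity scan and is reported as `utf-8`, which is harmless because ASCII decodes identically either way.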
UTF-16 and UTF-32 can be written in both big-endian and little-endian forms. You could try to determine the endianness heuristically by analysing the result of treating the file in either byte order, but to save you all that bother, the BOM can tell you right away.
UTF-8 doesn't really need a BOM though, as you decode it byte by byte.
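For what it's worth, one possible version of the heuristic mentioned above looks like this in Python. It is only a sketch: `guess_utf16_byte_order` is a made-up name, and the zero-byte counting trick assumes the text is mostly characters below U+0100 (ASCII or Latin-1), which is exactly why a BOM is the more reliable signal.

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order for BOM-less data by counting zero bytes.

    Hypothetical heuristic: for characters below U+0100 the high byte is 0,
    so big-endian data has zeros at even offsets, little-endian at odd ones.
    """
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if even_zeros > odd_zeros:
        return "big"
    if odd_zeros > even_zeros:
        return "little"
    return None  # inconclusive, e.g. empty input or mostly non-Latin text


print(guess_utf16_byte_order("hello".encode("utf-16-le")))  # little
print(guess_utf16_byte_order("hello".encode("utf-16-be")))  # big
```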
Regardless of whether you use these yourself when creating text files, it's probably worthwhile to be aware of them when you read text files, i.e. detect and skip (and ideally handle accordingly) the BOM at the beginning of the file. I've run into a few files which had one, and which caused me some issues initially until I figured out what was going on.
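In Python, for example, the kind of issue I mean and one way around it look roughly like this. The `decode_skipping_bom` wrapper is just a name for this sketch; the `utf-8-sig` codec it relies on is part of the standard library and accepts input with or without a BOM.

```python
import codecs


def decode_skipping_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if there is one."""
    return data.decode("utf-8-sig")


raw = codecs.BOM_UTF8 + b"id,name"

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF character,
# so the first field no longer compares equal to "id".
print(raw.decode("utf-8").split(",")[0] == "id")      # False

# Decoding with utf-8-sig removes it.
print(decode_skipping_bom(raw).split(",")[0] == "id")  # True
```

The same codec name can be passed to `open(path, encoding="utf-8-sig")` when reading a file directly.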