What exactly causes binary files to display as "gibberish"?
I haven't found an answer to this particular question; perhaps there isn't one. But I've been wondering for a while about it.
What exactly causes a binary file to display as "gibberish" when you look at it in a text editor? The same thing happens with encrypted files. Is the editor trying to convert the file's binary values into ASCII? Is it possible to switch the view to display the raw binary values, i.e. to show the 1s and 0s that make up the file?
Finally, is there a way to determine which program will properly open a data file? Many times, especially on Windows, a file is orphaned or otherwise not associated with a particular program. Opening it in a text editor sometimes tells you where it belongs, but most of the time it doesn't, because of the gibberish. If the extension doesn't provide any information, how can you determine which program the file belongs to?
Yes, that's exactly what's happening. Typically, the binary values of the file also include ASCII control characters that aren't printable, resulting in even more bizarre display in a typical text editor.
to show the 1s and 0s that make up the file?
It depends on your editor. What you want is a "hex editor", rather than a normal text editor. This will show you the raw contents of the file (typically in hexadecimal rather than binary, since the zeros and ones would take up a lot of space and be harder to read).
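As a rough illustration of what a hex editor shows, here is a minimal Python sketch. The function names `hex_dump` and `bit_dump` and the sample bytes are made up for this example; a real hex editor does much more (editing, searching, large-file handling).

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes roughly like a hex editor: offset, hex values, printable text."""
    pad = width * 3 - 1  # room for `width` two-digit values separated by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


def bit_dump(data: bytes) -> str:
    """Render every byte as eight 1s and 0s."""
    return " ".join(f"{b:08b}" for b in data)


# Hypothetical sample: the first bytes of a tiny GIF-like file.
sample = b"GIF89a\x01\x00\x01\x00\x80\x00\x00"
print(hex_dump(sample))
print(bit_dump(sample[:3]))  # 01000111 01001001 01000110
```

The bit view needs eight characters per byte where the hex view needs two, which is why hex editors default to hexadecimal.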
a data file?
There is a Linux command-line program called "file" that will attempt to analyze the file (typically looking for common header patterns) and tell you what sort of file it is (for example text, or audio, or video, or XML, etc). I'm not sure if there is an equivalent program for Windows. Of course, the output of this program is just a guess, but it can be very useful when you don't know what the format of a file is.
A binary file appears as gibberish because the data in it is designed for the machine to read and not for humans. Sadly, some of us get used to interpreting gibberish - albeit with somewhat specialized tools to help see the data better - but most people should not need to know.
Each byte in the file is treated as a character in the current code set (probably CP1252 on Windows). Byte value 65 is 'A', for example; you can find illustrative examples easily on the web. So, the bytes that make up the binary data are displayed according to the code set - as best as the text editor can. It doesn't try to convert the binary - it doesn't know how (only the original program does).
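The byte-to-character mapping described above can be reproduced in a few lines of Python. This is only a sketch: the helper name `as_editor_text` and the sample bytes are invented for the example, and real editors differ in how they handle bytes that have no character in the code set.

```python
def as_editor_text(data: bytes, encoding: str = "cp1252") -> str:
    """Interpret raw bytes as characters, the way a naive text editor would."""
    # errors="replace" substitutes U+FFFD for bytes the code set leaves undefined.
    return data.decode(encoding, errors="replace")


# Byte value 65 is 'A'; the other values are arbitrary "binary" bytes.
data = bytes([65, 0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])
text = as_editor_text(data)
print(repr(text))
```

Some of the bytes happen to map to letters, others to symbols or control characters, and none of it reflects what the data means to the program that wrote it.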
As to how to detect what program created the file: you may be able to do that sometimes, but not easily or reliably. On Unix (or with Cygwin on Windows) the 'file' program may be able to help. It looks at the first few bytes of the file to try to guess the file type, which in turn often suggests the program.
Encrypted data is supposed to look like gibberish. If it doesn't look like gibberish, then it probably isn't very well encrypted.
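One way to put a number on "looks like gibberish" is the Shannon entropy of the byte values. The sketch below is an illustration only; the function name `byte_entropy` is made up, and high entropy is also typical of compressed data, so it is a hint rather than a test for encryption.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (one repeated value) up to 8.0 (uniform)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(byte_entropy(b"aaaaaaaa"))        # a single repeated byte value
print(byte_entropy(b"hello, world"))    # ordinary text sits somewhere in between
print(byte_entropy(bytes(range(256))))  # every byte value exactly once
```

Well-encrypted data should score close to 8 bits per byte on a large enough sample, while plain text usually scores noticeably lower.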
The display looks interesting, because a binary file can contain non-printable characters. It is up to the displaying program to replace such characters with something else.
This can be prevented by using a hex editor. Such a program displays each byte from the file as its hexadecimal value. That makes for a nice tabular view of the file, but it is not easy for the average person to decipher this view, because we are not used to looking at data that way.
There are a few ways to find out what program a file might belong to. You can look at the beginning of the file and, with some knowledge, you might recognize the file type. Files of some types always begin with the same characters (RAR archives with "Rar!", GIF images with "GIF87a" or "GIF89a", etc.). For other types it might not be as easy.
In Linux you can use the "file" command to help you determine file type. There are probably programs for Windows that will do the same.
Binary data often looks very random, and well-encrypted data is designed to. Each byte can take one of 256 values, and an editor maps each value to a character (leaving Unicode out of the equation). ASCII covers only 128 of these, and only 94 of those are visible printable characters (95 if you count the space). Outside the ASCII range, you have a number of international characters and strange symbols. There are far more than 128 of these across the world's languages, so one must specify a code page to select a specific set of symbols.
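The counts in this answer are easy to check. A small Python sketch (the variable names are chosen for this example) that classifies the 128 ASCII codes:

```python
# ASCII assigns codes 0..127. Codes 0..31 and 127 are control characters,
# 32 is the space, and 33..126 are the visible printable characters.
control = [b for b in range(128) if b < 32 or b == 127]
visible = [b for b in range(128) if 33 <= b <= 126]

print(len(control))  # 33
print(len(visible))  # 94
print("".join(chr(b) for b in visible))

# Share of all 256 byte values that map to a visible ASCII character.
print(f"{len(visible) / 256:.1%}")  # 36.7%
```

So if the bytes of a file are spread evenly over all 256 values, only a bit more than a third of them land on a familiar visible character; the rest become control codes or code-page-dependent symbols.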
Anyway, since binary files can be represented as a very random assortment of familiar and unfamiliar characters, the file will look like gibberish if you open it in an editor.
You could always open a file (binary or text file, there really is no difference) in a hex editor, and look at the raw binary data.
There is no general way to tell which program created a specific file. In particular, if the program has encrypted its data, all hope is lost. Otherwise, it is often easy to recognize certain "signatures" that identify the file format.
Binary files display as gibberish in standard text editors such as Notepad because the editor maps the file's bytes to characters using one of the encodings these applications commonly use (e.g. ASCII or UTF-8). The resulting characters generally make as little sense to humans as the binary data they were mapped from, hence the gibberish you see.
As previously mentioned, these files make more sense when viewed in a different way, such as with a hex editor.
Certain file types can be recognized by data present in all files of a given type. For example, all Windows executable files (*.exe) begin with the letters MZ.
Yes, WordPad, Notepad and many other text editors assume that any file you open with them is a text file, and will try to display the ASCII characters represented by the bytes in the file.
Hex Editors are made to view and edit binary files. They usually display each byte as a pair of hexadecimal digits instead of "1s and 0s" because it's easier to read that way.
A text editor makes very few assumptions about the data coming into it, besides things like character encodings. Thus, it will (as you say) read the file's data as ASCII and display it that way. Since binary data doesn't always fall within the alphanumeric range, you get gibberish. As for showing the raw binary values, you need a hex editor like XVI32.
Binary files often have no context outside of the program that uses them. Some binary formats contain a short magic sequence at the beginning (for example, Java .class files start with the four bytes CA FE BA BE, i.e. "CAFEBABE" in hex), but to recognize them without their program, you need a mapping from those sequences to formats. I believe some Linux distros contain this information for a wide variety of binary formats and will examine the beginning of the file to attempt to identify it. Other than that, there's not much you can do.
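The "mapping of magic sequences" idea can be sketched in Python as follows. The table is a small hand-picked sample of well-known signatures, not the database that the `file` command uses, and the names `SIGNATURES`, `guess_type` and `guess_file_type` are invented for this example.

```python
from typing import Optional

# (leading bytes, description) pairs; longer or more specific prefixes first.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"Rar!\x1a\x07", "RAR archive"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also the container for .jar, .docx, ...)"),
    (b"\xca\xfe\xba\xbe", "Java class file (or Mach-O universal binary)"),
    (b"MZ", "DOS/Windows executable"),
]


def guess_type(header: bytes) -> Optional[str]:
    """Return a description if the data starts with a known signature, else None."""
    for magic, description in SIGNATURES:
        if header.startswith(magic):
            return description
    return None


def guess_file_type(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and try to identify its format."""
    with open(path, "rb") as f:
        return guess_type(f.read(16))


print(guess_type(b"MZ\x90\x00\x03"))              # DOS/Windows executable
print(guess_type(b"\xca\xfe\xba\xbe\x00\x00"))    # Java class file (or ...)
print(guess_type(b"just some text"))              # None
```

As with `file`, the result is only a guess: formats without a fixed signature return None here, and a file can start with a known signature by coincidence.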