General question about binary files

Posted 2024-10-26 12:37:55 | Words: 250 | Views: 1 | Comments: 0

I am a beginner and I am having trouble grasping binary files. When I write to a file in binary mode (in Python), I just write normal text; there is nothing binary about it. I know every file on my computer is ultimately binary, but I am having trouble distinguishing between files I write in binary mode and files such as audio or video, which show up as gibberish if I open them in a text editor.

How are the files that show up as gibberish created? Can you please give an example of a small file created this way, preferably in Python?

I have a feeling I am asking a really stupid question but I just had to ask it. Googling around didn't help me.

Comments (6)

往事随风而去 2024-11-02 12:37:55

When I write to a file in binary mode (in python), I just write normal text.

You'll have to change your approach when you upgrade to Python 3.x:

>>> f = open(filename, 'wb')
>>> f.write("Hello, world!\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes or buffer, not str
>>> f.write(b"Hello, world!\n")
14

But your question isn't really about binary files. It's about str.

In Python 2.x, str is a byte sequence that has an overloaded meaning:

  • A non-Unicode string, or
  • Raw binary data (like pixels in an image).

If you print the latter as if it were the former, you get gibberish.

Python 3.x got rid of this double meaning by introducing a separate bytes type for binary data, leaving str unambiguously as a text string (and making it Unicode).
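As a rough illustration of that split in Python 3, the sketch below encodes text into bytes and then deliberately misreads some raw byte values as text. The variable names and the sample byte values are made up for the example; Latin-1 is chosen only because it maps every byte value to some character, so the decode never fails.

```python
# Text (str) and raw bytes are distinct types in Python 3.
text = "Hello, world!\n"

# Encoding turns text into bytes; each ASCII character becomes one byte.
data = text.encode("ascii")
print(type(text).__name__, type(data).__name__)  # str bytes
print(len(data))

# Raw binary data: byte values that were never meant to be characters
# (arbitrary sample values, standing in for pixel data).
pixels = bytes([0, 255, 137, 80, 200, 7])

# Latin-1 assigns a character to every byte value, so this decode succeeds,
# but the resulting "text" is meaningless. This is the gibberish.
as_text = pixels.decode("latin-1")
print(repr(as_text))
```

Going the other way, `data.decode("ascii")` recovers the original string, because those bytes really were produced from text.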

书间行客 2024-11-02 12:37:55

Here's a literal answer to your question:

import struct
with open('gibberish.bin', 'wb') as f:
    f.write(struct.pack('<4d', 3.14159, 42.0, 123.456, 987.654))

That's packing those 4 floating point numbers into a binary format (little-endian IEEE 754 64-bit floating point).
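To see that nothing is lost, here is a small sketch that writes the same four numbers, reads the raw bytes back, and unpacks them. The temporary directory is an assumption added so the example does not leave a file in the working directory.

```python
import os
import struct
import tempfile

values = (3.14159, 42.0, 123.456, 987.654)
path = os.path.join(tempfile.mkdtemp(), "gibberish.bin")

# Write four little-endian 64-bit floats, 8 bytes each.
with open(path, "wb") as f:
    f.write(struct.pack("<4d", *values))

# Read the file back as raw bytes.
with open(path, "rb") as f:
    raw = f.read()

print(len(raw))   # 32
print(raw[:8])    # the eight bytes that encode 3.14159

# Unpacking with the same format string recovers the numbers.
restored = struct.unpack("<4d", raw)
print(restored)
```

Opening the file in a text editor shows the gibberish, while `struct.unpack` with the matching format string gives the original values back.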

Here's (some of) what you need to know:

Reading and writing a file in binary mode incurs no transformation on the data that you read or write. In text mode, in addition to any decoding/encoding from/to Unicode, the data that you read or write is transformed according to the platform conventions for "text files":

Unix/Linux/Mac OS X: no change

older Mac: line separator is \r, changed to/from Python standard \n

Windows: line separator is \r\n, changed to/from \n. Also (little known fact), Ctrl-Z aka \x1a is interpreted as end-of-file, a convention inherited from CP/M which recorded file sizes as the number of 128-byte sectors used.
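The line-separator translation can be observed on any platform with a sketch like the one below. It relies on the `newline` parameter of Python 3's built-in `open()` to force the Windows convention explicitly; the file name and temporary directory are assumptions for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")

# newline="\r\n" asks Python to translate each "\n" to "\r\n" on write,
# mimicking the Windows text-file convention regardless of platform.
with open(path, "w", encoding="ascii", newline="\r\n") as f:
    f.write("one\ntwo\n")

# Binary mode shows the bytes exactly as they are stored.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'one\r\ntwo\r\n'

# Text mode with default newline handling translates "\r\n" back to "\n".
with open(path, "r", encoding="ascii") as f:
    text = f.read()
print(repr(text))  # 'one\ntwo\n'
```

The file holds 10 bytes even though only 8 characters were written, which is exactly the kind of silent change that damages non-text data.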

清风不识月 2024-11-02 12:37:55

So-called "text" files are simply files that follow certain conventions: the bytes are usually a subset of all the possible bytes, generally ASCII or encoded Unicode values, and are organized into "lines" with "line terminators". The standard line terminators vary by platform (Unix uses \n, classic Mac OS \r, and Windows \r\n), so part of the convention is to translate these on the fly. This works fine with text files, but will clobber other kinds of files, because a 0x0a (\n) byte in a sound file or something won't take well to being converted to 0x0d 0x0a (\r\n). Of course, if you've only been using Unix, this won't have come up.

In Python 3, all strings are Unicode, and opening a file as text means you have to read and write Unicode strings, and perhaps specify an encoding (the default depends on the platform and locale; it is often UTF-8). Opening a file as binary means you have to use bytes objects, which are simple immutable sequences of 8-bit bytes and don't get encoded.
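A short sketch of the difference, assuming UTF-8 is chosen explicitly and using a made-up file name: the same file is written as text and then read back as bytes.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "word.txt")

# Text mode: we hand over a str, and Python encodes it with the given encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")  # four characters, the last one is e with an acute accent

# Binary mode: we get back the stored bytes, with no decoding applied.
with open(path, "rb") as f:
    raw = f.read()

print(raw)        # b'caf\xc3\xa9'
print(len(raw))   # 5
print(list(raw))  # [99, 97, 102, 195, 169]

# Decoding with the same encoding recovers the text.
print(raw.decode("utf-8"))
```

Four characters became five bytes because the accented letter takes two bytes in UTF-8; indexing or iterating a bytes object yields plain integers in the range 0 to 255.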

Does this clarify things?

待"谢繁草 2024-11-02 12:37:55

Binary files are normally created when you try to encode objects. For example, you might have a Person object with properties like Name, Age, Height. If you were to write this file as text so that it can be read back in later, you might output something like this:

Name:Ralph
Age:25
Height:5'6"

But you can represent it more compactly in binary. In binary, you might just output the name, age and height one right after the other, and you'd have to read them back in the exact same order because you no longer have these delimiters. In that case, your string would have to be encoded with something like Ralph\0. The \0 is the null character, which marks where the string ends.

The 25 can be represented as just 2 characters in text/ASCII, but if you tried putting two numbers side by side, like 25 and 26, you'd get 2526 and you wouldn't know where one ends and the next begins. These numbers are actually integers and are typically represented by 4 bytes. When you write a file as binary, you'd write out all 4 bytes, even if the left-most bits are all 0. That way the reader always knows exactly how much to read. And so forth...

That's why "binary files" look like gibberish: the bytes hold raw values such as 4-byte integers and terminators rather than readable characters, and a text editor displays them as whatever symbols those byte values happen to map to.

To generate these files, you'd have to encode or "pack" your data like John Machin suggests.
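A minimal sketch of that layout follows. The function names `pack_person` and `unpack_person` are hypothetical, and storing the height as a whole number of inches (5'6" is 66) is an assumption made to keep the record simple.

```python
import struct

def pack_person(name, age, height_inches):
    """Hypothetical layout: NUL-terminated ASCII name, then two 4-byte ints."""
    return name.encode("ascii") + b"\0" + struct.pack("<ii", age, height_inches)

def unpack_person(data):
    """Read the fields back in exactly the order they were written."""
    end = data.index(b"\0")              # the name stops at the first NUL byte
    name = data[:end].decode("ascii")
    age, height_inches = struct.unpack_from("<ii", data, end + 1)
    return name, age, height_inches

record = pack_person("Ralph", 25, 66)
print(record)                 # b'Ralph\x00\x19\x00\x00\x00B\x00\x00\x00'
print(len(record))            # 14
print(unpack_person(record))  # ('Ralph', 25, 66)
```

The name happens to stay readable because its bytes are ASCII, while the two integers show up as \x19 and B followed by zero bytes, which is what a text editor would render as gibberish.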

你在看孤独的风景 2024-11-02 12:37:55

Maybe you are writing a string to your binary file, and your computer can decode it and show it to you? Try writing a file of random bytes. Or you could show us your code so we can understand the problem.
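One way to try this, sketched with `os.urandom` from the standard library; the file name, the temporary directory, and the size of 64 bytes are arbitrary choices for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "random.bin")

# 64 bytes with no textual meaning at all.
random_data = os.urandom(64)

with open(path, "wb") as f:
    f.write(random_data)

with open(path, "rb") as f:
    raw = f.read()

print(raw == random_data)  # True

# Forcing the bytes into text replaces undecodable sequences with U+FFFD,
# roughly what a text editor does when it shows gibberish.
print(raw.decode("utf-8", errors="replace"))
```

Opening the resulting file in a text editor should look much like opening an audio or video file.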

雨落□心尘 2024-11-02 12:37:55

I recommend using Python's codecs module for writing text files (it lets you set the charset/encoding). For writing binary files, use the built-in open() function (file() existed only in Python 2). On Windows you need 'wb' or 'rb' for binary mode; in Python 2 on Unix the 'b' makes no difference, while Python 3 needs it on every platform to read and write bytes.
