Saving data to a binary file
I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.
Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and NULL
at the end. I would expect to see only zero's and one's inside the file.
Any explaination or suggestions are highly appreciated.
Here is my code
#include <iostream>
#include <stdio.h>
#include <cstdlib> /* for std::system */
int main()
{
/*Temporary data buffer*/
char buffer[20];
/*Data to be stored in file*/
char temp[20]="Test";
/*Opening file for writing in binary mode*/
FILE *handleWrite=fopen("test.bin","wb");
/*Writing data to file*/
fwrite(temp, 1, 13, handleWrite);
/*Closing File*/
fclose(handleWrite);
/*Opening file for reading*/
FILE *handleRead=fopen("test.bin","rb");
/*Reading data from file into temporary buffer*/
fread(buffer,1,13,handleRead);
/*Displaying content of file on console*/
printf("%s",buffer);
/*Closing File*/
fclose(handleRead);
std::system("pause");
return 0;
}
All files contain only ones and zeroes, on binary computers that's all there is to play with.
When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.
So for text, a text file or a binary file almost doesn't matter; the savings in space that you've heard about generally come into play for other data types.
Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period. If saved in binary as just a copy of the float's bits, it will take four characters (four bytes, or 32 bits) on a typical 32-bit system. The exact number of bits stored is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
The problem you describe is a chain of (very common1, unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on, hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week, or two, practice, see what happens :)
There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on, here. Joel Spolsky (one of the founders of Stackoverflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.
After that, let us proceed:
When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. "type" of a variable only means something to the compiler, not to the running program who just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).
Every compiler is free to use whatever encoding they wish to represent characters in memory. As a consequence, it is free represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler, I can agree with myself that every time I want to store a "newline" internally I store it as number six (6), which is just 0x6 in binary (or 110 in binary).
Writing to a file is done by telling the operating system2 four things at the same time:
- that you want to write data to a file (the call fwrite())
- where the data starts in memory (the first argument of fwrite)
- the size of each element and how many elements there are (the second and third arguments of fwrite)
- which file to write it to (the last argument of fwrite)
Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets and it does not care: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.
Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-on-one to the file. If you write a memory location that used to hold variables that the compiler decided to store as type "char", those values are written one-on-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.
Opening a file in "text" mode is almost equal to binary mode, with one (and only one) difference: anytime a value is written that is equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On Windows, this is two bytes (13 and 10, or 0x0d 0x0a). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way that we absolutely do not mean to.
(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!
And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.
Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT (there's multiple bits to make up one byte, to extract them you need to know 'bitwise' operator tricks, google!) as a "1" or "0" to the user. Note that "1" is the CHARACTER 1, the point in the character set of your choosing, so your program must take a bit (number 1 or 0) and transform it to the sequence of bits needed to represent character 1 or 0 in the encoding that the terminal emulator uses that you are viewing the standard out of the program on oh my God. Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
This stuff is really daunting for newbies, and I know that it took me a long time to even know that there was a difference between a character set and an encoding, let alone how all of this worked. Hopefully I did not demotivate you, if I did, just remember that you can never lose knowledge you already have, only gain it (ok not always true :P). It is normal in life that a statement raises more questions than it answered, Socrates knew this and his wisdom seamlessly applies to modern day technology 2.4k years later.
Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.
Hraban
1 The person that told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless he was referring to compressing the data before you save it, in which case he just uses a confusing word ("binary") for "compressed".
2 "telling the operating system something" is what is commonly known as a system call.
Well, the difference between text and binary mode is the way the end of line is handled.
If you write a string in binary mode, it will stay the same string.
If you want to make it smaller, you'll have to compress it somehow (look at zlib, for example).
What is smaller: when you want to save binary data (like an array of bytes), it's smaller to save it as binary rather than putting it in a string (either in hex representation or base64). I hope this helps.
I think you're a bit confused here.
The ASCII string "Test" will still be an ASCII string when you write it to the file (even in binary mode). The cases where it makes sense to write binary are for types other than chars (e.g. an array of integers).
The function printf("%s", buffer); prints buffer as a zero-terminated string.
Try using:
char temp[20]="Test\n\rTest";