How does file streaming actually work?
I've been wondering for a while now: how exactly does file streaming work? By file streaming, I mean accessing parts of a file without loading the whole file into memory.
I believe the C++ classes `(i|o)fstream` do exactly that, but how is it implemented? Is it possible to implement file streaming yourself?
How does it work at the lowest C / C++ (or any language that supports file streaming) level? Do the C functions `fopen`, `fclose`, `fread` and the `FILE*` pointer already take care of streaming (i.e., not loading the whole file into memory)? If not, how would you read directly from the hard drive, and is there such a facility already implemented in C / C++?
Any links, hints, or pointers in the right direction would be very helpful. I've googled, but it seems Google doesn't quite understand what I'm after...
Ninja-Edit: If anybody knows anything about how this works at the assembly / machine code level, and whether it's possible to implement this yourself or you have to rely on system calls, that would be awesome. :) Not a requirement for an answer, though a link in the right direction would be nice.
Comments (2)
At the lowest level (at least for userland code), you'll use system calls. On UNIX-like platforms, these include:
open
close
read
write
lseek
...and others. These work by passing around file descriptors, which are just opaque integers. Inside the operating system, each process has a file descriptor table that holds all of its file descriptors and the relevant information for each, such as which file it refers to and what kind of file it is.
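As a rough illustration of the calls listed above, here is a minimal sketch that walks a file through a fixed 4 KiB buffer using `open`, `read`, and `close`. The helper name `count_bytes` and the buffer size are assumptions made for this example, not anything prescribed by POSIX.

```c
#include <fcntl.h>      /* open, O_RDONLY */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read, close */

/* Hypothetical helper: count the bytes in `path` while holding at most
   4096 bytes of the file in our own memory at any time.
   Returns the byte count, or -1 on error. */
long count_bytes(const char *path)
{
    int fd = open(path, O_RDONLY);   /* the kernel hands back a descriptor */
    if (fd < 0)
        return -1;

    char buf[4096];                  /* the only file data we ever hold */
    long total = 0;
    ssize_t n;

    /* Each read() asks for at most sizeof buf bytes and advances the
       file offset; a return value of 0 means end of file. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling `count_bytes("some_file")` on a multi-gigabyte file should use the same small buffer throughout, which is the sense in which the file is "streamed".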
There are also Windows API calls similar to the system calls on UNIX:
CreateFile
CloseHandle
ReadFile / ReadFileEx
WriteFile / WriteFileEx
SetFilePointer / SetFilePointerEx
Windows passes around `HANDLE`s, which are similar to file descriptors but are, I believe, a little less flexible. (For example, on UNIX, file descriptors can represent not only files but also sockets, pipes, and other things.)
The C standard library functions `fopen`, `fclose`, `fread`, `fwrite`, and `fseek` are wrappers around these system calls; they add a user-space buffer on top but do not load the whole file.
When you open a file, usually none of the file's contents is read into memory. When you use `fread` or `read`, you tell the operating system to read a particular number of bytes into a buffer. That number can be, but does not have to be, the length of the file. As such, you can read only part of a file into memory if desired.
Answer to ninja-edit:
You asked how this works at the machine code level. I can only really explain how this works on Linux and the Intel 32-bit architecture. When you use a system call, some of the arguments are placed into registers. After the arguments are placed into the registers, interrupt `0x80` is raised. So, for example, to read one kilobyte from `stdin` (file descriptor 0) to the address `0xDEADBEEF`, you would load the system call number for `read` into `eax`, the descriptor 0 into `ebx`, the address `0xDEADBEEF` into `ecx`, and the count 1024 into `edx`, then execute `int 0x80`.
`int 0x80` raises a software interrupt that the operating system usually will have registered in the interrupt vector table or interrupt descriptor table. Anyway, the processor will jump to a particular place in memory. Once there, the operating system will usually enter kernel mode (if necessary) and then do the equivalent of C's `switch` on `eax`. From there, it will jump into the implementation for `read`. In `read`, it will usually read some metadata about the descriptor from the calling process's file descriptor table. Once it has all the data it needs, it does its stuff, then returns to the user code.
To "do its stuff", let's assume it's reading from disk, and not a pipe or `stdin` or some other non-physical place. Let's also assume it's reading from the primary hard disk, and that the operating system can still access the BIOS interrupts.
To access the file, it needs to do a bunch of filesystem things, for example traversing the directory tree to find where the actual file is. I'm not going to cover this much, since I bet you can guess.
The interesting part is reading data from the disk, whether it be filesystem metadata, file contents, or something else. First, you get a logical block address (LBA). An LBA is just an index of a block of data on the disk. Each block is usually 512 bytes (although this figure may be dated). Still assuming we have access to the BIOS and the OS uses it, it will then convert the LBA to CHS notation. CHS (Cylinder-Head-Sector) notation is another way to reference parts of the hard drive. It used to correspond to physical concepts; nowadays it is outdated, though almost every BIOS still supports it. From there, the OS will stuff data into registers and trigger interrupt `0x13`, the BIOS's disk-reading interrupt.
That's the lowest level I can explain, and I'm sure the part after I assumed the operating system used the BIOS is outdated. Everything before that is, I believe, still how it works, if at a simplified level.
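The system call step described above can be approximated from C without writing assembly. The sketch below assumes Linux with glibc, whose `syscall` function loads the call number and arguments into whatever registers the current architecture expects and then traps into the kernel. The helper name `raw_read` is made up for this example.

```c
#define _GNU_SOURCE       /* expose syscall() in <unistd.h> on glibc */
#include <sys/syscall.h>  /* SYS_read: the system call number for read */
#include <unistd.h>       /* syscall */

/* Hypothetical helper: invoke the kernel's read system call directly,
   bypassing the usual read() wrapper. On 32-bit x86 Linux this is
   roughly what ends in `int 0x80`; other architectures use different
   trap instructions and registers, which syscall() hides. */
long raw_read(int fd, void *buf, unsigned long count)
{
    return syscall(SYS_read, fd, buf, count);
}
```

For instance, `raw_read(0, buf, 1024)` would request up to one kilobyte from standard input, matching the example in the answer. In ordinary code the plain `read` wrapper is the better choice; this only makes the kernel boundary visible.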
At the lowest level, on POSIX platforms, open files are represented by "descriptors" in userspace. A file descriptor is just an integer that is unique among a process's open files at any given time. The descriptor identifies which open file an operation should be applied to when asking the kernel to perform that operation. So, `read(0, charptr, 1024)` does a read from the open file associated with descriptor `0` (by convention, this will probably be the process's standard input).
As far as userspace can tell, the only parts of a file that are loaded into memory are those required to satisfy an operation like `read`. To read bytes from the middle of a file, another operation is supported: "seek". This tells the kernel to reposition the offset in a particular file. The next `read` (or `write`) operation will work with bytes from that new offset. So `lseek(123, 100, SEEK_SET)` repositions the offset for the file associated with `123` (a descriptor value I just made up) to byte offset 100. The next read on `123` will start from that position, not from the beginning of the file (or wherever the offset was previously). And any bytes not read don't need to be loaded into memory.
There is a little more complexity behind the scenes: the disk usually can't read less than a "block", which is typically a power of two around 4096 bytes, and the kernel probably does extra caching and something called "readahead". But these are optimizations, and the basic idea is what I described above.
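The seek-then-read pattern described above might look like the following minimal sketch. The helper name `read_at` is hypothetical, and error handling is kept to the bare minimum.

```c
#include <fcntl.h>      /* open, O_RDONLY */
#include <sys/types.h>  /* off_t, ssize_t */
#include <unistd.h>     /* lseek, read, close */

/* Hypothetical helper: read up to `count` bytes of `path`, starting at
   byte `offset`, into `buf`. Returns the number of bytes read (0 if the
   offset is at or past end of file), or -1 on error. */
ssize_t read_at(const char *path, off_t offset, void *buf, size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Reposition the descriptor's file offset; nothing is read yet. */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }

    /* Only the requested bytes are copied into our buffer. */
    ssize_t n = read(fd, buf, count);
    close(fd);
    return n;
}
```

POSIX also provides `pread`, which combines the seek and the read in one call without moving the descriptor's offset; the two-step version is shown here because it mirrors the explanation.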