How do file streams actually work?

Posted 2024-11-08 06:59:52

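I'd like to know how file streams actually work. By file streaming I mean accessing part of a file without loading the whole file into memory.
I (believe I) know that the C++ classes (i|o)fstream do exactly that, but how is it implemented? Is it possible to implement file streaming yourself?
How does it work at the lowest C / C++ level (or in any language that supports file streaming)? Do the C functions fopen, fclose, fread and the FILE* pointer already take care of streaming (i.e., not loading the whole file into memory)? If not, how would you read directly from the hard disk, and is there such a facility already implemented in C / C++?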

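Any links, hints, or pointers in the right direction would be very helpful. I have googled, but Google doesn't seem to quite understand what I'm after...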

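Ninja-Edit: If anybody knows how this works at the assembly / machine code level, and whether it is possible to implement this yourself or whether you have to rely on system calls, that would be awesome. :) It's not a requirement for an answer, though a link in the right direction would be nice.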

じее 2024-11-15 06:59:52

At the lowest level (at least for userland code), you'll use system calls. On UNIX-like platforms, these include:

open
close
read
write
lseek

...and others. These work by passing around file descriptors, which are just opaque integers. Inside the operating system, each process has a file descriptor table that maps each descriptor to information about the open file, such as which file it refers to and what kind of file it is.
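To make the descriptor-passing idea concrete, here is a minimal sketch (not a definitive implementation) that streams a whole file through a fixed 4096-byte buffer using only open, read, and close. The function name count_bytes and the buffer size are my own choices for illustration; error handling is deliberately simple.

```c
#include <fcntl.h>
#include <unistd.h>

/* Count the bytes in a file by streaming it through a small fixed buffer.
 * At no point is more than sizeof buf bytes of the file held in memory.
 * Returns the byte count, or -1 if the file cannot be opened or read.
 * (count_bytes is a made-up name for this sketch.) */
long long count_bytes(const char *path)
{
    char buf[4096];
    long long total = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);   /* the kernel hands back a descriptor */
    if (fd < 0)
        return -1;

    /* Each read() asks the kernel for at most sizeof buf bytes and
     * advances the file offset stored on the kernel side. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling count_bytes on a multi-gigabyte file would still only ever use the one 4 KB buffer, which is really all that "streaming" means at this level.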

There are also Windows API calls that play a role similar to the system calls on UNIX, such as CreateFile, ReadFile, WriteFile, SetFilePointer, and CloseHandle.

Windows passes around HANDLEs, which are similar to file descriptors but are, I believe, a little less flexible. (For example, on UNIX, file descriptors can represent not only files but also sockets, pipes, and other things.)

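The C standard library functions fopen, fclose, fread, fwrite, and fseek are wrappers around these system calls. What they add is a user-space buffer (kept inside the FILE object), so that many small reads or writes turn into fewer system calls.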
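Since the question asks whether you could implement a file stream yourself: roughly, yes. Below is a toy sketch of what a stdio-style stream adds on top of read, namely a descriptor plus a buffer that gets refilled when it runs dry. The names mini_stream, mini_init, and mini_getc are invented for this example, and real FILE implementations do a great deal more (writing, seeking, locking, error flags).

```c
#include <stddef.h>
#include <unistd.h>

/* A hypothetical miniature "FILE": a descriptor plus a user-space buffer. */
struct mini_stream {
    int fd;
    size_t pos;               /* index of the next unread byte in buf */
    size_t len;               /* number of valid bytes currently in buf */
    unsigned char buf[4096];
};

void mini_init(struct mini_stream *s, int fd)
{
    s->fd = fd;
    s->pos = 0;
    s->len = 0;
}

/* Return the next byte (0..255), or -1 at end of file or on error.
 * Only when the buffer is exhausted does this make a system call. */
int mini_getc(struct mini_stream *s)
{
    if (s->pos == s->len) {
        ssize_t n = read(s->fd, s->buf, sizeof s->buf);
        if (n <= 0)
            return -1;
        s->len = (size_t)n;
        s->pos = 0;
    }
    return s->buf[s->pos++];
}
```

With this layout, reading a file one byte at a time costs one read system call per 4096 bytes rather than one per byte.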

When you open a file, usually none of the file's contents is read into memory. When you use fread or read, you tell the operating system to read a particular number of bytes into a buffer. This particular number of bytes can be, but does not have to be, the length of the file. As such, you can read only part of a file into memory, if desired.
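As an illustration of reading only part of a file with the standard library, here is a short sketch using fopen, fseek, and fread. The helper name read_slice is my own, and returning 0 for both "error" and "nothing to read" is a simplification.

```c
#include <stdio.h>

/* Read up to `count` bytes starting at byte `offset` of the file at `path`
 * into `out`. Returns the number of bytes actually read (0 on error or if
 * the offset is at or past the end of the file).
 * (read_slice is a made-up helper name for this sketch.) */
size_t read_slice(const char *path, long offset, void *out, size_t count)
{
    size_t got = 0;
    FILE *fp = fopen(path, "rb");   /* opening reads no file contents yet */
    if (fp == NULL)
        return 0;

    if (fseek(fp, offset, SEEK_SET) == 0)
        got = fread(out, 1, count, fp);   /* only this slice is requested */

    fclose(fp);
    return got;
}
```

For a file containing "0123456789", read_slice(path, 3, buf, 4) fills buf with the four bytes "3456" and returns 4.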

Answer to ninja-edit:

You asked how this works at the machine code level. I can only really explain how this works on Linux and the Intel 32-bit architecture. When you use a system call, some of the arguments are placed into registers. After the arguments are placed into the registers, interrupt 0x80 is raised. So, for example, to read one kilobyte from stdin (file descriptor 0) to the address 0xDEADBEEF, you might use this assembly code:

mov eax, 0x03       ; system call number (read = 0x03)
mov ebx, 0          ; file descriptor (stdin = 0)
mov ecx, 0xDEADBEEF ; buffer address
mov edx, 1024       ; number of bytes to read
int 0x80            ; Linux system call interrupt
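If you'd rather poke at this from C than from assembly, glibc on Linux exposes a generic syscall() function that takes the system call number followed by its arguments. The sketch below is Linux-specific, and raw_read is a name I made up. Using the SYS_read macro instead of a literal number matters because the number differs between architectures (3 on 32-bit x86 as in the assembly above, 0 on x86-64).

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Linux-specific sketch: invoke the read system call by number through
 * glibc's generic syscall() wrapper instead of calling read() directly.
 * Returns the number of bytes read, or -1 with errno set on failure.
 * (raw_read is a hypothetical name for this example.) */
long raw_read(int fd, void *buf, unsigned long count)
{
    return syscall(SYS_read, fd, buf, count);
}
```

On a modern kernel the library may enter the kernel with a different instruction than int 0x80, but the idea is the same: a number selecting the call, plus arguments in registers.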

int 0x80 raises a software interrupt whose handler the operating system has registered in the interrupt descriptor table (the protected-mode successor to the real-mode interrupt vector table). The processor switches to kernel mode and jumps to that handler. Once there, the operating system does the equivalent of C's switch on eax and jumps into the implementation of read. In read, it will usually look up some metadata about the descriptor in the calling process's file descriptor table. Once it has all the data it needs, it does its stuff, then returns to the user code, leaving the result (the number of bytes read, or a negative error code) in eax.
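The "switch on eax" step can be modeled in a few lines of ordinary C. To be clear, this is a toy: every name here (toy_regs, toy_fd_table, toy_dispatch, and so on) is invented for illustration, the "read" copies no data, and real kernels typically use a table of function pointers rather than a literal switch.

```c
#include <errno.h>

/* Toy model only; none of these names or layouts match a real kernel. */
enum { TOY_SYS_READ = 3 };          /* same number as read on 32-bit x86 Linux */
#define TOY_MAX_FDS 4

struct toy_regs { long eax, ebx, ecx, edx; };   /* saved user registers */
struct toy_file { int in_use; long offset; };   /* one descriptor table slot */

/* Per-process descriptor table; only descriptor 0 starts out open. */
static struct toy_file toy_fd_table[TOY_MAX_FDS] = { { 1, 0 } };

static long toy_read(long fd, long buf_addr, long count)
{
    (void)buf_addr;                 /* a real kernel would copy data here */
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd].in_use)
        return -EBADF;              /* unknown descriptor */
    toy_fd_table[fd].offset += count;   /* pretend `count` bytes were read */
    return count;
}

/* What the interrupt handler does after saving registers: pick the
 * implementation from the number in eax and pass along the arguments. */
long toy_dispatch(const struct toy_regs *r)
{
    switch (r->eax) {
    case TOY_SYS_READ:
        return toy_read(r->ebx, r->ecx, r->edx);
    default:
        return -ENOSYS;             /* no such system call */
    }
}
```

The negative return values mirror the Linux convention of reporting errors as negated errno codes in eax.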

To "do its stuff", let's assume it's reading from disk, and not a pipe or stdin or some other non-physical place. Let's also assume it's reading from the primary hard disk. Also, let's assume the operating system can still access the BIOS interrupts.

To access the file, it needs to do a bunch of filesystem things, for example traversing the directory tree to find where the actual file is. I'm not going to cover this much, since I bet you can guess.

The interesting part is reading data from the disk, whether it be filesystem metadata, file contents, or something else. First, you get a logical block address (LBA). An LBA is just the index of a block of data on the disk. Each block is usually 512 bytes (although this figure may be dated; many newer drives use 4096-byte sectors). Still assuming we have access to the BIOS and the OS uses it, the OS will then convert the LBA to CHS notation. CHS (Cylinder-Head-Sector) notation is another way to reference parts of the hard drive. It used to correspond to physical concepts; nowadays it's outdated, but almost every BIOS still supports it. From there, the OS will stuff data into registers and trigger interrupt 0x13, the BIOS's disk-services interrupt.
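The LBA-to-CHS conversion mentioned above is plain integer arithmetic once you know the drive geometry. Here is a sketch of the conventional formula; the struct and function names are mine, and the geometry (heads per cylinder, sectors per track) has to come from somewhere else, such as the BIOS. Note that sector numbers start at 1 while cylinders and heads start at 0.

```c
/* Result of the conversion; names invented for this sketch. */
struct chs {
    unsigned cylinder;   /* 0-based */
    unsigned head;       /* 0-based */
    unsigned sector;     /* 1-based, by convention */
};

/* Convert a logical block address to cylinder/head/sector for a disk with
 * the given geometry. Both geometry values must be non-zero. */
struct chs lba_to_chs(unsigned long lba,
                      unsigned heads_per_cylinder,
                      unsigned sectors_per_track)
{
    struct chs out;
    unsigned long sectors_per_cylinder =
        (unsigned long)heads_per_cylinder * sectors_per_track;

    out.cylinder = (unsigned)(lba / sectors_per_cylinder);
    out.head     = (unsigned)((lba / sectors_per_track) % heads_per_cylinder);
    out.sector   = (unsigned)(lba % sectors_per_track) + 1;
    return out;
}
```

With a geometry of 16 heads and 63 sectors per track, LBA 0 maps to cylinder 0, head 0, sector 1, and LBA 1008 (16 * 63) is the first sector of cylinder 1.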

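That's the lowest level I can explain, and I'm sure the part after I assumed the operating system uses the BIOS is outdated: modern operating systems generally talk to the disk controller through their own drivers rather than through BIOS interrupts. Everything before that is, I believe, still how it works, if only at a simplified level.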

虚拟世界 2024-11-15 06:59:52

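At the lowest level, on POSIX platforms, open files are represented in userspace by "descriptors". A file descriptor is just an integer that is unique among a process's open files at any given time. The descriptor identifies which open file an operation should apply to when you ask the kernel to actually perform that operation. So read(0, charptr, 1024) reads from the open file associated with descriptor 0 (by convention, this will probably be the process's standard input).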

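As far as userspace can tell, the only parts of the file loaded into memory are those needed to satisfy an operation such as read. To read bytes from the middle of a file, another operation is supported: "seek". This tells the kernel to reposition the offset in a particular file. The next read (or write) operation will work with the bytes at that new offset. So lseek(123, 100, SEEK_SET) repositions the offset of the file associated with 123 (a descriptor value I just made up) to byte offset 100. The next read on 123 will start from that position, rather than from the beginning of the file (or wherever the offset previously was). And any bytes that are not read never need to be loaded into memory.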

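There's a little more complexity behind the scenes: disks usually can't read less than a "block" of data, which is usually some power of two around 4096 bytes, and the kernel will probably do extra caching and something called "readahead". But these are optimizations, and the basic idea is what I described above.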
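To see what the block granularity means for a small read, the arithmetic below works out which blocks a byte range touches. It assumes a 4096-byte block purely for illustration (the real size depends on the device and filesystem), and the function names are made up.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096u   /* assumed block size, for illustration only */

/* Index of the block containing byte `offset`. */
uint64_t first_block(uint64_t offset)
{
    return offset / BLOCK_SIZE;
}

/* Index of the block containing the last byte of the range.
 * Only meaningful when length > 0. */
uint64_t last_block(uint64_t offset, uint64_t length)
{
    return (offset + length - 1) / BLOCK_SIZE;
}

/* How many whole blocks must be fetched to satisfy the read. */
uint64_t blocks_touched(uint64_t offset, uint64_t length)
{
    if (length == 0)
        return 0;
    return last_block(offset, length) - first_block(offset) + 1;
}
```

So a 10-byte read at offset 100 costs one block, while the same 10 bytes at offset 4090 straddle a boundary and cost two. Userspace still only sees the 10 bytes it asked for.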