
What is the fastest way to read every 30th byte of a large binary file?

Posted on 2024-08-23 22:19:31

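What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read that fseek has performance problems because of I/O buffers, but I also don't want to read 2-3 GB of data into memory before grabbing every 30th byte.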


独夜无伴 2024-08-30 22:19:31

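I suggest that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach EOF. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster it will be.

Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred KB rather than a few thousand bytes (as I said, bigger buffer = faster file reads).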
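
A minimal sketch of that suggestion in C, assuming standard I/O and a heap-allocated buffer. The function name sum_every_nth, its signature, and the choice to sum the sampled bytes are illustrative assumptions rather than anything from the answer. The one subtle point it shows is carrying the leftover skip from one chunk into the next, so the buffer size does not have to be a multiple of 30.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sum the bytes at file offsets 0, step, 2*step, ... while reading the file
 * through a buffer of bufsz bytes. Returns 0 on success, -1 on error.
 * (Hypothetical helper: name and signature are illustrative.)
 */
int sum_every_nth(const char *path, size_t step, size_t bufsz,
                  unsigned long long *sum_out)
{
    if (step == 0 || bufsz == 0)
        return -1;

    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;

    unsigned char *buf = malloc(bufsz);
    if (buf == NULL) {
        fclose(in);
        return -1;
    }

    unsigned long long sum = 0;
    size_t next = 0;   /* index within the current chunk of the next wanted byte */
    size_t got;

    /* Element size 1, so fread returns the number of bytes actually read. */
    while ((got = fread(buf, 1, bufsz, in)) > 0) {
        while (next < got) {
            sum += buf[next];
            next += step;
        }
        next -= got;   /* carry the remaining skip into the next chunk */
    }

    int failed = ferror(in);
    free(buf);
    fclose(in);
    if (failed)
        return -1;

    *sum_out = sum;
    return 0;
}
```

A call such as sum_every_nth("rand1.data", 30, 256 * 1024, &sum) would use a 256 KB buffer; the best size for a given machine is something to measure rather than assume.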


时光礼记 2024-08-30 22:19:31

Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness, (b) desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

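#include <stdio.h>
#include <stdlib.h>

/* Build with -DMEGS=<n> plus either -DSEEK or -DBUFSZ=<bytes>. */
const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    if (in == NULL) {
        perror("fopen");
        return 1;
    }
    /* Sanity check only: a plain int, so it can overflow on large runs. */
    int total = 0;
    #if SEEK
        /* Read one byte, then seek past the next step-1 bytes. */
        long long i = 0;
        char buf[1];
        while (i < size) {
            fread(buf, 1, 1, in);
            total += (unsigned char) buf[0];
            fseek(in, step - 1, SEEK_CUR);
            i += step;
        }
    #endif
    #ifdef BUFSZ
        /* Read BUFSZ bytes at a time and sample every step-th byte of the buffer. */
        long long i = 0;
        char buf[BUFSZ];
        while (i < size) {
            fread(buf, BUFSZ, 1, in);
            i += BUFSZ;
            for (int j = 0; j < BUFSZ; j += step)
                total += (unsigned char) buf[j];
        }
    #endif
    printf("%d\n", total);
}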

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s

1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s

Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.

Now, the version with the 32KB buffer is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32-byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.
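
The real/user/sys split that `time` prints can also be captured from inside a program, which makes it easier to compare buffer sizes in one run. The following is only a sketch, assuming a POSIX system with clock_gettime and getrusage; the names run_times, time_run and burn_cpu are made up for illustration, and burn_cpu is a stand-in workload, not the benchmark above.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

struct run_times {
    double wall;  /* elapsed (wall clock) seconds */
    double user;  /* CPU seconds spent in user mode */
    double sys;   /* CPU seconds spent in the kernel */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run work(arg) once and record how its time splits up. Returns 0 on success. */
int time_run(void (*work)(void *), void *arg, struct run_times *out)
{
    struct timespec t0, t1;
    struct rusage r0, r1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0 ||
        getrusage(RUSAGE_SELF, &r0) != 0)
        return -1;

    work(arg);

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0 ||
        getrusage(RUSAGE_SELF, &r1) != 0)
        return -1;

    out->wall = (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out->user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    out->sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return 0;
}

/* Stand-in workload for demonstration: a busy loop of *(unsigned long *)arg steps. */
void burn_cpu(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += i;
}
```

When wall time is much larger than user + sys, the process spent most of the run blocked, typically on disk I/O, which is the pattern in the 8000MB runs above.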

Moral of the story: standard I/O buffering doesn't cut it in my implementation, and the performance of fseek is atrocious, as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU is busier.

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it was the latter then fseek would be better than it is. Actually, it could be that the standard I/O is doing the read ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).
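
If OS read-ahead is what matters, one way to test that idea is to bypass standard I/O and tell the kernel about the access pattern directly. This sketch assumes a POSIX system that provides posix_fadvise; the hint is advisory, and whether it changes anything on a given OS is an open question to measure, not a result from the benchmark above. The name sum_every_nth_fd and its signature are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sum the bytes at offsets 0, step, 2*step, ... using read(2) with a buffer
 * of bufsz bytes, after hinting that access will be sequential.
 * Returns 0 on success, -1 on error. (Hypothetical helper.)
 */
int sum_every_nth_fd(const char *path, size_t step, size_t bufsz,
                     unsigned long long *sum_out)
{
    if (step == 0 || bufsz == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advisory only: the kernel may ignore it, so the result is not checked. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    unsigned char *buf = malloc(bufsz);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    unsigned long long sum = 0;
    size_t next = 0;   /* index within the current chunk of the next wanted byte */
    ssize_t got;

    while ((got = read(fd, buf, bufsz)) > 0) {
        size_t n = (size_t)got;   /* read may return fewer bytes than requested */
        while (next < n) {
            sum += buf[next];
            next += step;
        }
        next -= n;                /* carry the remaining skip forward */
    }

    free(buf);
    close(fd);
    if (got < 0)
        return -1;

    *sum_out = sum;
    return 0;
}
```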


我ぃ本無心為│何有愛 2024-08-30 22:19:31

Well, you can read a byte and then seek 29 bytes in a loop. But the IO subsystem has to read from the file by sectors, which are typically 512 bytes in size, so it will still end up reading the whole file.

In the long run, it will be faster to just read the whole file in chunks that are a multiple of your step size, and then just look in the buffer. You'll make your life a bit simpler if you make sure that your buffer size is a multiple of 30, and you make the file I/O subsystem's life easier if it's a multiple of 512.

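char buf[30 * 512];   /* a multiple of both the 30-byte step and the 512-byte sector */
size_t cread;

/* fp is a FILE * opened with fopen(path, "rb") */
while ((cread = fread(buf, 1, sizeof(buf), fp)) > 0)
{
   /* element size is 1, so cread is the number of bytes actually read */
   for (size_t ii = 0; ii < cread; ii += 30)
   {
      /* process buf[ii] */
   }
}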

This may look inefficient, but it will work out to be faster than trying to read in 30 byte chunks.

By the way, if you are running on Windows and are willing to be OS-specific, you really can't beat the performance of memory-mapped files. See: How to scan through really huge files on disk?


笑咖 2024-08-30 22:19:31

If you're willing to break out of ANSI C and use OS-specific calls, I'd recommend using memory-mapped files. This is the POSIX version (Windows has its own OS-specific calls):

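#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumes 4096 is a multiple of the system page size (see sysconf(_SC_PAGESIZE)). */
#define MAPSIZE 4096
int fd = open(file, O_RDONLY);   /* file is the path to the data file */
struct stat stbuf;
fstat(fd, &stbuf);


char *addr = 0;
off_t last_mapped_offset = -1;   /* index of the MAPSIZE window currently mapped */
off_t idx = 0;
while (idx < stbuf.st_size)
{
    if (last_mapped_offset != (idx / MAPSIZE))
    {
        if (addr)
            munmap(addr, MAPSIZE);

        last_mapped_offset = idx / MAPSIZE;

        /* The file offset passed to mmap must be page-aligned. */
        addr = mmap(0, MAPSIZE, PROT_READ, MAP_PRIVATE, fd, last_mapped_offset * MAPSIZE);
    }

    *(addr + (idx % MAPSIZE));   /* the byte at file offset idx */

    idx += 30;

}

if (addr)
    munmap(addr, MAPSIZE);
close(fd);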
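
For comparison, here is a sketch that maps the whole file in a single mmap call instead of one 4096-byte window at a time. It assumes a POSIX system and an address space large enough for the file (in practice a 64-bit process for a 2-3 GB file); the function name sum_every_nth_mmap and the decision to sum the sampled bytes are illustrative. It has not been benchmarked against the windowed version.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sum the bytes at offsets 0, step, 2*step, ... of a file mapped read-only
 * in one piece. Returns 0 on success, -1 on error. (Hypothetical helper.)
 */
int sum_every_nth_mmap(const char *path, size_t step,
                       unsigned long long *sum_out)
{
    if (step == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    unsigned long long sum = 0;
    if (st.st_size > 0) {   /* mmap rejects a zero-length mapping */
        size_t len = (size_t)st.st_size;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* Pages are faulted in on first touch; no explicit read calls needed. */
        for (size_t i = 0; i < len; i += step)
            sum += p[i];
        munmap(p, len);
    }

    close(fd);
    *sum_out = sum;
    return 0;
}
```

Mapping everything at once removes the repeated munmap/mmap calls, at the cost of needing one contiguous range of address space as large as the file.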


晒暮凉 2024-08-30 22:19:31

The whole purpose of a buffered I/O library is to free you from such concerns. If you have to read every 30th byte, the OS is going to wind up reading the whole file, because the OS reads in larger chunks. Here are your options, from highest performance to lowest performance:

1. If you have a large address space (i.e., you're running a 64-bit OS on 64-bit hardware), then using memory-mapped IO (mmap on POSIX systems) will save you the cost of having the OS copy data from kernel space to user space. This savings could be significant.
2. As shown by the detailed notes below (and thanks to Steve Jessop for the benchmark), if you care about I/O performance you should download Phong Vo's sfio library from the AT&T Advanced Software Technology group. It is safer, better designed, and faster than C's standard I/O library. On programs that use fseek a lot, it is dramatically faster: up to seven times faster on a simple microbenchmark.
3. Just relax and use fseek and fgetc, which are designed and implemented exactly to solve your problem.
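
The last option in the list, plain fseek and fgetc, might look like the sketch below. It is a baseline to measure against, not a recommendation; the function name sum_every_nth_seek and its signature are illustrative assumptions.

```c
#include <stdio.h>

/*
 * Sum the bytes at offsets 0, step, 2*step, ... by reading one byte with
 * fgetc and skipping the rest with fseek. Returns 0 on success, -1 on error.
 * (Hypothetical helper.)
 */
int sum_every_nth_seek(const char *path, long step,
                       unsigned long long *sum_out)
{
    if (step <= 0)
        return -1;

    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;

    unsigned long long sum = 0;
    int failed = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        sum += (unsigned char)c;
        /* fgetc consumed one byte, so skip the other step-1 bytes. */
        if (step > 1 && fseek(in, step - 1, SEEK_CUR) != 0) {
            failed = 1;
            break;
        }
    }

    if (ferror(in))
        failed = 1;
    fclose(in);
    if (failed)
        return -1;

    *sum_out = sum;
    return 0;
}
```

Seeking past the end of the file is not an error for fseek; the loop ends when the next fgetc returns EOF.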

If you take this problem seriously, you should measure all three alternatives. Steve Jessop and I showed that using fseek is slower, and if you are using the GNU C library, fseek is a lot slower. You should measure mmap; it may be the fastest of all.

Addendum: You want to look into your filesystem and make sure it can pull 2–3 GB off the disk quickly. XFS may beat ext2, for example. Of course, if you're stuck with NTFS or HFS+, it's just going to be slow.

Shocking results just in

I repeated Steve Jessop's measurements on Linux. The GNU C library makes a system call at every fseek. Unless POSIX requires this for some reason, it's insane. I could chew up a bunch of ones and zeroes and puke a better buffered I/O library than that. Anyway, costs go up by about a factor of 20, much of which is spent in the kernel. If you use fgetc instead of fread to read single bytes, you can save about 20% on small benchmarks.

Less shocking results with a decent I/O library

I did the experiment again, this time using Phong Vo's sfio library. Reading 200MB takes

0.15s without using fseek (BUFSZ is 30k)
0.57s using fseek

Repeated measurements show that without fseek, using sfio still shaves about 10% off the run time, but the run times are very noisy (almost all time is spent in the OS).

On this machine (laptop) I don't have enough free disk space to run with a file that won't fit in the disk cache, but I'm willing to draw these conclusions:

Using a sensible I/O library, fseek is more expensive, but not more expensive enough to make a big difference (4 seconds if all you do is the I/O).
The GNU project does not provide a sensible I/O library. As is too often the case, the GNU software sucks.

Conclusion: if you want fast I/O, your first move should be to replace the GNU I/O library with the AT&T sfio library. Other effects are likely to be small by comparison.
