Is it possible to use threads to speed up file reading?
I want to read a file as fast as possible (40k lines) [Edit: the rest is obsolete].
Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I want to be sure I understood it (and thus that this is the fastest way):
- One thread per input file reads the whole file and stores its content in an associated container (-> as many containers as there are input files)
- One thread calculates the linear combination of every cell read by the input threads, and stores the results in the output container (associated with the output file).
- One thread writes the content of the output container by blocks (every 4 kB of data, so about 10 lines).
Should I deduce that I must not use memory-mapped files (because the program is on standby waiting for the data)?
Thanks in advance.
Sincerely,
Mister mystère.
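The pipeline described in the question could be sketched like this. This is a minimal sketch, not the asker's actual program: the file names, the linear-combination coefficients, and the simplification of joining the readers before combining are all assumptions; only the 4 kB flush threshold comes from the question.

```python
import threading
import queue

def read_file(path, store):
    # One reader thread per input file: read it whole and store one
    # parsed number per line in the associated container.
    with open(path) as f:
        store.extend(float(line) for line in f)

def combine(inputs, coeffs, out_queue):
    # One thread computes the linear combination of every "cell"
    # (here: line i of every input file) and queues the results.
    for cells in zip(*inputs):
        out_queue.put(sum(c * x for c, x in zip(coeffs, cells)))
    out_queue.put(None)  # sentinel: no more data

def write_output(path, out_queue, flush_bytes=4096):
    # One thread writes the output container by blocks of ~4 kB.
    buf, size = [], 0
    with open(path, "w") as f:
        while (item := out_queue.get()) is not None:
            s = f"{item}\n"
            buf.append(s)
            size += len(s)
            if size >= flush_bytes:
                f.write("".join(buf))
                buf, size = [], 0
        f.write("".join(buf))  # flush the remainder

# hypothetical demo data: two small input files
paths = ["in0.txt", "in1.txt"]
for i, p in enumerate(paths):
    with open(p, "w") as f:
        f.write("".join(f"{i + j}\n" for j in range(5)))

containers = [[] for _ in paths]
readers = [threading.Thread(target=read_file, args=(p, c))
           for p, c in zip(paths, containers)]
for t in readers:
    t.start()
for t in readers:
    t.join()  # simplification: readers finish before combining starts

q = queue.Queue()
combiner = threading.Thread(target=combine, args=(containers, [1.0, 2.0], q))
writer = threading.Thread(target=write_output, args=("out.txt", q))
combiner.start(); writer.start()
combiner.join(); writer.join()
```

A real version would let the combiner consume cells as the readers produce them, rather than waiting for the readers to finish.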
Your question got a little bit deeper when you asked further. I'll try to cover all your options...
Reading One File: How many threads?
Use one thread.
If you read straight through a file front-to-back from a single thread, the operating system will not fetch the file in small chunks like you're thinking. Rather, it will prefetch the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be like the file was already in memory, and this is even irrespective of mmap. The OS is very good at this kind of sequential file reading, because it's predictable. When you read a file from multiple threads, you're essentially reading randomly, which is (obviously) less predictable. Prefetchers tend to be much less effective with random reads, in this case probably making the whole application slower instead of faster.
Notice: This is even before you add the cost of setting up the threads and all the rest of it. That costs something, too, but it's basically nothing compared with the cost of more blocking disk accesses.
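The single-thread case is as simple as it sounds: one thread, one sequential pass, and the OS readahead does the real work. A minimal sketch (the 64 KiB buffer size is an arbitrary assumption; the demo file is a placeholder):

```python
def read_whole_file(path, bufsize=64 * 1024):
    # Plain sequential read from one thread; the OS prefetcher detects
    # this access pattern and reads ahead on our behalf.
    chunks = []
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            chunks.append(chunk)
    return b"".join(chunks)

# usage with a throwaway file
with open("demo.txt", "wb") as f:
    f.write(b"x" * 200_000)
data = read_whole_file("demo.txt")
```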
Reading Multiple Files: How many threads?
Use as many threads as you have files (or some reasonable number).
File prefetching is done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O scheduler will try to figure out the fastest order in which to read all of them. Often, there's a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.
Reading several files in parallel is always better than reading the files one-by-one. If you did read them one at a time, your disk would idle between prefetches; that's valuable time to read more data into memory! The only way you can go wrong is if you have too little RAM to support many open files; that's not common, anymore.
A word of caution: If you're overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you're back to a random-read situation.
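The one-thread-per-file pattern might look like the following sketch. The file names are placeholders, and a real program would cap the thread count at some reasonable number rather than spawning one per file unconditionally:

```python
import threading

def read_files_in_parallel(paths):
    # One reader thread per file; the kernel's I/O scheduler can reorder
    # the resulting requests, and per-file readahead still works.
    results = {}

    def reader(path):
        with open(path, "rb") as f:
            # distinct keys per thread, so this assignment doesn't race
            results[path] = f.read()

    threads = [threading.Thread(target=reader, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# usage with throwaway files
names = ["a.bin", "b.bin", "c.bin"]
for n in names:
    with open(n, "wb") as f:
        f.write(n.encode() * 10)
contents = read_files_in_parallel(names)
```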
Combining n Files into One.
Processing and producing output from multiple threads might work, but it depends how you need to combine them. You'll have to be careful about how you synchronize the threads, in any case, though there are surely some relatively easy lock-less ways to do that.
One thing to look for, though: Don't bother writing the file in small (< 4K) blocks. Collect at least 4K of data at a time before you call write(). Also, since the kernel will lock the file when you write it, don't call write() from all of your threads together; they'll all wait for each other instead of processing more data.
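The >= 4K advice can be wrapped in a tiny buffering writer so that writes are batched and only one thread ever calls write(). A sketch, not a library API: the class name is made up, and the threshold constant is an assumption based on the answer's 4K figure (short writes are ignored for brevity):

```python
import os

class BlockWriter:
    # Accumulates data and issues write() only in blocks of >= 4 KiB,
    # so the kernel sees a few large writes instead of many tiny ones.
    def __init__(self, path, threshold=4096):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.threshold = threshold
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        if len(self.buf) >= self.threshold:
            os.write(self.fd, self.buf)
            self.buf.clear()

    def close(self):
        if self.buf:
            os.write(self.fd, self.buf)  # flush whatever is left
        os.close(self.fd)

# usage: 1000 ten-byte writes become only a handful of write() calls
w = BlockWriter("blocks.out")
for _ in range(1000):
    w.write(b"0123456789")
w.close()
```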
[Edit: original question asked if launching up to 40,000 threads would speed up file read]
What you suggest would most likely slow down the access due to the overhead of creating threads and context switching. More threads only help if you are:
1) computationally bound and have extra cores that could help with the work,
2) blocking on I/O, so other threads can work while one waits to unblock, or
3) using a very clever algorithm that exploits cache behavior.
Most likely your speed is bound by disk and/or memory bandwidth not computational limits so a single execution thread would be able to max those out.
Yes, it's a waste of time. At very best you'll end up with about the same performance. At worst, it might hurt performance, with the disk seeking to different parts of the file instead of reading through it consecutively.
In contrast to other readers I believe that theoretically there can be some benefit, even if you're running on an SP (single-processor) system.
However, I'd never do this for as little as 40K lines (assuming you're talking about normal-sized lines).
The key is Amardeep's answer, where he/she says that creating threads is useful when a thread becomes blocked for some reason.
Now, how do mapped files "work"?
When you access a memory page in that region for the first time - the processor generates a page fault. The OS loads the contents of the file (this involves disk access) into the memory page. Then the execution returns to your thread.
I also believe that upon a page fault the OS fills a bunch of consecutive pages, not just a single one.
Now, what's important is that during the page fault processing your thread is suspended. Also during this period the CPU isn't loaded (apart from what other processes may do).
So that if you look at the time scale you see a period of two sections: one where CPU is loaded (here you read the contents of the page and do some processing), and one where CPU is nearly idle and the I/O on the disk is performed.
On the other hand, you may create several threads, each assigned to read a different portion of the file. You benefit from two effects:
Another thread has a chance to load the CPU (or several CPUs on an MP system) while one is blocked by I/O.
Even when the processing is very short (so the CPU is not the bottleneck), there's still a benefit. It's related to the fact that if you issue several I/Os on the same physical device, it has a chance to perform them more efficiently.
For instance, when reading many different sectors from a hard drive you can actually read them all within one disk rotation.
P.S.
And, of course, I'd never think to do this for 40K lines. The overhead of creating threads, waiting for them to finish, context switches, complication of the logic, error/failure handling, etc. wouldn't be worth it.
I'd try this for a file of at least tens of MBs.
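The page-fault-driven behaviour described above is easy to exercise with a memory-mapped file. A sketch using Python's mmap module (the file name is a placeholder; the MADV_SEQUENTIAL hint is advisory only and requires a POSIX system with Python 3.8+, hence the guard):

```python
import mmap

def sum_lines_mmap(path):
    # Each first touch of a page triggers a page fault; the kernel maps
    # in that page (and typically a few consecutive ones) from the file.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            if hasattr(m, "madvise"):            # POSIX, Python 3.8+
                m.madvise(mmap.MADV_SEQUENTIAL)  # hint: front-to-back read
            return sum(int(line) for line in iter(m.readline, b""))

# usage with a throwaway file of one integer per line
with open("nums.txt", "wb") as f:
    f.write(b"".join(b"%d\n" % i for i in range(100)))
total = sum_lines_mmap("nums.txt")
```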
This is a problem of granularity. You've got a small file, and very little processing to do. One thread can probably gobble the entire file in one time slice and process it in the next. Two threads would be worse than one. You need a much larger task before considering parallelism as a performance solution.
It's apparently a yes/no question, but somehow few people answer it with yes or no :(
I'll simplify your question to "Is it possible to use threads to speed up IO tasks?"
The answer is NO, because your I/O requests all go into the same queue and share the same disk bandwidth.
For example, if the bandwidth is 1GBps and you want to read a 1GB-size file, it would cost 1 second.
You might want to split the file into 10 smaller chunks and use 10 threads to read, but it doesn't help because the bandwidth is still 1GBps.
If you want to gain benefit from threads when reading files, you'll need more IO queues by buying more disks or use disks that have multiple queues.
I'm thinking like this.
You have 8 cores, so make 8 threads. Let each thread parse one block of the file; for that you need to get the device/disk block size. When a thread has parsed a block, let it take a new block not yet "assigned" to any thread.
Another idea I have would be to have two threads: a parsing thread, and a thread just stepping over the file's disk blocks, i.e. reading only the first byte of each block, thus forcing the file to be read into memory as fast as possible.
But this could be made into a contest. Nothing beats doing real live runs! And people will show you! :) Find a suitable prize!
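The block-grabbing idea from the first paragraph could be sketched as a shared work queue of blocks. Everything here is a placeholder assumption: the 4 KiB block size stands in for the real device block size, the 8 workers for the 8 cores, and counting newlines stands in for real parsing:

```python
import threading
import queue

def parse_in_blocks(path, block_size=4096, workers=8):
    # Workers pull not-yet-assigned blocks from a shared queue, so a
    # fast thread simply takes the next unclaimed block.
    blocks = queue.Queue()
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data), block_size):
        blocks.put(data[off:off + block_size])

    counts = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                block = blocks.get_nowait()
            except queue.Empty:
                return  # no unclaimed blocks left
            n = block.count(b"\n")  # stand-in for real parsing work
            with lock:
                counts.append(n)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)

# usage: the question's 40k lines
with open("lines.txt", "wb") as f:
    f.write(b"line\n" * 40_000)
total_lines = parse_in_blocks("lines.txt")
```

Note that in CPython the GIL means these workers won't parse in parallel anyway; this only sketches the scheduling scheme the answer describes.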