Is it possible to use threads to speed up file reading?
I want to read a file as fast as possible (40k lines) [Edit: the rest is obsolete].
Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I want to be sure I understood it (and thus that this is the fastest way):
- One thread per input file reads the whole file and stores its content in an associated container (-> as many containers as there are input files)
- One thread calculates the linear combination of every cell read by the input threads, and stores the results in the output container (associated with the output file).
- One thread writes the content of the output container by blocks (every 4 kB of data, so about 10 lines).
Should I deduce that I must not use memory-mapped files (because the program is on standby waiting for the data)?
Thanks in advance.
Sincerely,
Mister mystère.
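The pipeline described in the question could be sketched like this. This is a minimal sketch, not the asker's actual program: the file names, the linear-combination coefficients, and the simplification of joining the readers before combining are all assumptions; only the 4 kB flush threshold comes from the question.

```python
import threading
import queue

def read_file(path, store):
    # One reader thread per input file: read it whole and store one
    # parsed number per line in the associated container.
    with open(path) as f:
        store.extend(float(line) for line in f)

def combine(inputs, coeffs, out_queue):
    # One thread computes the linear combination of every "cell"
    # (here: line i of every input file) and queues the results.
    for cells in zip(*inputs):
        out_queue.put(sum(c * x for c, x in zip(coeffs, cells)))
    out_queue.put(None)  # sentinel: no more data

def write_output(path, out_queue, flush_bytes=4096):
    # One thread writes the output container by blocks of ~4 kB.
    buf, size = [], 0
    with open(path, "w") as f:
        while (item := out_queue.get()) is not None:
            s = f"{item}\n"
            buf.append(s)
            size += len(s)
            if size >= flush_bytes:
                f.write("".join(buf))
                buf, size = [], 0
        f.write("".join(buf))  # flush the remainder

# hypothetical demo data: two small input files
paths = ["in0.txt", "in1.txt"]
for i, p in enumerate(paths):
    with open(p, "w") as f:
        f.write("".join(f"{i + j}\n" for j in range(5)))

containers = [[] for _ in paths]
readers = [threading.Thread(target=read_file, args=(p, c))
           for p, c in zip(paths, containers)]
for t in readers:
    t.start()
for t in readers:
    t.join()  # simplification: readers finish before combining starts

q = queue.Queue()
combiner = threading.Thread(target=combine, args=(containers, [1.0, 2.0], q))
writer = threading.Thread(target=write_output, args=("out.txt", q))
combiner.start(); writer.start()
combiner.join(); writer.join()
```

A real version would let the combiner consume cells as the readers produce them, rather than waiting for the readers to finish.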
Your question got a little bit deeper when you asked further. I'll try to cover all your options...
Reading One File: How many threads?
Use one thread.
If you read straight through a file front-to-back from a single thread, the operating system will not fetch the file in small chunks like you're thinking. Rather, it will prefetch the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be like the file was already in memory, and this is even irrespective of mmap. The OS is very good at this kind of sequential file reading, because it's predictable. When you read a file from multiple threads, you're essentially reading randomly, which is (obviously) less predictable. Prefetchers tend to be much less effective with random reads, in this case probably making the whole application slower instead of faster.
Notice: This is even before you add the cost of setting up the threads and all the rest of it. That costs something, too, but it's basically nothing compared with the cost of more blocking disk accesses.
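The single-thread case is as simple as it sounds: one thread, one sequential pass, and the OS readahead does the real work. A minimal sketch (the 64 KiB buffer size is an arbitrary assumption; the demo file is a placeholder):

```python
def read_whole_file(path, bufsize=64 * 1024):
    # Plain sequential read from one thread; the OS prefetcher detects
    # this access pattern and reads ahead on our behalf.
    chunks = []
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            chunks.append(chunk)
    return b"".join(chunks)

# usage with a throwaway file
with open("demo.txt", "wb") as f:
    f.write(b"x" * 200_000)
data = read_whole_file("demo.txt")
```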
Reading Multiple Files: How many threads?
Use as many threads as you have files (or some reasonable number).
File prefetching is done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O scheduler will try to figure out the fastest order in which to read all of them. Often, there's a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.
Reading several files in parallel is always better than reading the files one-by-one. If you did read them one at a time, your disk would idle between prefetches; that's valuable time to read more data into memory! The only way you can go wrong is if you have too little RAM to support many open files; that's not common, anymore.
A word of caution: If you're overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you're back to a random-read situation.
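The one-thread-per-file pattern might look like the following sketch. The file names are placeholders, and a real program would cap the thread count at some reasonable number rather than spawning one per file unconditionally:

```python
import threading

def read_files_in_parallel(paths):
    # One reader thread per file; the kernel's I/O scheduler can reorder
    # the resulting requests, and per-file readahead still works.
    results = {}

    def reader(path):
        with open(path, "rb") as f:
            # distinct keys per thread, so this assignment doesn't race
            results[path] = f.read()

    threads = [threading.Thread(target=reader, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# usage with throwaway files
names = ["a.bin", "b.bin", "c.bin"]
for n in names:
    with open(n, "wb") as f:
        f.write(n.encode() * 10)
contents = read_files_in_parallel(names)
```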
Combining n Files into One.
Processing and producing output from multiple threads might work, but it depends how you need to combine them. You'll have to be careful about how you synchronize the threads, in any case, though there are surely some relatively easy lock-less ways to do that.
One thing to look for, though: Don't bother writing the file in small (< 4K) blocks. Collect at least 4K of data at a time before you call write(). Also, since the kernel will lock the file when you write it, don't call write() from all of your threads together; they'll all wait for each other instead of processing more data.
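The >= 4K advice can be wrapped in a tiny buffering writer so that writes are batched and only one thread ever calls write(). A sketch, not a library API: the class name is made up, and the threshold constant is an assumption based on the answer's 4K figure (short writes are ignored for brevity):

```python
import os

class BlockWriter:
    # Accumulates data and issues write() only in blocks of >= 4 KiB,
    # so the kernel sees a few large writes instead of many tiny ones.
    def __init__(self, path, threshold=4096):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.threshold = threshold
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        if len(self.buf) >= self.threshold:
            os.write(self.fd, self.buf)
            self.buf.clear()

    def close(self):
        if self.buf:
            os.write(self.fd, self.buf)  # flush whatever is left
        os.close(self.fd)

# usage: 1000 ten-byte writes become only a handful of write() calls
w = BlockWriter("blocks.out")
for _ in range(1000):
    w.write(b"0123456789")
w.close()
```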
[Edit: original question asked if launching up to 40,000 threads would speed up file read]
What you suggest would most likely slow down the access due to the overhead of creating threads and context switching. More threads only help if you are:
1) computationally bound and have extra cores that could help with the work,
2) blocking on I/O, so other threads can work while one waits to unblock, or
3) using a very clever algorithm that exploits cache behavior.
Most likely your speed is bound by disk and/or memory bandwidth not computational limits so a single execution thread would be able to max those out.
Yes, it's a waste of time. At very best you'll end up with about the same performance. At worst, it might hurt performance, with the disk seeking to different parts of the file instead of reading through it consecutively.
In contrast to other readers I believe that theoretically there can be some benefit, even if you're running on an SP (single-processor) system.
However, I'd never do this for as little as 40K lines (assuming you're talking about normal-sized lines).
The key is Amardeep's answer, where he/she says that creating threads is useful when a thread becomes blocked for some reason.
Now, how do mapped files "work"?
When you access a memory page in that region for the first time - the processor generates a page fault. The OS loads the contents of the file (this involves disk access) into the memory page. Then the execution returns to your thread.
I also believe that upon a page fault the OS fills a bunch of consecutive pages, not just a single one.
Now, what's important is that during the page fault processing your thread is suspended. Also during this period the CPU isn't loaded (apart from what other processes may do).
So that if you look at the time scale you see a period of two sections: one where CPU is loaded (here you read the contents of the page and do some processing), and one where CPU is nearly idle and the I/O on the disk is performed.
On the other hand, you may create several threads, each assigned to read a different portion of the file. You benefit from two effects:
Another thread has a chance to load the CPU (or several CPUs on an MP system) while one is blocked by I/O.
Even when the processing is very short (so the CPU is not the bottleneck), there's still a benefit. It's related to the fact that if you issue several I/Os on the same physical device, it has a chance to perform them more efficiently.
For instance, when reading many different sectors from a hard drive you can actually read them all within one disk rotation.
P.S.
And, of course, I'd never think to do this for 40K lines. The overhead of creating threads, waiting for them to finish, context switches, complication of the logic, error/failure handling, etc. wouldn't be worth it.
I'd try this for a file of at least tens of MBs.
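The page-fault-driven behaviour described above is easy to exercise with a memory-mapped file. A sketch using Python's mmap module (the file name is a placeholder; the MADV_SEQUENTIAL hint is advisory only and requires a POSIX system with Python 3.8+, hence the guard):

```python
import mmap

def sum_lines_mmap(path):
    # Each first touch of a page triggers a page fault; the kernel maps
    # in that page (and typically a few consecutive ones) from the file.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            if hasattr(m, "madvise"):            # POSIX, Python 3.8+
                m.madvise(mmap.MADV_SEQUENTIAL)  # hint: front-to-back read
            return sum(int(line) for line in iter(m.readline, b""))

# usage with a throwaway file of one integer per line
with open("nums.txt", "wb") as f:
    f.write(b"".join(b"%d\n" % i for i in range(100)))
total = sum_lines_mmap("nums.txt")
```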
This is a problem of granularity. You've got a small file, and very little processing to do. One thread can probably gobble the entire file in one time slice and process it in the next. Two threads would be worse than one. You need a much larger task before considering parallelism as a performance solution.
It's apparently a yes/no question, but somehow few people answer it with yes or no :(
I'll simplify your question to "Is it possible to use threads to speed up IO tasks?"
The answer is NO, because your I/O requests all go into the same queue and share the same disk bandwidth.
For example, if the bandwidth is 1GBps and you want to read a 1GB-size file, it would cost 1 second.
You might want to split the file into 10 smaller chunks and use 10 threads to read, but it doesn't help because the bandwidth is still 1GBps.
If you want to gain benefit from threads when reading files, you'll need more IO queues by buying more disks or use disks that have multiple queues.
I'm thinking like this.
You have 8 cores, so make 8 threads. Let each thread parse one block of the file; for that you need to get the device/disk block size. When a thread has parsed a block, let it take a new block not yet "assigned" to any thread.
Another idea I have would be to have two threads: a parsing thread, and a thread just stepping over the file's disk blocks, i.e. reading only the first byte of each block, thus forcing the file to be read into memory as fast as possible.
But this could be made into a contest. Nothing beats doing real live runs! And people will show you! :) Find a suitable prize!
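The block-grabbing idea from the first paragraph could be sketched as a shared work queue of blocks. Everything here is a placeholder assumption: the 4 KiB block size stands in for the real device block size, the 8 workers for the 8 cores, and counting newlines stands in for real parsing:

```python
import threading
import queue

def parse_in_blocks(path, block_size=4096, workers=8):
    # Workers pull not-yet-assigned blocks from a shared queue, so a
    # fast thread simply takes the next unclaimed block.
    blocks = queue.Queue()
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data), block_size):
        blocks.put(data[off:off + block_size])

    counts = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                block = blocks.get_nowait()
            except queue.Empty:
                return  # no unclaimed blocks left
            n = block.count(b"\n")  # stand-in for real parsing work
            with lock:
                counts.append(n)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)

# usage: the question's 40k lines
with open("lines.txt", "wb") as f:
    f.write(b"line\n" * 40_000)
total_lines = parse_in_blocks("lines.txt")
```

Note that in CPython the GIL means these workers won't parse in parallel anyway; this only sketches the scheduling scheme the answer describes.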