Multiple buffers on the same file
The procedure is as follows:
1. Filter a huge File.txt file (FASTQ format, if you are interested) line by line, using file streaming in C.
2. After each filtering pass, write the output to a filtered_i.txt file.
3. Repeat steps 1-2 with 1000 different filters.
The expected result is 1000 filtered_i.txt files, with i from 1 to 1000.
The question is:
Can I run these filtering processes in parallel?
My concern is that running them in parallel would open multiple buffers on File.txt. Is that safe? Are there any potential drawbacks?
Comments (3)
There is no best answer to your problem: here are some potential issues to take into consideration:
As for all optimisation problems, you should test different approaches and measure performance.
Here is a simple script to run 20 filters in parallel:
I would advise against opening the file multiple times in parallel. This puts a lot of strain on the OS, and if all of your threads are streaming at once, performance is likely to drop significantly because of thrashing. You'd be much better off streaming the file serially, even for large files. If you do want a parallel solution, I'd suggest making one thread the "streamer": it reads a certain number of chunks from the file and then passes those chunks to the other threads.
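The serial approach can be sketched as a single pass over the input: each line is read once and offered to every filter in turn, so the big file is streamed only one time. This is a minimal sketch under assumptions, not the poster's code: keep_line() is a hypothetical stand-in for a real filter, filter_single_pass() and MAX_LINE are names invented for the example, and it filters raw lines even though a FASTQ record normally spans four lines.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed maximum line length; fgets hands longer lines over in pieces,
   and each piece is then tested separately (a limitation of this sketch). */
#define MAX_LINE 4096

/* Hypothetical stand-in for a real filter: filter `id` keeps lines
   that contain the digit character for id % 10. */
static int keep_line(int id, const char *line) {
    return strchr(line, '0' + id % 10) != NULL;
}

/* Read in_path once and write filtered_<i>.txt for i = 1..n_filters.
   Returns 0 on success, -1 on failure. With 1000 filters this keeps 1000
   output files open at once, which may require raising the per-process
   open-file limit. */
static int filter_single_pass(const char *in_path, int n_filters) {
    assert(in_path != NULL && n_filters > 0);

    FILE *in = fopen(in_path, "r");
    if (!in) return -1;

    FILE **out = malloc((size_t)n_filters * sizeof *out);
    if (!out) { fclose(in); return -1; }
    for (int i = 0; i < n_filters; i++) out[i] = NULL;

    int rc = 0;
    for (int i = 0; i < n_filters; i++) {
        char name[64];
        snprintf(name, sizeof name, "filtered_%d.txt", i + 1);
        out[i] = fopen(name, "w");
        if (!out[i]) { rc = -1; break; }
    }

    char line[MAX_LINE];
    while (rc == 0 && fgets(line, sizeof line, in)) {
        /* One read of the line, n_filters decisions about it. */
        for (int i = 0; i < n_filters; i++) {
            if (keep_line(i + 1, line) && fputs(line, out[i]) == EOF) {
                rc = -1;
                break;
            }
        }
    }
    if (ferror(in)) rc = -1;

    for (int i = 0; i < n_filters; i++)
        if (out[i] && fclose(out[i]) != 0) rc = -1;
    free(out);
    fclose(in);
    return rc;
}
```

Calling filter_single_pass("File.txt", 1000) from main would produce filtered_1.txt through filtered_1000.txt in the current directory. The filter work in the inner loop is still serial here; the "streamer" design described above would hand each chunk of lines to worker threads instead.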
In any sane operating system, including all the big ones, it is possible and safe for different processes, or different threads of the same process, to open the same file, in parallel, for reading.
Operating systems also cache the file and perform read-ahead, so if two threads/processes read from the same file, the first one will read from disk, the OS will cache it, and the second one will read from cache.
The main thing you should worry about is to match the level of parallelism to the capabilities of the machine (number of processors, memory size) and requirements of filters (whether the filtering threads are I/O bound or CPU bound, how much memory they consume, etc.).
Note that the memory used by filters is the same memory used by the OS cache to cache the file, so if you take too much memory for the filters, you'll get a sort of thrashing where the OS flushes the cached file and then reloads it every time.
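To make the points above concrete, here is a minimal sketch in which every filter runs in its own child process, each child opens its own read-only handle on the same input, and the number of children alive at once is capped. The names keep_line(), run_filter() and filter_parallel() are hypothetical, the filter itself is a placeholder, and the default cap uses sysconf(_SC_NPROCESSORS_ONLN), which is widely available on Linux, macOS and the BSDs but is not part of POSIX. Whether the CPU count is the right cap depends on whether the filters are I/O bound or CPU bound, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_LINE 4096  /* assumed maximum line length for this sketch */

/* Hypothetical stand-in for a real filter: filter `id` keeps lines
   that contain the digit character for id % 10. */
static int keep_line(int id, const char *line) {
    return strchr(line, '0' + id % 10) != NULL;
}

/* Runs in a child process: opens a private read handle on in_path
   (own buffer, own file offset) and writes filtered_<id>.txt.
   Returns 0 on success, -1 on failure. */
static int run_filter(const char *in_path, int id) {
    char name[64], line[MAX_LINE];
    FILE *in = fopen(in_path, "r");
    if (!in) return -1;
    snprintf(name, sizeof name, "filtered_%d.txt", id);
    FILE *out = fopen(name, "w");
    if (!out) { fclose(in); return -1; }

    int rc = 0;
    while (fgets(line, sizeof line, in)) {
        if (keep_line(id, line) && fputs(line, out) == EOF) { rc = -1; break; }
    }
    if (ferror(in)) rc = -1;
    fclose(in);
    if (fclose(out) != 0) rc = -1;
    return rc;
}

/* Run filters 1..n_filters with at most max_parallel children alive.
   max_parallel < 1 means "use the number of online processors".
   Returns how many filters finished successfully. */
static int filter_parallel(const char *in_path, int n_filters, int max_parallel) {
    assert(in_path != NULL && n_filters > 0);
    if (max_parallel < 1) {
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        max_parallel = cpus > 0 ? (int)cpus : 1;
    }

    int running = 0, succeeded = 0, status;
    for (int id = 1; id <= n_filters; id++) {
        /* All slots busy: wait for one child to finish first. */
        if (running == max_parallel && wait(&status) > 0) {
            running--;
            succeeded += WIFEXITED(status) && WEXITSTATUS(status) == 0;
        }
        pid_t pid = fork();
        if (pid == 0)  /* child: do one filter, then exit without flushing parent buffers */
            _exit(run_filter(in_path, id) == 0 ? 0 : 1);
        if (pid > 0) running++;
    }
    while (running > 0 && wait(&status) > 0) {
        running--;
        succeeded += WIFEXITED(status) && WEXITSTATUS(status) == 0;
    }
    return succeeded;
}
```

A call such as filter_parallel("File.txt", 1000, 0) would run all 1000 filters with the default cap. Every child rereads the whole input, so this design relies on the OS page cache serving most of those reads; if the filters themselves use a lot of memory, the cache shrinks and the thrashing described above becomes more likely.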