Multiple buffers on the same file

Posted 2025-01-27 05:02:39


The procedure is as follows.

  1. Filter a huge file.txt (FASTQ file format, if you are interested) line by line through file streaming in C.

  2. After each filtering process, the output is a filtered_i.txt file.

  3. Repeat steps 1-2 with 1000 different filters.

  4. The expected results are 1000 filtered_i.txt files, i from 1 to 1000.

The question is:

Can I run these filtering processes in parallel?

My concern is that multiple buffers would be opened on file.txt if the processes run in parallel. Is this safe? Are there any potential drawbacks?


Comments (3)

知足的幸福 2025-02-03 05:02:39


There is no best answer to your problem: here are some potential issues to take into consideration:

  • opening the same file multiple times for reading in the same or multiple processes does not pose any problems per se, but you might run out of file handles either at the process level or at the system level.
  • if the filters use a lot of RAM for their purpose, running too many of them in parallel may cause swapping, which will significantly slow down the whole process.
  • if the file is large but fits in memory, it is likely to stay in the cache, so running filters in sequence would not cause I/O delays, but running them in parallel may take advantage of multiple cores.
  • conversely, if the file does not fit in memory, running filters in parallel should increase overall throughput, especially if they consume the same area of the file at the same time.
  • if the process is I/O bound and filters can consume one line at a time, calling them as functions in sequence in a simple loop in a process that reads one line at a time may be a simple solution. Running multiple such processes in parallel, each handling a subset of all filters can further improve the throughput.

As with all optimisation problems, you should test different approaches and measure performance.

Here is a simple script that runs the 1000 filters as 20 parallel streams of 50:

#!/bin/bash
# 20 parallel streams; each runs 50 filters sequentially,
# covering filter_1 .. filter_1000
for i in {0..19}; do
  (for j in {0..49}; do ./filter_$((j*20+i+1)); done) &
done
wait
赢得她心 2025-02-03 05:02:39


I would advise against opening a file multiple times in parallel. This puts a lot of strain on the OS, and if all of your threads are streaming at once, your performance is going to drop significantly because of thrashing. You'd be much better off streaming the file serially, even for large files. If you do want a parallel solution, I'd suggest having one thread be the "streamer", which reads a certain number of chunks from the file and then passes those chunks off to the other threads.

仙女 2025-02-03 05:02:39


In any sane operating system, including all the big ones, it is possible and safe for different processes, or different threads of the same process, to open the same file, in parallel, for reading.

Operating systems also cache the file and perform read-ahead, so if two threads/processes read from the same file, the first one will read from disk, the OS will cache it, and the second one will read from cache.

The main thing you should worry about is to match the level of parallelism to the capabilities of the machine (number of processors, memory size) and requirements of filters (whether the filtering threads are I/O bound or CPU bound, how much memory they consume, etc.).

Note that the memory used by filters is the same memory used by the OS cache to cache the file, so if you take too much memory for the filters, you'll get a sort of thrashing where the OS flushes the cached file and then reloads it every time.
