Multiple buffers on the same file

Posted 2025-01-27 05:02:39


The procedure is as follows.

  1. Filter a huge file.txt (FASTQ file format, if you are interested) line by line through file streaming in C.

  2. After each filtering process, the output is a filtered_i.txt file.

  3. Repeat steps 1-2 with 1000 different filters.

  4. The expected results are 1000 filtered_i.txt files, i from 1 to 1000.

The question is:

Can I run these filtering processes in parallel?

My concern is that multiple buffers would be opened on file.txt if the processes run in parallel. Is this safe? Are there any potential drawbacks?


Comments (3)

知足的幸福 2025-02-03 05:02:39


There is no best answer to your problem: here are some potential issues to take into consideration:

  • opening the same file multiple times for reading in the same or multiple processes does not pose any problems per se, but you might run out of file handles either at the process level or at the system level.
  • if the filters use a lot of RAM for their purpose, running too many of them in parallel may cause swapping, which will significantly slow down the whole process.
  • if the file is large but fits in memory, it is likely to stay in the cache, so running filters in sequence would not cause I/O delays, but running them in parallel may take advantage of multiple cores.
  • conversely, if the file does not fit in memory, running filters in parallel should increase overall throughput, especially if they consume the same area of the file at the same time.
  • if the process is I/O bound and filters can consume one line at a time, calling them as functions in sequence in a simple loop in a process that reads one line at a time may be a simple solution. Running multiple such processes in parallel, each handling a subset of all filters can further improve the throughput.

As with all optimisation problems, you should test different approaches and measure performance.

Here is a simple script that runs the 1000 filters as 20 parallel streams of 50:

#!/bin/bash
# 20 parallel streams; each runs 50 filters sequentially,
# covering filter_1 .. filter_1000
for i in {0..19}; do
  (for j in {0..49}; do ./filter_$((j*20+i+1)); done) &
done
wait
赢得她心 2025-02-03 05:02:39


I would advise against opening a file multiple times in parallel. This puts a lot of strain on the OS, and if all of your threads are streaming at once, your performance is going to drop significantly because of thrashing. You'd be much better off streaming the file serially, even for large files. If you do want a parallel solution, I'd suggest having one thread be the "streamer", which reads a certain number of chunks from the file and then passes those chunks off to the other threads.

仙女 2025-02-03 05:02:39


In any sane operating system, including all the big ones, it is possible and safe for different processes, or different threads of the same process, to open the same file, in parallel, for reading.

Operating systems also cache the file and perform read-ahead, so if two threads/processes read from the same file, the first one will read from disk, the OS will cache it, and the second one will read from cache.

The main thing you should worry about is to match the level of parallelism to the capabilities of the machine (number of processors, memory size) and requirements of filters (whether the filtering threads are I/O bound or CPU bound, how much memory they consume, etc.).

Note that the memory used by filters is the same memory used by the OS cache to cache the file, so if you take too much memory for the filters, you'll get a sort of thrashing where the OS flushes the cached file and then reloads it every time.
