How can I limit disk space usage in a Snakemake pipeline?

Posted on 2025-02-05 03:37:24

I work with 8 paired-end FASTQ files of 150 GB each, which need to be processed by a pipeline with space-demanding sub-tasks. I have tried several options, but I am still running out of disk space:

  • used temp() to delete output files once they are no longer needed,
  • used disk_mb resources to limit the number of parallel jobs.

I use the following invocation to limit my disk space usage to 500 GB, but apparently this is not guaranteed and usage exceeds 500 GB. How can I limit disk usage to a fixed value to avoid running out of disk space?

snakemake --resources disk_mb=500000 --use-conda --cores 16 -p

rule merge:
  input:
    fw="{sample}_1.fq.gz",
    rv="{sample}_2.fq.gz",
  output:
    temp("{sample}.assembled.fastq")
  resources:
    disk_mb=100000
  threads: 16
  shell:
    """
    merger-tool -f {input.fw} -r {input.rv} -o {output}
    """


rule filter:
  input:
    "{sample}.assembled.fastq"
  output:
    temp("{sample}.assembled.filtered.fastq")
  resources:
    disk_mb=100000
  shell:
    """
    filter-tool {input} {output}
    """


rule mapping:
  input:
    "{sample}.assembled.filtered.fastq"
  output:
    "{sample}_mapping_table.txt"
  resources:
    disk_mb=100000
  shell:
    """
    mapping-tool {input} {output}
    """



Comments (1)

情深缘浅 2025-02-12 03:37:24

Snakemake does not have functionality to actively constrain resource usage; it can only schedule jobs in a way that respects the declared resource constraints.

Right now, Snakemake uses resources to limit concurrent jobs, while your problem has a cumulative aspect: with disk_mb=500000 and each job claiming disk_mb=100000, at most five jobs run at once, but the temp() outputs of merge jobs that have already finished stay on disk until their downstream consumers run, so total usage can still exceed 500 GB. Taking a look at this answer, one way to resolve this is to introduce priority, so that downstream tasks have the highest priority.

In your particular file, it seems that adding a priority to the mapping rule should be sufficient:

rule mapping:
    input:
        "{sample}.assembled.filtered.fastq"
    output:
        "{sample}_mapping_table.txt"
    resources:
        disk_mb=100_000
    priority: 100
    shell:
        """
        mapping-tool {input} {output}
        """

You might also want to be careful about how many merge jobs are launched initially (to avoid filling up the disk space with results of merge); one possible way to do that is sketched below.
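The answer does not spell this out, but as a minimal sketch of that idea: Snakemake accepts arbitrary user-defined resources, so the merge rule could claim a hypothetical merge_slots resource (the name is an assumption, not from the original Snakefile) that is capped on the command line, keeping the number of in-flight merge jobs small regardless of the disk_mb accounting:

# Hypothetical sketch: throttle merge jobs with a user-defined resource so that
# un-consumed temp() outputs cannot pile up. Invoke with something like:
#   snakemake --resources disk_mb=500000 merge_slots=2 --use-conda --cores 16 -p
rule merge:
  input:
    fw="{sample}_1.fq.gz",
    rv="{sample}_2.fq.gz",
  output:
    temp("{sample}.assembled.fastq")
  resources:
    disk_mb=100000,
    merge_slots=1   # each merge job occupies one slot; --resources merge_slots=N caps concurrency
  threads: 16
  shell:
    """
    merger-tool -f {input.fw} -r {input.rv} -o {output}
    """

A resource that is never passed via --resources is unconstrained, so the cap only takes effect when merge_slots is given on the command line.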

