What are the different methods for naming Snakemake pipeline output files that depend on multiple variables?

Posted on 2025-01-13 18:31:35

I wrote a Snakemake pipeline that is meant to be rerun with different parameter values, which the user supplies in a new config file for each run.

config.yml:

param_a: 100 #filter dataset rule1
param_b: 200 #filter sample rule2
param_c: 300 #filter sample again rule3

config2.yml:

param_a: 150 #100->150
param_b: 200
param_c: 300

Snakefile:

rule rule1:
    # dataset is filtered by param_a
    output: "{dataset}_{param_a}/{sample}"

rule rule2:
    # sample is filtered by param_b
    output: "{dataset}_{param_a}/{sample}_{param_b}"

rule rule3:
    # sample is then filtered by param_c
    output: "{dataset}_{param_a}/{sample}_{param_b}_{param_c}"

The aim is to let the user rerun the analyses with different options at different steps without having to rerun the steps that come before the changed parameter.

With many such parameters, the directory and file names become too long, e.g.:

dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase100/mysample.bam
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase200/mysample.bam

Is there a method for easier and more efficient naming, such as automatically creating version names and saving the parameter details to a text file?

I read about the shadow directory feature, but I don't think it does what I am looking for.

Comments (1)

皇甫轩 2025-01-20 18:31:35

If you want to be very fancy, you could encode the parameters into a SHA hash or similar and use that in the file name, recording the hash and the parameter values in a table. You just need a function that takes the keyword parameters and translates them into the hash, and then use it for all your rule inputs. If I were you, though, I would use directories instead of flat file names:

dataset1/sample-minSize200/samtools-F4-F1024-q20/mosdepth-minDepth4-maxDepth100/bedtools-merge-gap200/angsd-minQ20/loci-maxBase100/mysample.bam

That would make it easier to discard everything belonging to a parameter set you no longer need, and it would make directory listings faster.
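
The hashing idea mentioned above could look roughly like the following sketch in plain Python. It is only an illustration under some assumptions: the function names (`param_hash`, `register`, `format_table`), the 10-character digest length, and the sample parameter sets are all made up for this example, and combining the digests with one directory per step is one possible way to merge both suggestions, not something Snakemake prescribes.

```python
import hashlib
import json

DIGEST_LENGTH = 10  # assumption: 10 hex characters are enough to avoid clashes


def param_hash(**params):
    """Return a short, stable hex digest for a set of keyword parameters."""
    # Sorting the keys makes the digest independent of keyword order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:DIGEST_LENGTH]


def register(table, **params):
    """Store the digest -> parameters mapping in `table` and return the digest."""
    digest = param_hash(**params)
    table[digest] = params
    return digest


def format_table(table):
    """Render the lookup table as tab-separated text, one digest per line."""
    lines = ["hash\tparams"]
    for digest in sorted(table):
        lines.append(f"{digest}\t{json.dumps(table[digest], sort_keys=True)}")
    return "\n".join(lines) + "\n"


# Hypothetical parameter sets, loosely based on the paths in the question.
table = {}
filter_step = {"minSize": 200}
samtools_step = {**filter_step, "F": [4, 1024], "q": 20}

filter_id = register(table, **filter_step)
samtools_id = register(table, **samtools_step)

# Each directory is named after the digest of all parameters up to that step,
# so changing a later parameter leaves the earlier directories untouched.
bam_path = f"dataset1/{filter_id}/{samtools_id}/mysample.bam"
print(bam_path)
print(format_table(table), end="")
```

Inside a Snakefile, digests like these could be computed from the loaded `config` near the top of the file and interpolated into the target paths. Saving the text from `format_table` next to the outputs keeps the short names interpretable later.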
