What are the different ways to name Snakemake pipeline output files that depend on multiple variables?
I wrote a Snakemake pipeline that is meant to be rerun with different variables, which the user provides in a new config file for each run.
config.yml:
param_a: 100 #filter dataset rule1
param_b: 200 #filter sample rule2
param_c: 300 #filter sample again rule3
config2.yml:
param_a: 150 #100->150
param_b: 200
param_c: 300
Snakefile:
rule rule1:
    # dataset is filtered by param_a
    output: "{dataset}_{param_a}/{sample}"
rule rule2:
    # sample is filtered by param_b
    output: "{dataset}_{param_a}/{sample}_{param_b}"
rule rule3:
    # sample is then filtered by param_c
    output: "{dataset}_{param_a}/{sample}_{param_b}_{param_c}"
The aim is to let the user rerun the analyses with different options at different steps, without having to rerun everything that comes before the step where a parameter changed.
When there are many such parameters, the directory and file names become too long, e.g.:
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase100/mysample.bam
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase200/mysample.bam
Is there an easier and more efficient naming method, such as automatically creating version names and saving the parameter details to a text file?
I read about Snakemake's shadow directory feature, but I don't think it does what I am looking for.
Comments (1)
If you want to get fancy, you could encode the params into a SHA hash or similar and use that for the filename, recording the hash and the parameter values in a table. You just need one function that takes keyword params and translates them into the hash, and then use it for all your rule inputs. If I were you, I would use directories instead of flat filenames.
That would make it easier to discard an entire parameter set that you no longer need, and it would make directory listings faster.
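A minimal sketch of that idea in plain Python, which could sit at the top of a Snakefile. Everything named here is an assumption for illustration and not part of Snakemake: the helpers `param_hash`, `step_dirs` and `record_params`, the 8-character truncation of the digest, and the tab-separated lookup table. Each step's hash covers its own parameters plus those of all earlier steps, so changing a late parameter leaves the earlier directories (and their finished outputs) untouched.

```python
import hashlib
import json
from pathlib import Path

HASH_LENGTH = 8  # assumed truncation; longer means fewer accidental collisions


def param_hash(**params):
    """Return a short, deterministic hex digest for a set of keyword params."""
    # Sorting the keys makes the digest independent of argument order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:HASH_LENGTH]


def step_dirs(dataset, steps):
    """Map each step name to a nested directory named after a cumulative hash.

    `steps` is an ordered list of (step_name, params_dict) pairs.
    """
    dirs = {}
    cumulative = {}
    path = Path(dataset)
    for name, params in steps:
        cumulative[name] = params
        # The hash for this step includes the params of every earlier step.
        path = path / f"{name}-{param_hash(**cumulative)}"
        dirs[name] = path
    return dirs


def record_params(table_path, digest, params):
    """Append 'digest<TAB>params as JSON' to a table unless already listed."""
    table = Path(table_path)
    known = set()
    if table.exists():
        with table.open() as handle:
            known = {line.split("\t", 1)[0] for line in handle if line.strip()}
    if digest not in known:
        with table.open("a") as handle:
            handle.write(f"{digest}\t{json.dumps(params, sort_keys=True)}\n")
```

In a Snakefile, one could build the step list from `config` (for example `[("rule1", {"param_a": config["param_a"]}), ...]`), use the returned paths as the directory part of each rule's `output`, and call `record_params` once per step so the table says which parameters each hash stands for. Because the directories are nested, deleting one step's directory also removes every downstream result that was computed from it.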