Link the output files of rule_1 to the input folder of rule_2 [snakemake]
I am trying to create a workflow in snakemake with two rules:
- pool_files, which creates, from a list of genomes saved in different folders, a copy of each genome in a single folder
- run_pairwise, which takes the path of the folder containing the genome copies, runs a function (in my case an ANI calculation, but that is not relevant) and saves all the results in an output folder
My issue is that the input and output of the first rule, pool_files, are single files, while the input and output of the second rule, run_pairwise, are folders. My workaround is to provide both the copied files of pool_files and the output folder of run_pairwise as inputs for rule all. However, in the best-case scenario, I get an error like:
ChildIOException: File/directory is a child to another output
The table that I read in (object gnm_table in the example below), which contains the paths of all genomes, looks like this:
       dir                  file
    0  _input/genomes/ref   aaa_v1.0.fa
    1  _input/genomes       bbb.fa
    2  _input/genomes       ccc.fa
    3  _input/genomes       ddd.fa
The tentative code that I have come up with so far looks like this:
    import os

    rule all:
        input:
            expand("_results/pool_gnms/{target}", target=gnm_table.file),
            "_plots/ANI"

    rule pool_files:
        input:
            i_gnm = lambda wildcards: os.path.join(gnm_table.dir[gnm_table.file == wildcards.target].to_string(), wildcards.target)
        output:
            gnm_link = "_results/pool_gnms/{target}",
        shell:
            'ln -s '
            '{input.i_gnm} '
            '{output.gnm_link}'

    rule calculate_ANI:
        input:
            pool_dir = "_results/pool_gnms",
        output:
            ANI_dir = directory("_results/ANI")
        shell:
            'average_nucleotide_identity.py '
            '-o {output.ANI_dir} '
            '-i {input.pool_dir}'
What strategy should I follow to accomplish this task? Maybe I should use a checkpoint?
Many thanks for any input!
Comments (1)
There is no need for checkpoints. Checkpoints are needed when you don't know in advance which files a rule will create (e.g. you don't know the number of clusters that an algorithm will find). In your case, everything you need is in gnm_table. You can define a rule that takes all the files that need to be copied by rule pool_files as its input. The output of this rule can be a flag file, and this flag can then be the input of rule calculate_ANI.
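The flag approach described above could look roughly like the sketch below. Several things in it are assumptions and not part of the question: the table location `_input/genomes.tsv`, the rule name `pool_all`, and the flag path `_results/pool_gnms.done` are hypothetical. The sketch also makes three changes to the question's code. It looks up paths through a dictionary instead of `Series.to_string()`, because `to_string()` includes the index in the returned text. It links to absolute paths, because a relative symlink target is resolved relative to the link's own folder. And `rule all` requests `_results/ANI`, the folder that `calculate_ANI` actually produces, instead of `_plots/ANI`.

```snakemake
import os
import pandas as pd

# Assumption: the genome table is a tab-separated file with the columns
# "dir" and "file"; this path is hypothetical.
gnm_table = pd.read_csv("_input/genomes.tsv", sep="\t")

# Map each file name to the absolute path of its source file
# (assumes file names are unique across folders).
gnm_paths = {
    f: os.path.abspath(os.path.join(d, f))
    for d, f in zip(gnm_table["dir"], gnm_table["file"])
}

rule all:
    input:
        "_results/ANI"

rule pool_files:
    input:
        i_gnm = lambda wildcards: gnm_paths[wildcards.target]
    output:
        gnm_link = "_results/pool_gnms/{target}"
    shell:
        "ln -s {input.i_gnm} {output.gnm_link}"

# Hypothetical aggregation rule: it requires every pooled genome and
# writes an empty flag file once all of them exist.
rule pool_all:
    input:
        expand("_results/pool_gnms/{target}", target=gnm_table["file"].tolist())
    output:
        touch("_results/pool_gnms.done")

rule calculate_ANI:
    input:
        flag = "_results/pool_gnms.done"
    params:
        # The folder is passed as a parameter, not declared as an input.
        pool_dir = "_results/pool_gnms"
    output:
        ANI_dir = directory("_results/ANI")
    shell:
        "average_nucleotide_identity.py -o {output.ANI_dir} -i {params.pool_dir}"
```

The flag is kept outside `_results/pool_gnms` on purpose: a file inside that folder would match the `{target}` pattern of `pool_files`, and the ANI script would also see it among the genomes. Because the pooled folder is never declared as an input or output, no output ends up nested inside another rule's output, which should avoid the ChildIOException.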