Link rule_1 output target files with rule_2 input target folder [snakemake]

Posted on 2025-01-12 06:38:21

I am trying to create a workflow in snakemake with two rules:

  • pool_files, which takes a list of genomes saved in different folders and creates a copy of each genome in the same folder
  • run_pairwise (calculate_ANI in the code below), which takes the path of the folder containing the genome copies, runs a function (in my case an ANI calculation, but that is not relevant) and saves all the results in an output folder

My issue is that the input and output of the first rule, pool_files, are single files, while the input and output of the second rule, run_pairwise, are folders. My workaround is to provide both the copied files from pool_files and the output folder of run_pairwise as inputs to rule all. However, in the best-case scenario, I get an error like:

ChildIOException: File/directory is a child to another output

The table (object gnm_table in the example below) that I read in and that contains the path of all genomes looks like this:

                  dir          file
0  _input/genomes/ref   aaa_v1.0.fa
1      _input/genomes        bbb.fa
2      _input/genomes        ccc.fa
3      _input/genomes        ddd.fa

The tentative code that I have come up with so far looks like this:

import os

rule all:
    input:
        expand("_results/pool_gnms/{target}", target=gnm_table.file),
        "_plots/ANI"


rule pool_files:
    input:
        i_gnm = lambda wildcards: os.path.join(gnm_table.dir[gnm_table.file == wildcards.target].to_string(), wildcards.target)
    output:
        gnm_link = "_results/pool_gnms/{target}",
    shell:
        'ln -s '
        '{input.i_gnm} '
        '{output.gnm_link}'


rule calculate_ANI:
    input:
        pool_dir = "_results/pool_gnms",
    output:
        ANI_dir = directory("_results/ANI")
    shell:
        'average_nucleotide_identity.py '
        '-o {output.ANI_dir} '
        '-i {input.pool_dir}'

What strategy should I follow to accomplish this task? Maybe I should use a checkpoint?
Many thanks for any input!


Comments (1)

小ぇ时光︴ 2025-01-19 06:38:22

There is no need for checkpoints. Checkpoints are needed when you don't know in advance which files a rule will create (e.g. you don't know the number of clusters that an algorithm will find). In your case, everything you need is already in gnm_table. You can define a rule that takes as its input all the files that rule pool_files needs to copy. The output of this rule can be a flag file, and this flag can then be the input for rule calculate_ANI.
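
A rough sketch of how this could look, not a tested workflow. Several things here are assumptions and not from the question: the table file name genomes.tsv, the rule name pool_done, the flag path _results/pool_gnms.done, and the use of "_results/ANI" as the final target (the "_plots/ANI" target in the question has no rule that produces it). The flag is deliberately placed outside both the pool folder and the ANI folder, so that it is neither picked up as a genome nor a child of a directory() output. The lookup uses .iloc[0] because Series.to_string() would also print the index label, and the link points at an absolute path because a relative symlink target is resolved relative to the folder the link lives in.

```
import os
import pandas as pd

# Assumption: tab-separated table with columns "dir" and "file";
# the file name "genomes.tsv" is hypothetical.
gnm_table = pd.read_csv("genomes.tsv", sep="\t")

POOL_DIR = "_results/pool_gnms"
POOL_FLAG = POOL_DIR + ".done"   # hypothetical flag path, outside POOL_DIR


def source_path(wildcards):
    """Return the original location of the genome named by {target}."""
    # .iloc[0] gives the bare string without the index label.
    src_dir = gnm_table.loc[gnm_table["file"] == wildcards.target, "dir"].iloc[0]
    return os.path.join(src_dir, wildcards.target)


rule all:
    input:
        "_results/ANI"


rule pool_files:
    input:
        i_gnm = source_path
    output:
        gnm_link = POOL_DIR + "/{target}"
    params:
        # Absolute source path, so the symlink resolves from inside POOL_DIR.
        src = lambda wildcards, input: os.path.abspath(input.i_gnm)
    shell:
        "ln -s {params.src} {output.gnm_link}"


# Flag rule: requires every pooled genome, then writes a single marker file.
rule pool_done:
    input:
        expand(POOL_DIR + "/{target}", target=gnm_table["file"])
    output:
        flag = POOL_FLAG
    shell:
        "touch {output.flag}"


rule calculate_ANI:
    input:
        flag = POOL_FLAG
    output:
        ANI_dir = directory("_results/ANI")
    params:
        # The folder is passed as a parameter, not as an input file.
        pool_dir = POOL_DIR
    shell:
        "average_nucleotide_identity.py "
        "-o {output.ANI_dir} "
        "-i {params.pool_dir}"
```

With this layout calculate_ANI depends only on the flag, so Snakemake schedules every pool_files job first. The folder itself never appears as an input or output of any rule.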
