Snakemake: expand incorrectly defining output folder

Posted on 2025-02-13 08:52:04

I have a workaround based upon this discussion, so I don't think this problem is especially urgent.

However, before applying code to a larger number of folders, I would like to see if I can better understand what went wrong in an earlier version of the code.

Here is the Snakemake code:

import pandas as pd
import os

data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
SAMPLES = data["Subfolder"].tolist()
OUTPREFIXES = data["Output"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]

rule all:
    input:
        expand(os.path.join("{outf}","{sample}"), zip, outf=OUTPREFIXES, sample=SAMPLES)

rule copy_folders:
    input:
        infolder = get_input_folder,
        outfolder = get_output_folder
    output:
        subfolder = directory(os.path.join("{outf}","{sample}"))
    resources:
        mem_mb=1000,
        cpus=1
    shell:
        "cp -R {input.infolder} {input.outfolder}"

I think that the problem is that the {outf} and {sample} variables are not being defined correctly.

For example, let's say {outf} can be further divided into {outf-PREFIX} and {outf-SUBFOLDER}, so {outf} is {outf-PREFIX}/{outf-SUBFOLDER}.

Here is the error message that I am seeing, with those placeholders instead of the observed values:

Building DAG of jobs...
InputFunctionException in line 22 of /path/to/Snakefile:
KeyError: '{outf-SUBFOLDER}'
Wildcards:
outf={outf-PREFIX}
sample={outf-SUBFOLDER}

In other words, the value of {sample} is not being used. I am assuming that the problem relates to the expand command.

Instead, {outf} and {sample} are being defined from components that would define the full {outf} ({outf-PREFIX} and {outf-SUBFOLDER}). So, I think the problem could be solved if Snakemake instead created the following mapping:

outf={outf}
sample={sample}
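
As a point of reference, here is a plain-Python sketch of the target list that expand(..., zip, ...) would be expected to produce, assuming it pairs the two lists element by element rather than forming every combination. It does not call Snakemake. The paths are hypothetical placeholders standing in for the real contents of mapping_list.csv.

```python
import os

# Hypothetical placeholder values standing in for the "Output" and
# "Subfolder" columns of mapping_list.csv.
OUTPREFIXES = [
    "/path/to/OutputPrefixA/OutputFolderA",
    "/path/to/OutputPrefixB/OutputFolderB",
]
SAMPLES = ["SampleA", "SampleB"]

# Pair the i-th output folder with the i-th sample, which is the
# behaviour assumed here for expand(..., zip, ...).
targets = [os.path.join(outf, sample) for outf, sample in zip(OUTPREFIXES, SAMPLES)]

for target in targets:
    print(target)
```

If the requested targets look like this, each one carries the intended {outf} and {sample} pair, so the list built by expand may not be where the unexpected values come from.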

I also encounter a similar problem with the following code:

import pandas as pd
import os

data = pd.read_csv("mapping_list.csv").set_index('FullOutSubfolder', drop=False)
FULLOUTS = data["FullOutSubfolder"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]
    
rule all:
    input:
        expand("{sample}", sample=FULLOUTS)

rule copy_folders:
    input:
        infolder = get_input_folder,
        outfolder = get_output_folder
    output:
        subfolder = directory("{sample}")
    resources:
        mem_mb=1000,
        cpus=1
    shell:
        "cp -R {input.infolder} {input.outfolder}"

In that situation, the output folder path is being truncated as a wildcard (losing the equivalent of the original {sample}), similar to the truncated {outf} above.

Can anybody please explain the problem or provide any suggestions?

Thank you very much!

Sincerely,

Charles

Update (7/7/2022): I believe that there was some confusion, so I hope that the additional information helps.

Here is an example with placeholder information for 2 lines similar to what would be seen in mapping_list.csv:

FPID,Input,Output,Subfolder,FullOutSubfolder
fp1,/path/to/InputFolderA/SampleA,/path/to/OutputPrefixA/OutputFolderA,SampleA,/path/to/OutputPrefixA/OutputFolderA/SampleA
fp2,/path/to/InputFolderB/SampleB,/path/to/OutputPrefixB/OutputFolderB,SampleB,/path/to/OutputPrefixB/OutputFolderB/SampleB
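
The lookup that the input functions perform on this table can be sketched outside Snakemake as follows. This is only an illustration: it uses the standard-library csv module in place of pandas to stay self-contained, and rows_by_subfolder and lookup are hypothetical names that do not appear in the Snakefile.

```python
import csv
import io

# The two placeholder rows from the example above.
MAPPING_CSV = """FPID,Input,Output,Subfolder,FullOutSubfolder
fp1,/path/to/InputFolderA/SampleA,/path/to/OutputPrefixA/OutputFolderA,SampleA,/path/to/OutputPrefixA/OutputFolderA/SampleA
fp2,/path/to/InputFolderB/SampleB,/path/to/OutputPrefixB/OutputFolderB,SampleB,/path/to/OutputPrefixB/OutputFolderB/SampleB
"""

# Index the rows by the Subfolder column, mirroring set_index('Subfolder').
rows_by_subfolder = {
    row["Subfolder"]: row for row in csv.DictReader(io.StringIO(MAPPING_CSV))
}

def lookup(sample, column):
    """Return one column for a sample; raises KeyError for unknown samples."""
    return rows_by_subfolder[sample][column]

# A sample name that exists in the table resolves normally.
print(lookup("SampleA", "Input"))

# A wildcard value that is not a Subfolder entry fails, as in the traceback.
try:
    lookup("OutputFolderA", "Input")
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)
```

The lookup only succeeds when the wildcard value is one of the Subfolder entries, so any other value of {sample} would produce a KeyError like the one reported.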

To use that example, there are no variables called {outf-PREFIX} and {outf-SUBFOLDER}.

Instead, these are the intended values for the 1st row:

{outf}=/path/to/OutputPrefixA/OutputFolderA

{sample}=SampleA

and these are the values incorrectly defined by Snakemake:

{outf}=/path/to/OutputPrefixA

{sample}=OutputFolderA

So, my understanding is that the intended value of {sample} is not being used, and both variables are being defined from splitting the path from {outf}.
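
One possible reading of these values, which has not been confirmed against Snakemake itself, can be sketched with plain Python regular expressions. The sketch assumes that each wildcard behaves like the regex .+ (the default described in the Snakemake documentation) and that the output pattern of copy_folders is matched against whole paths. match_wildcards and OUTPUT_PATTERN are hypothetical names used only for illustration.

```python
import re

# Assumed regex equivalent of os.path.join("{outf}", "{sample}"),
# with each wildcard matching ".+".
OUTPUT_PATTERN = r"(?P<outf>.+)/(?P<sample>.+)"

def match_wildcards(pattern, path):
    """Return the wildcard values the pattern assigns to path, or None."""
    match = re.fullmatch(pattern, path)
    return match.groupdict() if match else None

# The target requested by rule all for the 1st row.
target = "/path/to/OutputPrefixA/OutputFolderA/SampleA"
# The path that get_output_folder returns as an *input* of copy_folders.
outfolder_input = "/path/to/OutputPrefixA/OutputFolderA"

target_wildcards = match_wildcards(OUTPUT_PATTERN, target)
input_wildcards = match_wildcards(OUTPUT_PATTERN, outfolder_input)

print(target_wildcards)
print(input_wildcards)
```

Under these assumptions the requested target yields the intended values, while the Output folder path yields the observed ones. Since copy_folders lists that folder as an input and its output pattern also matches that path, the error might come from Snakemake trying to create the input folder with the same rule, rather than from expand. If so, wildcard_constraints or passing the destination through params instead of input may be worth testing.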
