Snakemake: expand incorrectly defining output folder
I have a workaround based upon this discussion, so I don't think this problem is especially urgent.
However, before applying code to a larger number of folders, I would like to see if I can better understand what went wrong in an earlier version of the code.
Here is the Snakemake code:
import pandas as pd
import os

data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
SAMPLES = data["Subfolder"].tolist()
OUTPREFIXES = data["Output"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]

rule all:
    input:
        expand(os.path.join("{outf}","{sample}"), zip, outf=OUTPREFIXES, sample=SAMPLES)

rule copy_folders:
    input:
        infolder = get_input_folder,
        outfolder = get_output_folder
    output:
        subfolder = directory(os.path.join("{outf}","{sample}"))
    resources:
        mem_mb=1000,
        cpus=1
    shell:
        "cp -R {input.infolder} {input.outfolder}"
I think that the problem is that the {outf} and {sample} variables are not being defined correctly.
For example, let's say {outf} can be further divided into {outf-PREFIX} and {outf-SUBFOLDER}, so {outf} is {outf-PREFIX}/{outf-SUBFOLDER}.
Here is the error message that I am seeing, with those placeholders instead of the observed values:
Building DAG of jobs...
InputFunctionException in line 22 of /path/to/Snakefile:
KeyError: '{outf-SUBFOLDER}'
Wildcards:
outf={outf-PREFIX}
sample={outf-SUBFOLDER}
In other words, the value of {sample} is not being used. I am assuming that the problem relates to the expand command.
Instead, {outf} and {sample} are being defined from the components that would make up the full {outf} ({outf-PREFIX} and {outf-SUBFOLDER}). So, I think the problem could be solved if Snakemake instead created the following mapping:
outf={outf}
sample={sample}
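As a rough check of what the expand call with zip is expected to produce, here is a minimal plain-Python sketch. It does not call Snakemake; expand_zip is a hypothetical stand-in written for illustration, and the two lists are placeholder values in the style of the "Output" and "Subfolder" columns of mapping_list.csv. It assumes that expand with zip pairs the lists element by element rather than taking every combination.

```python
import os

# Placeholder values in the style of the "Output" and "Subfolder" columns
# (assumptions for illustration, not real paths).
OUTPREFIXES = [
    "/path/to/OutputPrefixA/OutputFolderA",
    "/path/to/OutputPrefixB/OutputFolderB",
]
SAMPLES = ["SampleA", "SampleB"]


def expand_zip(outprefixes, samples):
    """Hypothetical stand-in for
    expand(os.path.join("{outf}", "{sample}"), zip, outf=..., sample=...).

    Pairs the two lists element by element and fills in the pattern.
    """
    pattern = os.path.join("{outf}", "{sample}")
    return [pattern.format(outf=o, sample=s) for o, s in zip(outprefixes, samples)]


targets = expand_zip(OUTPREFIXES, SAMPLES)
for target in targets:
    print(target)
```

If the real expand call behaves like this sketch, the targets requested by rule all are the full intended paths, which would suggest that the unexpected wildcard values come from a later matching step rather than from expand itself.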
I also encounter a similar problem with the following code:
import pandas as pd
import os

data = pd.read_csv("mapping_list.csv").set_index('FullOutSubfolder', drop=False)
FULLOUTS = data["FullOutSubfolder"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]

rule all:
    input:
        expand("{sample}", sample=FULLOUTS)

rule copy_folders:
    input:
        infolder = get_input_folder,
        outfolder = get_output_folder
    output:
        subfolder = directory("{sample}")
    resources:
        mem_mb=1000,
        cpus=1
    shell:
        "cp -R {input.infolder} {input.outfolder}"
In that situation, the output folder path is being truncated as a wildcard (losing the equivalent of the original {sample}), similar to the truncated {outf} above.
Can anybody please explain the problem or provide any suggestions?
Thank you very much!
Sincerely,
Charles
Update (7/7/2022): I believe that there was some confusion, so I hope that the additional information helps.
Here is an example with placeholder information for 2 lines similar to what would be seen in mapping_list.csv:
FPID,Input,Output,Subfolder,FullOutSubfolder
fp1,/path/to/InputFolderA/SampleA,/path/to/OutputPrefixA/OutputFolderA,SampleA,/path/to/OutputPrefixA/OutputFolderA/SampleA
fp2,/path/to/InputFolderB/SampleB,/path/to/OutputPrefixB/OutputFolderB,SampleB,/path/to/OutputPrefixB/OutputFolderB/SampleB
To use that example, there are no variables called {outf-PREFIX} and {outf-SUBFOLDER}.
Instead, these are the intended values for the 1st row:
{outf}=/path/to/OutputPrefixA/OutputFolderA
{sample}=SampleA
and these are the values incorrectly defined by Snakemake:
{outf}=/path/to/OutputPrefixA
{sample}=OutputFolderA
So, my understanding is that the intended value of {sample} is not being used, and both variables are being defined from splitting the path from {outf}.
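One possible reading of these values, offered as an assumption rather than a confirmed diagnosis: the rule lists the output folder (the "Output" column) under input, so Snakemake may look for a rule that can create that path, and the output pattern {outf}/{sample} of the same rule happens to match it. The sketch below imitates that matching in plain Python. It assumes each wildcard behaves like the greedy regular expression .+ when no constraint is given; pattern_to_regex and match_wildcards are hypothetical helpers, not Snakemake functions.

```python
import re


def pattern_to_regex(pattern):
    """Turn "{outf}/{sample}" into a regex with one named group per wildcard.

    Assumption: an unconstrained wildcard matches like the greedy regex ".+".
    """
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold wildcard names
            pieces.append(f"(?P<{part}>.+)")
        else:      # even positions hold literal text
            pieces.append(re.escape(part))
    return "".join(pieces)


def match_wildcards(pattern, path):
    """Return the wildcard values if path matches pattern, else None."""
    m = re.fullmatch(pattern_to_regex(pattern), path)
    return m.groupdict() if m else None


# The target requested by rule all splits as intended.
intended = match_wildcards("{outf}/{sample}", "/path/to/OutputPrefixA/OutputFolderA/SampleA")
print(intended)

# The path returned by get_output_folder also matches the same pattern.
observed = match_wildcards("{outf}/{sample}", "/path/to/OutputPrefixA/OutputFolderA")
print(observed)
```

Under these assumptions, the second match reproduces the values reported in the error message (sample=OutputFolderA), which is not a key in the "Subfolder" index and would explain the KeyError. If that reading is right, dropping the output folder from input, or adding a wildcard constraint, would be worth testing.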
Comments (1)
I am still curious what caused the earlier problem.
However, if somebody else encounters the same problem, then I do have a workaround.
The details are in the following post:
Snakemake: Mismatched Wildcards Variable Values for "output" Rule
Basically, I added extra shell script commands and I copied a small file into the output directly. I then used the small file as the endpoint, instead of the copied output directory.