Snakemake: Mismatched wildcard variable values for rule "output"
I am encountering a problem that doesn't seem to occur consistently between folders.
Essentially, I thought I had a Snakemake pipeline that would work to copy files into folders (with different destinations for different subfolders). I am currently accomplishing this with some Python dictionaries as well as 2 wildcard values.
However, I am currently encountering a problem that I believe is due to a mismatch between the {outf} and {sample} wildcard values.
Brief Description
I believe that the wildcards are defined with rule all:
    rule all:
        input:
            expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)
In the example that I will describe below:
- Pairing of {outf} and {sample} is correct for input
- Pairing of {outf} and {sample} is not correct in the log output for output
- Pairing of {outf} and {sample} is not correct in the log output for wildcards
Additional Details
I am removing some details related to the exact formatting, but the code is basically as follows:
    import pandas as pd
    import os
    import re

    data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
    SAMPLES = data["Subfolder"].tolist()
    OUTPREFIXES = data["Output"].tolist()

    def get_input_folder(wildcards):
        return data.loc[wildcards.sample]["Input"]

    def get_output_folder(wildcards):
        return data.loc[wildcards.sample]["Output"]

    rule all:
        input:
            expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

    rule copy_folders:
        input:
            infolder = directory(get_input_folder),
            outfolder = directory(get_output_folder),
        output:
            os.path.join("{outf}","{sample}","methods.txt"),
        resources:
            mem_mb=2000,
            cpus=1
        shell:
            '''
            SHOUT1={input.outfolder}
            ...
            cp -R {input.infolder} $SHOUT1
            TEMPSAMPLE=$(basename {input.infolder})
            SHEND={input.outfolder}/$TEMPSAMPLE
            ...
            cp ../methods.txt $SHEND
            '''
I am receiving the following error message:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 22 of /path/to/Snakefile:
Missing files after 5 seconds:
[Variable Destination Folder B]/[Sample A]/methods.txt
I believe that I can see the problem in an earlier part of the log:
    rule copy_folders:
        input: /common/folder/path/[Sample A], [Variable Destination Folder A]
        output: [Variable Destination Folder B]/[Sample A]/methods.txt
        jobid: 171
        wildcards: outf=[Variable Destination Folder B], sample=[Sample A]
        resources: mem_mb=2000, cpus=1
I have a sample sheet where various folders are paired with a unique sample ID. On a given line, you would find [Sample A] and [Variable Destination Folder A]. On a different line, you would find [Sample B] and [Variable Destination Folder B], etc.
In other words, the mismatch for the wildcards at the earlier step matches the error message in that it describes a file that should not be created at that point (because the values for {outf} and {sample} are not matched correctly, for different lines "A" and "B").
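To make the mismatch concrete, here is a minimal sketch in plain Python (no Snakemake involved). The two-row sheet and the names SampleA, SampleB, DestA and DestB are hypothetical stand-ins for the bracketed placeholders above. It shows how a lookup keyed only on sample can send the copy to one folder while the output pattern, filled from both wildcards, points at another.

```python
import os
from types import SimpleNamespace

# Hypothetical two-row sample sheet keyed by sample (assumed names, not real data).
SHEET = {
    "SampleA": {"Input": "/common/folder/path/SampleA", "Output": "DestA"},
    "SampleB": {"Input": "/common/folder/path/SampleB", "Output": "DestB"},
}

def get_input_folder(wildcards):
    # Like the Snakefile's input function: only `sample` is consulted.
    return SHEET[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    # Also keyed on `sample` alone, so `outf` never influences the result.
    return SHEET[wildcards.sample]["Output"]

# The job from the log: outf taken from row B, sample taken from row A.
wildcards = SimpleNamespace(outf="DestB", sample="SampleA")

# Where the shell commands would write methods.txt (sample-only lookup).
written = os.path.join(
    get_output_folder(wildcards),
    os.path.basename(get_input_folder(wildcards)),
    "methods.txt",
)

# What the pattern "{outf}/{sample}/methods.txt" resolves to for this job.
expected = os.path.join(wildcards.outf, wildcards.sample, "methods.txt")

print("written: ", written)
print("expected:", expected)
print("match:   ", written == expected)
```

Under these assumptions the two paths differ, which is consistent with a file being reported as missing even though the copy itself went to the right place.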
The methods.txt file is not strictly needed. However, I encountered problems when trying to use a directory as the endpoint, so I copied an extra file and I used that as the endpoint. If it helps, I can share the earlier code. However, for one different folder with a smaller number of subfolders to copy and less complicated destination folders, something similar to the current code appeared to work successfully.
I had an earlier version of the code to try and make sure that the shell environment variables were "local" to each folder. I think the use of "local" caused a problem in itself, with an error message indicating that it can only be used within a function.
However, if I use the similarly simplified portion of the shell code, then the paths were filled in as follows:
local SHOUT1=[Variable Destination Folder A]
...
cp -R /common/folder/path/[Sample A] $SHOUT1
local TEMPSAMPLE=$(basename /common/folder/path/[Sample A])
local SHEND=[Variable Destination Folder A]/$TEMPSAMPLE
...
cp ../methods.txt $SHEND
In other words, it looks like the paths for the shell command were correct (all for line "A" in the sample mapping file). I assume this is because they only used input wildcards values, because I noticed a problem with the variable mismatching. Some troubleshooting was added to be able to handle a folder with a space in the name (where different parts of the same script need to use "\ " versus " " to run correctly), but I am excluding those folders to try and simplify the most immediate troubleshooting. However, I can't run the Snakemake script if I can't specify the output value correctly.
Any assistance with troubleshooting would be greatly appreciated!
I thought this should be a relatively simple example to start learning Snakemake for what is basically cp -R $INPUTSUBFOLDER $OUTPUTFOLDER, but perhaps there are more complications than I realized.
Sincerely,
Charles
Comments (2)
To me it looks like it pairs the input to your copy_folders rule correctly because you're using an input function that only uses your sample wildcard to get it. For the output, though, there's a mismatch because if you run the Snakefile without specifying another target, it wants all combinations of sample and outf that you specified in rule all.
If you only want to pair [Sample A] with [Variable Destination Folder A] and so on, you'll need to change how Snakemake handles your expand() in rule all.
Right now, what you have is
    expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)
This pairs all prefixes in OUTPREFIXES with all samples in SAMPLES, which is the standard behavior of expand(). You can specify a different combinatoric function in expand(), though: if you only want to combine the first sample with the first destination, the second with the second etc., your rule all should instead use zip, like so:
    expand(os.path.join("{outf}","{sample}","methods.txt"), zip, outf=OUTPREFIXES, sample=SAMPLES)
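The difference between combining every prefix with every sample and pairing them position by position can be sketched in plain Python with itertools.product and the built-in zip. This only illustrates the two combination strategies and is not Snakemake's own code. The list contents (SampleA, DestA, and so on) are hypothetical stand-ins for the sample sheet.

```python
from itertools import product

# Hypothetical stand-ins for the two lists built from the sample sheet.
SAMPLES = ["SampleA", "SampleB"]
OUTPREFIXES = ["DestA", "DestB"]

PATTERN = "{outf}/{sample}/methods.txt"

# All-combinations behaviour: every prefix is combined with every sample.
all_combinations = [
    PATTERN.format(outf=outf, sample=sample)
    for outf, sample in product(OUTPREFIXES, SAMPLES)
]

# Pairwise behaviour: the i-th prefix goes with the i-th sample only.
paired = [
    PATTERN.format(outf=outf, sample=sample)
    for outf, sample in zip(OUTPREFIXES, SAMPLES)
]

print(len(all_combinations), "targets with product:")
for target in all_combinations:
    print("  ", target)

print(len(paired), "targets with zip:")
for target in paired:
    print("  ", target)
```

With two rows, the all-combinations list contains four targets, including DestB/SampleA/methods.txt, while the pairwise list contains only the two targets that correspond to actual rows of the sheet.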

It would help if you provided a sample sheet and minimal Snakefile to reproduce the error.
From what I can see, you get the missing file error for [Variable Destination Folder B]/[Sample A]/methods.txt because you don't have code that actually creates that file. Also, it is a bit odd to have outfolder listed in input, but that may be due to something that happens earlier in the pipeline? I would do:
I use touch to create the dummy file methods.txt that signals the completion of the rule; there may be other or better ways to handle the situation.
Note that the directory function shouldn't be used in the input directive.
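As a rough illustration of the copy-then-touch idea, here is a plain-Python sketch of the file operations only. It is not a Snakemake rule and not the code from the answer above. The helper name copy_and_mark, the folder names, and the layout outfolder/<sample>/methods.txt are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

def copy_and_mark(infolder: Path, outfolder: Path) -> Path:
    """Copy infolder into outfolder, then create an empty methods.txt marker.

    Hypothetical helper: assumes the copy should land in outfolder/<folder name>.
    """
    target = outfolder / infolder.name
    # Recursive copy of the folder; dirs_exist_ok requires Python 3.8+.
    shutil.copytree(infolder, target, dirs_exist_ok=True)
    marker = target / "methods.txt"
    marker.touch()  # same effect as the shell `touch` command
    return marker

# Throwaway demonstration in a temporary directory (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "input" / "SampleA"
    src.mkdir(parents=True)
    (src / "data.txt").write_text("example\n")

    dest = root / "DestA"
    marker = copy_and_mark(src, dest)

    marker_exists = marker.exists()
    copied_exists = (dest / "SampleA" / "data.txt").exists()
    marker_relative = marker.relative_to(root).as_posix()
    print(marker_relative, marker_exists, copied_exists)
```

The point of the marker is that it is created by the same step that does the copy, so its path is always derived from the folder that was actually written to.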