Snakemake: Mismatched wildcard values in a rule's output

Posted 2025-02-11 01:26:54

I am encountering a problem that doesn't seem to occur consistently between folders.

Essentially, I thought I had a Snakemake pipeline that would work to copy files into folders (with different destinations for different subfolders). I am currently accomplishing this with some Python dictionaries as well as 2 wildcard values.

However, I am currently encountering a problem that I believe is due to a mismatch between the {outf} and {sample} wildcards values.

Brief Description

I believe that the wildcards are defined with rule all:

rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

In the example that I will describe below:

Pairing of {outf} and {sample} is correct for input
Pairing of {outf} and {sample} is not correct in the log output for output
Pairing of {outf} and {sample} is not correct in the log output for wildcards

Additional Details

I am removing some details related to the exact formatting, but the code is basically as follows:

import pandas as pd
import os
import re

data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
SAMPLES = data["Subfolder"].tolist()
OUTPREFIXES = data["Output"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]
    
rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

rule copy_folders:
    input:
        infolder = directory(get_input_folder),
        outfolder = directory(get_output_folder),
    output:
        os.path.join("{outf}","{sample}","methods.txt"),
    resources:
        mem_mb=2000,
        cpus=1
    shell:
        '''
        SHOUT1={input.outfolder}
        ...
        cp -R {input.infolder} $SHOUT1
        
        TEMPSAMPLE=$(basename {input.infolder})
        SHEND={input.outfolder}/$TEMPSAMPLE
        ...
        cp ../methods.txt $SHEND
        '''

I am receiving the following error message:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 22 of /path/to/Snakefile:
Missing files after 5 seconds:
[Variable Destination Folder B]/[Sample A]/methods.txt

I believe that I can see the problem in an earlier part of the log:

rule copy_folders:
    input: /common/folder/path/[Sample A], [Variable Destination Folder A]
    output: [Variable Destination Folder B]/[Sample A]/methods.txt
    jobid: 171
    wildcards: outf=[Variable Destination Folder B], sample=[Sample A]
    resources: mem_mb=2000, cpus=1

I have a sample sheet where various folders are paired with a unique sample ID. On a given line, you would find [Sample A] and [Variable Destination Folder A]. On a different line, you would find [Sample B] and [Variable Destination Folder B], etc.

In other words, the mismatch for the wildcards at the earlier step matches the error message in that it describes a file that should not be created at that point (because the values for {outf} and {sample} are not matched correctly, for different lines "A" and "B").

The methods.txt file is not strictly needed. However, I encountered problems when trying to use a directory as the endpoint, so I copied an extra file and I used that as the endpoint. If it helps, I can share the earlier code. However, for 1 different folder with a smaller number of subfolders to copy and less complicated destination folders, something similar to the current code appeared to work successfully.

I had an earlier version of the code that tried to make sure the shell variables were "local" to each folder. I think the use of "local" caused a problem in itself, with an error message indicating that it can only be used within a function.

However, using a similarly simplified portion of that shell code, the paths were filled in as follows:

        local SHOUT1=[Variable Destination Folder A]
        ...
        cp -R /common/folder/path/[Sample A] $SHOUT1
        
        local TEMPSAMPLE=$(basename /common/folder/path/[Sample A])
        local SHEND=[Variable Destination Folder A]/$TEMPSAMPLE
        ...
        cp ../methods.txt $SHEND

In other words, it looks like the paths for the shell command were correct (all for line "A" in the sample mapping file). I assume this is because the shell commands only use the input values, whereas the mismatch I noticed affects the output and wildcards. Some troubleshooting was added to handle a folder with a space in its name (different parts of the same script need "\ " versus " " to run correctly), but I am excluding those folders to simplify the most immediate troubleshooting. However, I can't run the Snakemake script if I can't specify the output value correctly.

Any assistance with troubleshooting would be greatly appreciated!

I thought this should be a relatively simple example to start learning Snakemake for what is basically cp -R $INPUTSUBFOLDER $OUTPUTFOLDER, but perhaps there are more complications than I realized.

Sincerely,

Charles

挽清梦 2025-02-18 01:26:54

To me it looks like it pairs the input to your copy_folders rule correctly because you're using an input function that only uses your sample wildcard to get it. For the output, though, there's a mismatch because if you run the Snakefile without specifying another target, it wants all combinations of sample and outf that you specified in rule all.
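
The first point can be checked in plain Python. The sketch below is only an illustration under assumptions: `MAPPING` is a hypothetical stand-in for the `data` DataFrame, the sample and folder names are placeholders, and `SimpleNamespace` stands in for the wildcards object that Snakemake passes to an input function. Because both functions read only `sample`, a job with mismatched wildcards still resolves its inputs from row A.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the sample sheet (one row per sample).
MAPPING = {
    "SampleA": {"Input": "/common/folder/path/SampleA", "Output": "DestA"},
    "SampleB": {"Input": "/common/folder/path/SampleB", "Output": "DestB"},
}

def get_input_folder(wildcards):
    # Looks up the row by sample only; wildcards.outf is never consulted.
    return MAPPING[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    # Same lookup key, so this also ignores wildcards.outf.
    return MAPPING[wildcards.sample]["Output"]

# Wildcards as in the failing job: outf from row B, sample from row A.
wc = SimpleNamespace(outf="DestB", sample="SampleA")

resolved_inputs = (get_input_folder(wc), get_output_folder(wc))
requested_output = f"{wc.outf}/{wc.sample}/methods.txt"

print(resolved_inputs)   # both values come from row A
print(requested_output)  # mixes row B's folder with row A's sample
```

This mirrors the log in the question: the inputs are consistent with row A, while the requested output path combines values from two different rows.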

If you only want to pair [Sample A] with [Variable Destination Folder A] and so on, you'll need to change how Snakemake handles your expand() in rule all.

Right now, what you have is

rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

This pairs all prefixes in OUTPREFIXES with all samples in SAMPLES, which is the standard behavior of expand(). You can specify a different combinatoric function in expand(), though - if you only want to combine the first sample with the first destination, the second with the second etc., your rule all should instead use zip, like so:

rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), zip, outf=OUTPREFIXES, sample=SAMPLES)
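
To see the difference between the two modes without running Snakemake, the following sketch mimics them with `itertools.product` and the built-in `zip`. It is a plain-Python approximation, not Snakemake's own implementation, and the list values and the `target` helper are made-up placeholders.

```python
import os
from itertools import product

# Placeholder values; in the question these come from the sample sheet.
OUTPREFIXES = ["DestA", "DestB"]
SAMPLES = ["SampleA", "SampleB"]

def target(outf, sample):
    # Hypothetical helper that builds one target path.
    return os.path.join(outf, sample, "methods.txt")

# Default expand() behavior: every prefix combined with every sample.
all_combinations = [target(o, s) for o, s in product(OUTPREFIXES, SAMPLES)]

# expand(..., zip, ...): the i-th prefix is paired with the i-th sample only.
paired = [target(o, s) for o, s in zip(OUTPREFIXES, SAMPLES)]

print(len(all_combinations), len(paired))
```

With N rows in the sheet, the default mode requests up to N x N targets, which is where a path such as [Variable Destination Folder B]/[Sample A]/methods.txt comes from; the zipped version requests only the N row-wise pairs.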

戴着白色围巾的女孩 2025-02-18 01:26:54

It would help if you provided a sample sheet and minimal Snakefile to reproduce the error.

From what I can see, you get a missing-file error for [Variable Destination Folder B]/[Sample A]/methods.txt because you don't have code that actually creates that file. Also, it is a bit odd to have outfolder listed in input, but that may be due to something that happens earlier in the pipeline? I would do:

rule copy_folders:
    input:
        infolder = get_input_folder,
    output:
        outfolder = directory(get_output_folder),
        touch(os.path.join("{outf}","{sample}","methods.txt")),
    resources: ...
    shell: ...

I use touch to create the dummy file methods.txt that signals the completion of the rule - there may be other/better ways to handle the situation.

Note that the directory function shouldn't be used in the input directive.
