Executing a downstream Snakemake rule when some outputs of the previous rule fail

Published 2025-01-14 05:11:45 · 2392 words · 2 views · 0 comments

In rule script, I have a few samples that fail, but the majority pass. I would like Snakemake to see these failures and continue with the downstream rule build_script_table. I am not really sure how to do this. Any help would be much appreciated. Currently I have a crude .py script that handles this, but I would like to automate it within the workflow if possible.

rule script:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    threads: 8
    params:
        Toys = config['Toys_dir'],
        db = config['Toys_db'],
    run:
        shell('export PATH={params.Toys}/samtools-0.1.19:$PATH; \
                rm -rf script_out/{wildcards.sampleID}; \
                {params.Toys}/Toys.pl \
                -name {wildcards.sampleID} \
                -o script_out/{wildcards.sampleID} \
                -db {params.db} \
                -p {threads} \
                {input}')

rule script_copy:
    input: rules.script.output
    output: 'script_calls/{sampleID}_out_filtered.tsv'
    run:
        shell('cp {input} {output}')

rule build_script_table:
    input: expand('script_calls/{sampleID}_out_filtered.tsv', sampleID=sampleIDs)
    output: 'tables/all_script.txt'
    params:
        span = config['length'],
    run:
        import pandas  # run: blocks need their own imports unless pandas is imported at the top of the Snakefile

        dfs = []
        for fname in input:
            df = pandas.read_csv(fname, sep='\t')
            if len(df) > 0:
                df['sampleID'] = fname.split('/')[-1].split('_')[0]
                df['Toyscript'] = 1
                df['Match'] = df.apply(lambda row: sorted_Match(row['ToyName1'], row['ToyName2']), axis=1)
                df['supporting_prices'] = df.spanningdates
                df['total_price'] = df['supporting_prices'].groupby(df['Match']).transform('sum') # combine fusions that are A|B and B|A
                df.drop_duplicates('Match', inplace=True) # only keep the first row of each fusion now that support reads are summed
                df = df[df['total_price'] >= params.span] # remove fusions with too few supporting reads (params defines 'span', not 'length')
                scores = list(range(1, len(df) + 1))
                scores.reverse() # you want the fusions with the most reads getting the highest score
                df.sort_values(by=['total_price'], ascending=False, inplace=True)
                df['script_rank'] = scores
                df['script_score'] = df['script_rank'].apply(lambda x: float(x)/len(df)) # percent scores for each fusion with 1 being top fusion
                dfs.append(df)

        dfsc = pandas.concat(dfs)
        dfsc.to_csv(output[0], sep='\t', index=False)
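The A|B / B|A combining step in build_script_table can be checked in isolation. A minimal toy example (illustrative data and column names mirroring the rule above, not real Toys.pl output):

```python
import pandas as pd

# Toy data: rows 0 and 1 are the same fusion reported in both orders.
df = pd.DataFrame({
    'ToyName1': ['A', 'B', 'C'],
    'ToyName2': ['B', 'A', 'D'],
    'supporting_prices': [3, 2, 5],
})

# Order-independent key, so A|B and B|A collapse to the same group.
df['Match'] = df.apply(lambda r: '|'.join(sorted([r['ToyName1'], r['ToyName2']])), axis=1)

# Sum support per key, then keep one row per key.
df['total_price'] = df.groupby('Match')['supporting_prices'].transform('sum')
df = df.drop_duplicates('Match')
# 'A|B' now has total_price 5 (3 + 2); 'C|D' keeps its 5.
```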


Comments (2)

凉城 2025-01-21 05:11:45

Perhaps not any more elegant, but I think you could split your workflow in half and invoke snakemake twice.

# partial.smk
rule may_fail:
    output: '{sample}.out'

# rest.smk
def build_table_input(wildcards):
    return expand('{sample}.out', sample=glob_wildcards('{sample}.out').sample)

rule build_table:
    input: build_table_input
    output: 'output.txt'

To run, you execute the partial snakefile with the --keep-going flag, and once that is done you execute the rest:

snakemake -s partial.smk --keep-going ; snakemake -s rest.smk

I think the right way to go about it is to catch the "fail out" happening in the script rule and, instead of throwing an exit 1, create an empty file that you handle separately in the build_table rule.
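That last suggestion can be sketched outside Snakemake. A minimal Python helper (hypothetical name, assuming an acceptable failure should leave an empty placeholder rather than abort):

```python
import pathlib
import subprocess


def run_or_touch(cmd, out_path):
    """Run a shell command; on failure, write an empty placeholder file
    instead of raising, so downstream rules still see an output."""
    result = subprocess.run(cmd, shell=True, capture_output=True)
    if result.returncode != 0:
        pathlib.Path(out_path).write_text('')  # zero-byte file marks an acceptable failure
        return False
    return True
```

Downstream, a rule can then treat a zero-byte input as "this sample failed" and skip it.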


送舟行 2025-01-21 05:11:45

See if this helps. As I say in the comments, write a dummy file in case of an acceptable failure and use it downstream to decide what to do:

rule script_can_fail:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    run:
        import subprocess

        cmd = "samtools ..."
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()

        if p.returncode != 0:
            # Analyze exit code and stderr. If the error can be ignored, write a dummy file
            if 'SOME KNOWN ERROR' in stderr.decode():
                with open(output[0], 'w') as fout:
                    fout.write('FAILED')
            else:
                # This is an error that should not be ignored
                raise Exception(...)


rule script_copy:
    input: rules.script_can_fail.output
    output: 'script_calls/{sampleID}_out_filtered.tsv'
    run:
        exit_status = open(input[0]).readline().strip()
        if exit_status == 'FAILED':
            # Write another dummy file signalling the acceptable error upstream
            shell('echo "FAILED" > {output}')
        else:
            shell("do something else...")


rule build_script_table:
    input: expand('script_calls/{sampleID}_out_filtered.tsv', sampleID=sampleIDs)
    output: 'tables/all_script.txt'
    run:
        import pandas

        for fname in input:
            exit_status = open(fname).readline().strip()
            if exit_status == 'FAILED':
                continue  # skip this file, or do something meaningful with fname
            df = pandas.read_csv(fname, sep='\t')
            # etc...
