请向我解释为什么我在Snakemake中遇到这个错误？我几天都在努力，请告知我怎么了

发布于 2025-02-12 16:05:32 字数 6144 浏览 1 评论 0原文

我在snakemake中写了这条管道来处理我的fastq文件并获得原始计数，但是由于某种原因，我在最后一个规则（featurecounts）中不了解我的错误：我遇到了这个错误：

/mnt/c/users/manso/desktop/hel/pe.py的第175行中的Wildcarderror：输入文件中的通配符无法从输出文件中确定：“示例”

其他规则使用与featurecounts规则相同的输入，因此我不明白为什么它为该特定规则返回此错误。

我真的很感谢您的帮助。

这是我的蛇：

(SAMPLE,FRR) = glob_wildcards("rawReads/{sample}_{frr}.fastq.gz")

rule all:
    input:
        #raw_FASTQC
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
        #raw_MultiQC
        "rawQC/multiqc_report.html",
        #FASTP
        expand("trimmedReads/{sample}_1.fastq.gz", sample=SAMPLE),
        expand("trimmedReads/{sample}_2.fastq.gz", sample=SAMPLE),
        expand("trimmedReads/{sample}_fastp_report.html", sample=SAMPLE),        
        #trimmed_FASTQC
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
        #trimmed_MultiQC
        "trimmedQC/multiqc_report.html",
        #get fa and gtf files
        "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
        "genome/Homo_sapiens.GRCh38.106.gtf.gz",
        #HISAT2_index
        ["index."  + str(i) + ".ht2" for i in range(1,9)],
        #HISAT_align
        expand("aligned/{sample}.bam", sample=SAMPLE),
        #samtools
        expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
        expand("samtools_stats/{sample}.stats.txt", sample=SAMPLE),
        expand("samtools_stats/{sample}.flagstat.txt", sample=SAMPLE),
        #rawCounts
        "raw_Counts"


rule raw_FASTQC:
    input:
        "rawReads/{sample}_{frr}.fastq.gz",
    output:
        html="rawQC/fastqc/{sample}_{frr}_fastqc.html",
        zip= "rawQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: "--quiet"
    log:
        "logs/fastqc/{sample}_{frr}.log"
    threads: 16
    wrapper:
        "v1.7.0/bio/fastqc"


rule raw_MultiQC:
    input:
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
    params:
        path="rawQC/fastqc"
    output:
       "rawQC/multiqc_report.html"
    shell:
        "multiqc --force -n {output} {params.path}" 


rule FASTP:
    input:
         read1="rawReads/{sample}_1.fastq.gz",
         read2="rawReads/{sample}_2.fastq.gz",
    output:
        trimmed1="trimmedReads/{sample}_1.fastq.gz",
        trimmed2="trimmedReads/{sample}_2.fastq.gz",
        report_html= "trimmedReads/{sample}_fastp_report.html",
    threads: 16
    shell:
         " fastp --thread {threads} -i {input.read1} -I {input.read2} -o {output.trimmed1} -O {output.trimmed2} -h {output.report_html} "


rule trimmed_FASTQC:
    input:
        "trimmedReads/{sample}_{frr}.fastq.gz"
    output:
        html="trimmedQC/fastqc/{sample}_{frr}_fastqc.html", 
        zip="trimmedQC/fastqc/{sample}_{frr}_fastqc.zip",  # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: "--quiet"
    log:
        "logs/fastqc/{sample}_{frr}.log"
    threads: 16
    wrapper:
        "v1.7.0/bio/fastqc"
        

rule trimmed_MultiQC:
    input:
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
    params:
        path="trimmedQC/fastqc"
    output:
       "trimmedQC/multiqc_report.html"
    shell:
        "multiqc --force -n {output} {params.path} "


#Get annotation GTF
rule get_genome_gtf:
    "Downloading Genome annotation file from Ensemble, Homo sapiens primary assembly (GRCh38)"
    output:
        gtf = "genome/Homo_sapiens.GRCh38.106.gtf.gz"
    shell:
        "cd genome"
        " && wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz"
        " && gunzip -k Homo_sapiens.GRCh38.106.gtf.gz "


# Get genome fa
rule get_genome_fa:
    "Downloading Genome sequence, Homo sapiens primary assembly (GRCh38)"
    output:
        fa = "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
    shell:
        "cd genome"
        " && wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
        " && gunzip -k Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa "


rule HISAT2_index:
    input:
        fa = rules.get_genome_fa.output.fa
    output:
        ["index." + str(i) + ".ht2" for i in range(1,9)],
    message:
        "indexing genome"
    threads: 16
    shell:
        " hisat2-build -p {threads} {input.fa} index --quiet"


rule HISAT2_align:
    input:
        read1=rules.FASTP.output.trimmed1,
        read2=rules.FASTP.output.trimmed2,
        index=rules.HISAT2_index.output
    output:
        bam="aligned/{sample}.bam",
        metrics="logs/{sample}_HISATmetrics.txt"
    threads: 16
    shell:
        " hisat2 --threads {threads} -x index -1 {input.read1} -2 {input.read2} 2> {output.metrics}"
        " | samtools view -Sbh -o {output.bam} "


rule samtools_sort:
    input:
        aligned=rules.HISAT2_align.output.bam
        #"aligned/{sample}.bam"
    output:
        "aligned/{sample}.sorted.bam"
    threads: 8 
    shell:
       "samtools sort {input.aligned} -o {output}"


rule samtools_stats:
    input:
        "aligned/{sample}.sorted.bam",
    output:
        "samtools_stats/{sample}.stats.txt",
    shell:
       "samtools stats {input} > {output} "

rule samtools_flagstat:
    input:
        "aligned/{sample}.sorted.bam",
    output:
        "samtools_stats/{sample}.flagstat.txt",
    shell:
        "samtools flagstat {input} > {output} "
        

rule featureCounts:
    input:
        samples="aligned/{sample}.sorted.bam",   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"
    threads:
        16
    shell:
        "featureCounts -T {threads} -a {input.gtf} -o {output} {input.samples}"
´´´

原文

I wrote this pipeline in snakemake to process my fastq files and get the raw counts but for some reason that I don't understand in the last rule (featurecounts) I get this error:

WildcardError in line 175 of /mnt/c/Users/manso/Desktop/hel/pe.py:
Wildcards in input files cannot be determined from output files: 'sample'

Other rules use the same input as featureCounts rule so I don't understand why it returns this error for that specific rule.

I'd really appreciate your help.

Here is my snakefile:

(SAMPLE,FRR) = glob_wildcards("rawReads/{sample}_{frr}.fastq.gz")

rule all:
    input:
        #raw_FASTQC
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
        #raw_MultiQC
        "rawQC/multiqc_report.html",
        #FASTP
        expand("trimmedReads/{sample}_1.fastq.gz", sample=SAMPLE),
        expand("trimmedReads/{sample}_2.fastq.gz", sample=SAMPLE),
        expand("trimmedReads/{sample}_fastp_report.html", sample=SAMPLE),        
        #trimmed_FASTQC
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
        #trimmed_MultiQC
        "trimmedQC/multiqc_report.html",
        #get fa and gtf files
        "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
        "genome/Homo_sapiens.GRCh38.106.gtf.gz",
        #HISAT2_index
        ["index."  + str(i) + ".ht2" for i in range(1,9)],
        #HISAT_align
        expand("aligned/{sample}.bam", sample=SAMPLE),
        #samtools
        expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
        expand("samtools_stats/{sample}.stats.txt", sample=SAMPLE),
        expand("samtools_stats/{sample}.flagstat.txt", sample=SAMPLE),
        #rawCounts
        "raw_Counts"


rule raw_FASTQC:
    input:
        "rawReads/{sample}_{frr}.fastq.gz",
    output:
        html="rawQC/fastqc/{sample}_{frr}_fastqc.html",
        zip= "rawQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: "--quiet"
    log:
        "logs/fastqc/{sample}_{frr}.log"
    threads: 16
    wrapper:
        "v1.7.0/bio/fastqc"


rule raw_MultiQC:
    input:
        expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
    params:
        path="rawQC/fastqc"
    output:
       "rawQC/multiqc_report.html"
    shell:
        "multiqc --force -n {output} {params.path}" 


rule FASTP:
    input:
         read1="rawReads/{sample}_1.fastq.gz",
         read2="rawReads/{sample}_2.fastq.gz",
    output:
        trimmed1="trimmedReads/{sample}_1.fastq.gz",
        trimmed2="trimmedReads/{sample}_2.fastq.gz",
        report_html= "trimmedReads/{sample}_fastp_report.html",
    threads: 16
    shell:
         " fastp --thread {threads} -i {input.read1} -I {input.read2} -o {output.trimmed1} -O {output.trimmed2} -h {output.report_html} "


rule trimmed_FASTQC:
    input:
        "trimmedReads/{sample}_{frr}.fastq.gz"
    output:
        html="trimmedQC/fastqc/{sample}_{frr}_fastqc.html", 
        zip="trimmedQC/fastqc/{sample}_{frr}_fastqc.zip",  # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: "--quiet"
    log:
        "logs/fastqc/{sample}_{frr}.log"
    threads: 16
    wrapper:
        "v1.7.0/bio/fastqc"
        

rule trimmed_MultiQC:
    input:
        expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
    params:
        path="trimmedQC/fastqc"
    output:
       "trimmedQC/multiqc_report.html"
    shell:
        "multiqc --force -n {output} {params.path} "


#Get annotation GTF
rule get_genome_gtf:
    "Downloading Genome annotation file from Ensemble, Homo sapiens primary assembly (GRCh38)"
    output:
        gtf = "genome/Homo_sapiens.GRCh38.106.gtf.gz"
    shell:
        "cd genome"
        " && wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz"
        " && gunzip -k Homo_sapiens.GRCh38.106.gtf.gz "


# Get genome fa
rule get_genome_fa:
    "Downloading Genome sequence, Homo sapiens primary assembly (GRCh38)"
    output:
        fa = "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
    shell:
        "cd genome"
        " && wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
        " && gunzip -k Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa "


rule HISAT2_index:
    input:
        fa = rules.get_genome_fa.output.fa
    output:
        ["index." + str(i) + ".ht2" for i in range(1,9)],
    message:
        "indexing genome"
    threads: 16
    shell:
        " hisat2-build -p {threads} {input.fa} index --quiet"


rule HISAT2_align:
    input:
        read1=rules.FASTP.output.trimmed1,
        read2=rules.FASTP.output.trimmed2,
        index=rules.HISAT2_index.output
    output:
        bam="aligned/{sample}.bam",
        metrics="logs/{sample}_HISATmetrics.txt"
    threads: 16
    shell:
        " hisat2 --threads {threads} -x index -1 {input.read1} -2 {input.read2} 2> {output.metrics}"
        " | samtools view -Sbh -o {output.bam} "


rule samtools_sort:
    input:
        aligned=rules.HISAT2_align.output.bam
        #"aligned/{sample}.bam"
    output:
        "aligned/{sample}.sorted.bam"
    threads: 8 
    shell:
       "samtools sort {input.aligned} -o {output}"


rule samtools_stats:
    input:
        "aligned/{sample}.sorted.bam",
    output:
        "samtools_stats/{sample}.stats.txt",
    shell:
       "samtools stats {input} > {output} "

rule samtools_flagstat:
    input:
        "aligned/{sample}.sorted.bam",
    output:
        "samtools_stats/{sample}.flagstat.txt",
    shell:
        "samtools flagstat {input} > {output} "
        

rule featureCounts:
    input:
        samples="aligned/{sample}.sorted.bam",   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"
    threads:
        16
    shell:
        "featureCounts -T {threads} -a {input.gtf} -o {output} {input.samples}"
´´´

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

朮生 2025-02-19 16:05:33

Snakemake使用输出中的模式来推断要使用的输入。在最后一个规则中，输出为raw_counts，它没有指出{sample> {sample} Wildcard的使用。将其更改为这样的东西可能适用于您的用例：

rule featureCounts:
    input:
        samples="aligned/{sample}.sorted.bam",   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts_{sample}.txt"

这需要将扩展的版本添加到规则all：

    # add this target to rule all
    expand("raw_Counts_{sample}.txt", sample=SAMPLE),

edit：如果此规则作为汇总打算，则在输入指令中，您将需要通过替换所有值来删除通配符搜索。

rule featureCounts:
    input:
        samples=expand("aligned/{sample}.sorted.bam", sample=SAMPLE),   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"

编辑2：请注意，glob_wildcards不会返回每个通配符的唯一值，而是与每个GlobBed文件关联的通配符。如果您想要唯一的值，那么实现这一目标的一种简单方法是将示例转换为集合（专门针对此规则）。

rule featureCounts:
    input:
        samples=expand("aligned/{sample}.sorted.bam", sample=set(SAMPLE)),   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"

Snakemake uses pattern in the output to infer which inputs to use. In the last rule, output is raw_Counts, which gives no indication of what to use for {sample} wildcard. Changing it to something like this might work for your use case:

rule featureCounts:
    input:
        samples="aligned/{sample}.sorted.bam",   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts_{sample}.txt"

This will require adding the expanded version to rule all:

    # add this target to rule all
    expand("raw_Counts_{sample}.txt", sample=SAMPLE),

Edit: if this rule is intended as an aggregate, then in the input directive you will want to remove wildcard search by substituting all the values.

rule featureCounts:
    input:
        samples=expand("aligned/{sample}.sorted.bam", sample=SAMPLE),   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"

Edit 2: note that glob_wildcards does not return unique values for each wildcard, but rather the wildcards associated with each globbed file. If you want unique values, then one easy way to achieve that is to convert SAMPLE to a set (specifically for this rule).

rule featureCounts:
    input:
        samples=expand("aligned/{sample}.sorted.bam", sample=set(SAMPLE)),   
        gtf=rules.get_genome_gtf.output.gtf
    output:
        "raw_Counts"

回复收藏 0 原文

~没有更多了~