请向我解释为什么我在Snakemake中遇到这个错误?我几天都在努力,请告知我怎么了
我在snakemake中写了这条管道来处理我的fastq文件并获得原始计数,但是由于某种原因,我在最后一个规则(featurecounts)中不了解我的错误:我遇到了这个错误:
/mnt/c/users/manso/desktop/hel/pe.py的第175行中的Wildcarderror: 输入文件中的通配符无法从输出文件中确定:“示例”
其他规则使用与featurecounts规则相同的输入,因此我不明白为什么它为该特定规则返回此错误。
我真的很感谢您的帮助。
这是我的蛇:
(SAMPLE,FRR) = glob_wildcards("rawReads/{sample}_{frr}.fastq.gz")
rule all:
input:
#raw_FASTQC
expand("rawQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#raw_MultiQC
"rawQC/multiqc_report.html",
#FASTP
expand("trimmedReads/{sample}_1.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_2.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_fastp_report.html", sample=SAMPLE),
#trimmed_FASTQC
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#trimmed_MultiQC
"trimmedQC/multiqc_report.html",
#get fa and gtf files
"genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
"genome/Homo_sapiens.GRCh38.106.gtf.gz",
#HISAT2_index
["index." + str(i) + ".ht2" for i in range(1,9)],
#HISAT_align
expand("aligned/{sample}.bam", sample=SAMPLE),
#samtools
expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
expand("samtools_stats/{sample}.stats.txt", sample=SAMPLE),
expand("samtools_stats/{sample}.flagstat.txt", sample=SAMPLE),
#rawCounts
"raw_Counts"
rule raw_FASTQC:
input:
"rawReads/{sample}_{frr}.fastq.gz",
output:
html="rawQC/fastqc/{sample}_{frr}_fastqc.html",
zip= "rawQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule raw_MultiQC:
input:
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="rawQC/fastqc"
output:
"rawQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path}"
rule FASTP:
input:
read1="rawReads/{sample}_1.fastq.gz",
read2="rawReads/{sample}_2.fastq.gz",
output:
trimmed1="trimmedReads/{sample}_1.fastq.gz",
trimmed2="trimmedReads/{sample}_2.fastq.gz",
report_html= "trimmedReads/{sample}_fastp_report.html",
threads: 16
shell:
" fastp --thread {threads} -i {input.read1} -I {input.read2} -o {output.trimmed1} -O {output.trimmed2} -h {output.report_html} "
rule trimmed_FASTQC:
input:
"trimmedReads/{sample}_{frr}.fastq.gz"
output:
html="trimmedQC/fastqc/{sample}_{frr}_fastqc.html",
zip="trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule trimmed_MultiQC:
input:
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="trimmedQC/fastqc"
output:
"trimmedQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path} "
#Get annotation GTF
rule get_genome_gtf:
"Downloading Genome annotation file from Ensemble, Homo sapiens primary assembly (GRCh38)"
output:
gtf = "genome/Homo_sapiens.GRCh38.106.gtf.gz"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz"
" && gunzip -k Homo_sapiens.GRCh38.106.gtf.gz "
# Get genome fa
rule get_genome_fa:
"Downloading Genome sequence, Homo sapiens primary assembly (GRCh38)"
output:
fa = "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
" && gunzip -k Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa "
rule HISAT2_index:
input:
fa = rules.get_genome_fa.output.fa
output:
["index." + str(i) + ".ht2" for i in range(1,9)],
message:
"indexing genome"
threads: 16
shell:
" hisat2-build -p {threads} {input.fa} index --quiet"
rule HISAT2_align:
input:
read1=rules.FASTP.output.trimmed1,
read2=rules.FASTP.output.trimmed2,
index=rules.HISAT2_index.output
output:
bam="aligned/{sample}.bam",
metrics="logs/{sample}_HISATmetrics.txt"
threads: 16
shell:
" hisat2 --threads {threads} -x index -1 {input.read1} -2 {input.read2} 2> {output.metrics}"
" | samtools view -Sbh -o {output.bam} "
rule samtools_sort:
input:
aligned=rules.HISAT2_align.output.bam
#"aligned/{sample}.bam"
output:
"aligned/{sample}.sorted.bam"
threads: 8
shell:
"samtools sort {input.aligned} -o {output}"
rule samtools_stats:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.stats.txt",
shell:
"samtools stats {input} > {output} "
rule samtools_flagstat:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.flagstat.txt",
shell:
"samtools flagstat {input} > {output} "
rule featureCounts:
input:
samples="aligned/{sample}.sorted.bam",
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts"
threads:
16
shell:
"featureCounts -T {threads} -a {input.gtf} -o {output} {input.samples}"
´´´
I wrote this pipeline in snakemake to process my fastq files and get the raw counts but for some reason that I don't understand in the last rule (featurecounts) I get this error:
WildcardError in line 175 of /mnt/c/Users/manso/Desktop/hel/pe.py:
Wildcards in input files cannot be determined from output files: 'sample'
Other rules use the same input as featureCounts rule so I don't understand why it returns this error for that specific rule.
I'd really appreciate your help.
Here is my snakefile:
(SAMPLE,FRR) = glob_wildcards("rawReads/{sample}_{frr}.fastq.gz")
rule all:
input:
#raw_FASTQC
expand("rawQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#raw_MultiQC
"rawQC/multiqc_report.html",
#FASTP
expand("trimmedReads/{sample}_1.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_2.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_fastp_report.html", sample=SAMPLE),
#trimmed_FASTQC
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#trimmed_MultiQC
"trimmedQC/multiqc_report.html",
#get fa and gtf files
"genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
"genome/Homo_sapiens.GRCh38.106.gtf.gz",
#HISAT2_index
["index." + str(i) + ".ht2" for i in range(1,9)],
#HISAT_align
expand("aligned/{sample}.bam", sample=SAMPLE),
#samtools
expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
expand("samtools_stats/{sample}.stats.txt", sample=SAMPLE),
expand("samtools_stats/{sample}.flagstat.txt", sample=SAMPLE),
#rawCounts
"raw_Counts"
rule raw_FASTQC:
input:
"rawReads/{sample}_{frr}.fastq.gz",
output:
html="rawQC/fastqc/{sample}_{frr}_fastqc.html",
zip= "rawQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule raw_MultiQC:
input:
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="rawQC/fastqc"
output:
"rawQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path}"
rule FASTP:
input:
read1="rawReads/{sample}_1.fastq.gz",
read2="rawReads/{sample}_2.fastq.gz",
output:
trimmed1="trimmedReads/{sample}_1.fastq.gz",
trimmed2="trimmedReads/{sample}_2.fastq.gz",
report_html= "trimmedReads/{sample}_fastp_report.html",
threads: 16
shell:
" fastp --thread {threads} -i {input.read1} -I {input.read2} -o {output.trimmed1} -O {output.trimmed2} -h {output.report_html} "
rule trimmed_FASTQC:
input:
"trimmedReads/{sample}_{frr}.fastq.gz"
output:
html="trimmedQC/fastqc/{sample}_{frr}_fastqc.html",
zip="trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule trimmed_MultiQC:
input:
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="trimmedQC/fastqc"
output:
"trimmedQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path} "
#Get annotation GTF
rule get_genome_gtf:
"Downloading Genome annotation file from Ensemble, Homo sapiens primary assembly (GRCh38)"
output:
gtf = "genome/Homo_sapiens.GRCh38.106.gtf.gz"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz"
" && gunzip -k Homo_sapiens.GRCh38.106.gtf.gz "
# Get genome fa
rule get_genome_fa:
"Downloading Genome sequence, Homo sapiens primary assembly (GRCh38)"
output:
fa = "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
" && gunzip -k Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa "
rule HISAT2_index:
input:
fa = rules.get_genome_fa.output.fa
output:
["index." + str(i) + ".ht2" for i in range(1,9)],
message:
"indexing genome"
threads: 16
shell:
" hisat2-build -p {threads} {input.fa} index --quiet"
rule HISAT2_align:
input:
read1=rules.FASTP.output.trimmed1,
read2=rules.FASTP.output.trimmed2,
index=rules.HISAT2_index.output
output:
bam="aligned/{sample}.bam",
metrics="logs/{sample}_HISATmetrics.txt"
threads: 16
shell:
" hisat2 --threads {threads} -x index -1 {input.read1} -2 {input.read2} 2> {output.metrics}"
" | samtools view -Sbh -o {output.bam} "
rule samtools_sort:
input:
aligned=rules.HISAT2_align.output.bam
#"aligned/{sample}.bam"
output:
"aligned/{sample}.sorted.bam"
threads: 8
shell:
"samtools sort {input.aligned} -o {output}"
rule samtools_stats:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.stats.txt",
shell:
"samtools stats {input} > {output} "
rule samtools_flagstat:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.flagstat.txt",
shell:
"samtools flagstat {input} > {output} "
rule featureCounts:
input:
samples="aligned/{sample}.sorted.bam",
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts"
threads:
16
shell:
"featureCounts -T {threads} -a {input.gtf} -o {output} {input.samples}"
´´´
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
Snakemake使用输出中的模式来推断要使用的输入。在最后一个规则中,输出为
raw_counts
,它没有指出{sample> {sample}
Wildcard的使用。将其更改为这样的东西可能适用于您的用例:这需要将扩展的版本添加到规则
all
:edit:如果此规则作为汇总打算,则在输入指令中,您将需要通过替换所有值来删除通配符搜索。
编辑2:请注意,
glob_wildcards
不会返回每个通配符的唯一值,而是与每个GlobBed文件关联的通配符。如果您想要唯一的值,那么实现这一目标的一种简单方法是将示例
转换为集合(专门针对此规则)。Snakemake uses pattern in the output to infer which inputs to use. In the last rule, output is
raw_Counts
, which gives no indication of what to use for{sample}
wildcard. Changing it to something like this might work for your use case:This will require adding the expanded version to rule
all
:Edit: if this rule is intended as an aggregate, then in the input directive you will want to remove wildcard search by substituting all the values.
Edit 2: note that
glob_wildcards
does not return unique values for each wildcard, but rather the wildcards associated with each globbed file. If you want unique values, then one easy way to achieve that is to convertSAMPLE
to a set (specifically for this rule).