Snakemake 中的互连变量
假设我有示例 SAMPLE_A
,分为两个文件 SAMPLE_A_1、SAMPLE_A_2
并与条形码 AATT、TTAA
和 SAMPLE_B
关联code>与条形码CCGG、GGCC、GCGC
关联,分为4个文件SAMPLE_B_1...SAMPLE_B_4
。
我可以创建 getSampleNames()
来获取 [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B]
和 [1,2,1,2,3, 4]
,然后压缩它们以获得组合{sample}_{id}
。然后我可以对条形码执行相同的操作:[SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B]
和 [AATT, TTAA,CCGG, GGCC, GCGC]
。
SAMPLES_ID,IDs = getSampleNames()
SAMPLES_BC,BCs = getBCs(set(SAMPLES_ID))
rule refine:
input:
'{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
shell:
"isoseq3 refine {input} "
rule split:
input:
expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
output:
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
shell:
"python {params.script_dir}/split_cells_bam.py"
rule dedup_split:
input:
"{sample}/cells/{barcode}_{sample}/fltnc.bam"
output:
bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
shell:
"isoseq3 dedup {input} {output.bam} "
rule merge:
input:
expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
zip, sample = SAMPLES_BC, barcode = BCs),
如何防止规则拆分成为管道中的瓶颈?目前,它等待对所有样本完成细化规则,虽然没有必要,每个样本都应该独立运行,但我不能,因为每个样本的条形码集都不同。有没有办法获得类似
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, Sample = SAMPLES_BC, Barcode = BCs[SAMPLES_BC])
,其中 SAMPLES_BC
的每个 {sample}
都是 BCs
字典中的键? ID
也一样吗?我知道我可以使用函数,但我不确定如何通过规则传播 {barcode}
Let's say I have sample SAMPLE_A
, divided into two files SAMPLE_A_1, SAMPLE_A_2
and associated to barcodes AATT, TTAA
, and SAMPLE_B
associated to barcodes CCGG, GGCC, GCGC
, divided in 4 files SAMPLE_B_1...SAMPLE_B_4
.
I can create getSampleNames()
to get [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B]
and [1,2,1,2,3,4]
and then zip them to get the combination {sample}_{id}
. And then I can do the same thing for the barcodes: [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B]
and [AATT, TTAA,CCGG, GGCC, GCGC]
.
SAMPLES_ID,IDs = getSampleNames()
SAMPLES_BC,BCs = getBCs(set(SAMPLES_ID))
rule refine:
input:
'{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
shell:
"isoseq3 refine {input} "
rule split:
input:
expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
output:
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
shell:
"python {params.script_dir}/split_cells_bam.py"
rule dedup_split:
input:
"{sample}/cells/{barcode}_{sample}/fltnc.bam"
output:
bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
shell:
"isoseq3 dedup {input} {output.bam} "
rule merge:
input:
expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
zip, sample = SAMPLES_BC, barcode = BCs),
How can I prevent the rule split to be a bottleneck in my pipeline? For now it waits for the refine rule to be done for all samples while it's not necessary, every sample should run independently, but I can't because the set of barcodes is different for each sample. Is there a way to have something like
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC])
, where each {sample}
of SAMPLES_BC
is a key in BCs
dictionary ? And same for IDs
? I know I can use functions, but then I'm not sure how to propagate the {barcode}
through the rules
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
根据您的评论,有几种方法可以采用,其中涉及更改保存样本、条形码和 ID 的数据结构。现在,您可以为每个样本创建一条规则:
您不需要在扩展中使用 zip,因为 ids 和 bcs 用于单个样本。我认为这通常不是最好的方法,但对于您当前的工作流程来说是最简单的。
只需注意您的 shell 命令,您如何将输入/输出传递给脚本?
Based on your comment, there are a few routes to take which would involve changing your data structure holding the samples, barcodes, and ids. For now, you can just create a rule per sample:
You don't need zip in the expand since the ids and bcs are for the single sample. I don't think this is the best way in general, but will be easiest for your current workflow.
Just noticing your shell command, how are you passing the input/output to your script?
我找到了如何通过函数使用字典,这解决了我的问题!
此解决方案的主要默认设置是您必须创建一个虚拟文件作为分割规则的输出,而不是检查是否创建了每个“{sample}/cells/{barcode}_{sample}/fltnc.bam”文件,所以我仍在寻找更优雅的东西......
I found how to use dictionaries through functions, which solved my problem!
The major default of this solution is that you have to create a dummy file as output of the split rule, instead of checking if each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant...