Snakemake 中的互连变量

发布于 2025-01-10 07:13:13 字数 1796 浏览 1 评论 0原文

假设我有示例 SAMPLE_A，分为两个文件 SAMPLE_A_1、SAMPLE_A_2 并与条形码 AATT、TTAA 和 SAMPLE_B 关联code>与条形码CCGG、GGCC、GCGC关联，分为4个文件SAMPLE_B_1...SAMPLE_B_4。

我可以创建 getSampleNames() 来获取 [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B] 和 [1,2,1,2,3, 4]，然后压缩它们以获得组合{sample}_{id}。然后我可以对条形码执行相同的操作：[SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B] 和 [AATT, TTAA,CCGG, GGCC, GCGC]。

SAMPLES_ID,IDs = getSampleNames()
SAMPLES_BC,BCs = getBCs(set(SAMPLES_ID))

rule refine:
input:
    '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
    bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
shell:
    "isoseq3 refine {input} "


rule split:
input:
    expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
output:
    expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
shell:
    "python {params.script_dir}/split_cells_bam.py"


rule dedup_split:
input:
    "{sample}/cells/{barcode}_{sample}/fltnc.bam"
output:
    bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
shell:
    "isoseq3 dedup {input} {output.bam} "

rule merge:
input:
    expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
        zip, sample = SAMPLES_BC, barcode = BCs),

如何防止规则拆分成为管道中的瓶颈？目前，它等待对所有样本完成细化规则，虽然没有必要，每个样本都应该独立运行，但我不能，因为每个样本的条形码集都不同。有没有办法获得类似

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, Sample = SAMPLES_BC, Barcode = BCs[SAMPLES_BC]) ，其中 SAMPLES_BC 的每个 {sample} 都是 BCs 字典中的键？ ID 也一样吗？我知道我可以使用函数，但我不确定如何通过规则传播 {barcode}

原文

Let's say I have sample SAMPLE_A, divided into two files SAMPLE_A_1, SAMPLE_A_2 and associated to barcodes AATT, TTAA, and SAMPLE_Bassociated to barcodes CCGG, GGCC, GCGC, divided in 4 files SAMPLE_B_1...SAMPLE_B_4.

I can create getSampleNames() to get [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [1,2,1,2,3,4] and then zip them to get the combination {sample}_{id}. And then I can do the same thing for the barcodes: [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [AATT, TTAA,CCGG, GGCC, GCGC].

SAMPLES_ID,IDs = getSampleNames()
SAMPLES_BC,BCs = getBCs(set(SAMPLES_ID))

rule refine:
input:
    '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
    bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
shell:
    "isoseq3 refine {input} "


rule split:
input:
    expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
output:
    expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
shell:
    "python {params.script_dir}/split_cells_bam.py"


rule dedup_split:
input:
    "{sample}/cells/{barcode}_{sample}/fltnc.bam"
output:
    bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
shell:
    "isoseq3 dedup {input} {output.bam} "

rule merge:
input:
    expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
        zip, sample = SAMPLES_BC, barcode = BCs),

How can I prevent the rule split to be a bottleneck in my pipeline? For now it waits for the refine rule to be done for all samples while it's not necessary, every sample should run independently, but I can't because the set of barcodes is different for each sample. Is there a way to have something like

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC]), where each {sample} of SAMPLES_BC is a key in BCs dictionary ? And same for IDs? I know I can use functions, but then I'm not sure how to propagate the {barcode} through the rules

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

懷念過去 2025-01-17 07:13:13

根据您的评论，有几种方法可以采用，其中涉及更改保存样本、条形码和 ID 的数据结构。现在，您可以为每个样本创建一条规则：

for sample in set(SAMPLES_ID):  # get uniq samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]

    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', 
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", 
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"

您不需要在扩展中使用 zip，因为 ids 和 bcs 用于单个样本。我认为这通常不是最好的方法，但对于您当前的工作流程来说是最简单的。

只需注意您的 shell 命令，您如何将输入/输出传递给脚本？

Based on your comment, there are a few routes to take which would involve changing your data structure holding the samples, barcodes, and ids. For now, you can just create a rule per sample:

for sample in set(SAMPLES_ID):  # get uniq samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]

    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', 
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", 
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"

You don't need zip in the expand since the ids and bcs are for the single sample. I don't think this is the best way in general, but will be easiest for your current workflow.

Just noticing your shell command, how are you passing the input/output to your script?

回复收藏 0 原文

茶底世界 2025-01-17 07:13:13

我找到了如何通过函数使用字典，这解决了我的问题！

此解决方案的主要默认设置是您必须创建一个虚拟文件作为分割规则的输出，而不是检查是否创建了每个“{sample}/cells/{barcode}_{sample}/fltnc.bam”文件，所以我仍在寻找更优雅的东西......

IDs = getSampleNames() #{SAMPLE_A:[1,2], SAMPLE_B:[1,2,3,4]}
SAMPLES = list(IDs.keys()) 
BCs = getBCs(SAMPLES) #{SAMPLE_A:[AATT, TTAA], SAMPLE_B:[CCGG,GGCC,GCGC]}
    
# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam', 
               id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
               barcode = BCs[wildcards.sample])

rule refine:
input:
    '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
    bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
input:
    sample2ids
output:
    # cannot use a function here, so I create a dummy file to pipe
    'dummy_file.txt'

rule dedup_split:
input:
    'dummy_file.txt'
output:
    bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",


rule merge:
input:
    sample2bc

I found how to use dictionaries through functions, which solved my problem!

The major default of this solution is that you have to create a dummy file as output of the split rule, instead of checking if each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant...

IDs = getSampleNames() #{SAMPLE_A:[1,2], SAMPLE_B:[1,2,3,4]}
SAMPLES = list(IDs.keys()) 
BCs = getBCs(SAMPLES) #{SAMPLE_A:[AATT, TTAA], SAMPLE_B:[CCGG,GGCC,GCGC]}
    
# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam', 
               id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
               barcode = BCs[wildcards.sample])

rule refine:
input:
    '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
    bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
input:
    sample2ids
output:
    # cannot use a function here, so I create a dummy file to pipe
    'dummy_file.txt'

rule dedup_split:
input:
    'dummy_file.txt'
output:
    bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",


rule merge:
input:
    sample2bc

回复收藏 0 原文

~没有更多了~