BWA-MEM and sambamba read group line error

Posted on 2025-02-08 23:35:57 · 1077 characters · 1 view · 0 comments

This is a two-part question:

  1. help interpreting an error;
  2. help with coding.

  1. I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and sort them by position. These are the commands I'm using:
bwa mem  \
  -K 100000000 -v 3 -t 6 -Y \
  -R '\@S200031047L1C001R002\S*[1-2]' \
  /path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
  /path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
  /path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
  /dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort  \
  /dev/stdin \
  --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam

This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.

My sequences were generated with an MGI sequencer and the read groups are identified like this: @S200031047L1C001R0020000243/1, i.e., they don't begin with @RG. How can I tell sambamba that my read groups start with @S and not @RG?


  2. The commands above come from a published pipeline that I'm modifying for my own research. Among several changes, I'm not confident about how to define the sample ID used in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?

Thank you very much!

Comments (1)

原谅我要高飞 2025-02-15 23:35:57

1. Specifying read groups

Your read group string is not correctly formatted. It should look like
'@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL', where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See the read group documentation by the GATK team for an example.

A read group specification always begins with @RG. That's part of the SAM format. Note that the error is raised by bwa mem (bwa_set_rg) when it parses the -R option, not by sambamba. Sequencers do not produce read groups. I think you may be confusing them with FASTQ header lines. Entries in the read group string are separated by tabs, denoted with \t. Tags and their values are separated by :.

The difference between $ID (read group id) and $SM (sample id) is that sample is the individual or biological sample which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group id. Sample and library could make an intuitive read group id in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work in the same project, you should be consistent to avoid problems later.
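
As a rough sketch of the points above, the read group string could be assembled in the shell like this. All values are placeholders: the flowcell and lane are only guessed from the FASTQ read name in the question, and the sample and library names are hypothetical. PL:DNBSEQ is, as far as I know, the platform value the SAM specification lists for MGI/BGI instruments; check that your downstream tools accept it.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace them with the details of your own run and sample.
SAMPLE="sample01"        # biological sample (SM), assumed name
FLOWCELL="S200031047"    # assumed from the read name @S200031047L1C001R002...
LANE="L1"                # assumed from the same read name
LIBRARY="lib1"           # library identifier (LB), assumed name

# Inside double quotes the shell keeps \t as the two characters backslash + t;
# bwa mem turns them into real tabs when it parses -R.
RG="@RG\tID:${FLOWCELL}.${LANE}\tSM:${SAMPLE}\tLB:${LIBRARY}\tPU:${FLOWCELL}.${LANE}\tPL:DNBSEQ"

# Print the string to check it before running the alignment.
printf '%s\n' "$RG"

# It would then be passed to bwa mem in place of the regular expression, e.g.:
#   bwa mem -K 100000000 -v 3 -t 6 -Y -R "$RG" reference.fna \
#       "${SAMPLE}_1.fq.gz" "${SAMPLE}_2.fq.gz"
```

Here the read group ID combines flowcell and lane, so reads of the same sample from another lane or flowcell would get a different ID but the same SM.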

2. Variable substitution

I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.
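
A minimal sketch of one way SAMPLE could be defined, assuming the FASTQ files are named <sample>_1.fq.gz and <sample>_2.fq.gz. The path and file name below are hypothetical; the published pipeline may set the variable differently, for example from a sample sheet.

```shell
#!/usr/bin/env bash
# Hypothetical input file; in practice this could come from a loop over *_1.fq.gz.
FASTQ="/path/to/raw-fastq/sample01_1.fq.gz"

# basename strips the directory and the given suffix, leaving the sample ID.
SAMPLE="$(basename "$FASTQ" _1.fq.gz)"

# The braces mark where the variable name ends.
OUT="host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam"
echo "$OUT"

# Braces matter when the next character could belong to a name:
# "$SAMPLE_1.fq.gz" would be read as a variable called SAMPLE_1.
echo "${SAMPLE}_1.fq.gz"

# The output directory probably has to exist before sorting, e.g.:
#   mkdir -p "host_removal/${SAMPLE}"
```

With this naming, the same SAMPLE value can also be reused for the SM tag of the read group.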
