Bio::DB::Sam - 获取 bam 文件中所有读取的映射数量

发布于 2024-12-27 14:01:58 字数 1031 浏览 2 评论 0原文

我想计算转录本表达式,因此我需要获取 bam 文件中所有读取的映射数量。我当前的程序是查看整个转录本并使用 Bio::DB::Sam 获取映射到其上的读数。结果存储在哈希中,其中 read_name 作为键(10 个字母),number_of_mappings 作为值(整数)。

这是我正在使用的代码:

use Bio::DB:Sam;
use strict;

my %global_read_occurrences;


sub getGlobalReadOccurrences {

 my ($ids, $bam_file) = @_;

 $sam = Bio::DB::Sam -> new (-bam => $bam_file);

 foreach my $id (@{$ids}){
   my $alignments = $sam -> get_features_by_location(-seq_id => $transcript_id, -iterator => 1);


  while (my $alignment = $alignments -> next_seq){

   my $read_name = $alignment -> query -> name;

   if (exists($global_read_occurrences{$read_name})){
    $global_read_occurrences{$read_name}++;
   }
   else {
    $global_read_occurrences{$read_name} = 1;
   }
  }
 }
}

我的问题: 是否还有其他可能性,我可以直接获取每次读取的全局映射数量,并且无需查看所有转录本?我在 Bio::DB::Sam 中找不到任何像 $sam -> 那样的子项 -> getNumberOfMappings($read_name);

我使用的 bam 文件具有超过 5000 万个映射读取,因此哈希将需要大量内存资源(有时约为 40 GB)这实际上可能吗还是来自其他地方?还有其他可能以更少的内存存储数据吗?

多谢!

I want to calculate transcript expressions and therefore I need to get the number of mappings for all reads in the bam file. My current procedure is to go overall transcripts and get the reads that mapped on it using Bio::DB::Sam. The results are stored in a hash with read_name as key (10 letters) and number_of_mappings as value (integer).

Here is the code that I am using:

use Bio::DB:Sam;
use strict;

my %global_read_occurrences;


sub getGlobalReadOccurrences {

 my ($ids, $bam_file) = @_;

 $sam = Bio::DB::Sam -> new (-bam => $bam_file);

 foreach my $id (@{$ids}){
   my $alignments = $sam -> get_features_by_location(-seq_id => $transcript_id, -iterator => 1);


  while (my $alignment = $alignments -> next_seq){

   my $read_name = $alignment -> query -> name;

   if (exists($global_read_occurrences{$read_name})){
    $global_read_occurrences{$read_name}++;
   }
   else {
    $global_read_occurrences{$read_name} = 1;
   }
  }
 }
}

My questions:
Is there any other possibility where I can get the number of global mappings per read directly and where I don't have to go over all transcripts? I couldn't find any subs in Bio::DB::Sam like $sam -> getNumberOfMappings($read_name);

I am using bam files with more than 50 million mapped reads, so the hash is going to need huge memory resources (sometimes about 40 GB) Is this actually possible or does this come from somewhere else? And is there any other possibility to store the data with less mem?

Thanks a lot!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

野稚 2025-01-03 14:01:58

BAM 文件通常按染色体位置而不是按片段名称排序,因此片段映射可以位于文件中的任何位置。您要做的最简单的事情就是转到 SAM 文件并运行一个简单的 shell 命令:

 cut -f1,1 myfile.sam | sort | uniq -c

这将生成如下文件:

  2 HWI-EAS299_4_30M2BAAXX:2:99:965:826
  2 HWI-EAS299_4_30M2BAAXX:2:99:966:1932
  2 HWI-EAS299_4_30M2BAAXX:2:99:971:146
  2 HWI-EAS299_4_30M2BAAXX:2:9:997:1263
  2 HWI-EAS299_4_30M2BAAXX:2:99:972:281
  2 HWI-EAS299_4_30M2BAAXX:2:99:973:1904
  1 HWI-EAS299_4_30M2BAAXX:2:99:976:186
  2 HWI-EAS299_4_30M2BAAXX:2:99:986:687
  6 HWI-EAS299_4_30M2BAAXX:2:99:987:165
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:1582
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:160
  2 HWI-EAS299_4_30M2BAAXX:2:99:998:1139

第一列是映射计数。第二个是读取的名称。

BAM files are usually sorted by chromosomal location and not by the name of the read, so the read mappings can be located anywhere in the file. The easiest thing for you to do is to go to the SAM file and run a simple shell command:

 cut -f1,1 myfile.sam | sort | uniq -c

This will produce a file like this:

  2 HWI-EAS299_4_30M2BAAXX:2:99:965:826
  2 HWI-EAS299_4_30M2BAAXX:2:99:966:1932
  2 HWI-EAS299_4_30M2BAAXX:2:99:971:146
  2 HWI-EAS299_4_30M2BAAXX:2:9:997:1263
  2 HWI-EAS299_4_30M2BAAXX:2:99:972:281
  2 HWI-EAS299_4_30M2BAAXX:2:99:973:1904
  1 HWI-EAS299_4_30M2BAAXX:2:99:976:186
  2 HWI-EAS299_4_30M2BAAXX:2:99:986:687
  6 HWI-EAS299_4_30M2BAAXX:2:99:987:165
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:1582
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:160
  2 HWI-EAS299_4_30M2BAAXX:2:99:998:1139

The first column is the count of mappings. The second is the read name.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文