A faster way to split strings and count characters using R?

Posted 2024-08-25 04:48:32


I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.

I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:

##
## count the number of GCs in the characters between start and stop
##
gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}

Running Rprof gives me the following output:

> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")

$by.self
                   self.time self.pct total.time total.pct
"gcCount"          77.36     76.8     100.74     100.0
"=="               18.30     18.2      18.30      18.2
"strsplit"          3.58      3.6       3.64       3.6
"+"                 1.14      1.1       1.14       1.1
":"                 0.30      0.3       0.30       0.3
"as.logical"        0.04      0.0       0.04       0.0
"as.character"      0.02      0.0       0.02       0.0

$by.total
               total.time total.pct self.time self.pct
"gcCount"          100.74     100.0     77.36     76.8
"=="                18.30      18.2     18.30     18.2
"strsplit"           3.64       3.6      3.58      3.6
"+"                  1.14       1.1      1.14      1.1
":"                  0.30       0.3      0.30      0.3
"as.logical"         0.04       0.0      0.04      0.0
"as.character"       0.02       0.0      0.02      0.0

$sampling.time
[1] 100.74

Any advice for making this code faster?


流绪微梦 2024-09-01 04:48:32


Better to not split at all, just count the matches:

gcCount2 <-  function(line, st, sp){
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}

That's an order of magnitude faster.

A small C function that just iterates over the characters would be yet another order of magnitude faster.
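As a rough illustration (not from the thread), such a C function could look like the sketch below; the name gc_count and the 1-based start/stop interface are assumptions chosen to mirror the R version:

```c
#include <stddef.h>

/* Count 'G'/'C' characters (either case) between 1-based positions
   st and sp inclusive, mirroring the R function's interface. */
size_t gc_count(const char *s, size_t st, size_t sp) {
    size_t n = 0;
    for (size_t i = st - 1; i < sp; i++) {
        char c = s[i];
        if (c == 'G' || c == 'g' || c == 'C' || c == 'c')
            n++;
    }
    return n;
}
```

From R, a function like this could be hooked up via .C()/.Call(), or more conveniently via Rcpp, as a later answer in this thread does.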

热鲨 2024-09-01 04:48:32


A one liner:

table(strsplit(toupper(a), '')[[1]])
耶耶耶 2024-09-01 04:48:32


I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).

GC content would be:

mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence)  # 0.4761905

That's from a string; you can also read in a FASTA file using "read.fasta()".

梦回梦里 2024-09-01 04:48:32


There's no need to use a loop here.

Try this:

gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
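For completeness, the same "test membership in a set of characters" idea can be done branch-free in C with a 256-entry lookup table (a common trick, not from this answer; the function name is hypothetical):

```c
#include <stddef.h>

/* Count G/C (either case) in the first len characters of s,
   using a precomputed membership table instead of comparisons. */
size_t gc_count_table(const char *s, size_t len) {
    unsigned char is_gc[256] = {0};
    is_gc['G'] = is_gc['g'] = is_gc['C'] = is_gc['c'] = 1;
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += is_gc[(unsigned char)s[i]];
    return n;
}
```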
歌枕肩 2024-09-01 04:48:32


Try this function from the stringi package:

> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5

or you can use the regex version to count upper- and lowercase G and C:

> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12

or you can lowercase the string first with stri_trans_tolower and then use stri_count:

> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"

Timing performance:

> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
                             expr     min     lq  median      uq     max neval
                gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
               gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
 stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100

Another example with a longer string; stri_dup replicates a string n times:

> stri_dup("abc",3)
[1] "abcabcabc"

As you can see, for longer sequences stri_count is faster :)

> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
                             expr       min         lq     median        uq       max neval
          gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
         gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
 stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100
夢归不見 2024-09-01 04:48:32


Thanks to all for this post.

To optimize a script in which I wanted to calculate the GC content of 100M sequences of 200 bp, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count reduced this to 1.5 hours.

In the end I coded it in C++ and called it with Rcpp, which cut the computation time down to 10 minutes!

Here is the C++ code:

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
  int count = 0;

  for (size_t i = 0; i < s.size(); i++)
    if (s[i] == 'G') count++;
    else if (s[i] == 'C') count++;

  float pGC = (float)count / s.size();
  pGC = pGC * 100;
  return pGC;
}

I call it from R by typing:

sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")
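The counting logic itself is independent of R and can be sanity-checked as plain C; this standalone sketch (the name pgc is hypothetical) mirrors pGC_cpp, including its behavior of matching only uppercase letters:

```c
#include <string.h>

/* GC percentage of a NUL-terminated sequence, mirroring pGC_cpp.
   Note: like the original, only uppercase 'G' and 'C' are counted. */
float pgc(const char *s) {
    size_t len = strlen(s);
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == 'G' || s[i] == 'C')
            count++;
    return 100.0f * (float)count / (float)len;
}
```

One caveat worth noting: because only uppercase letters are matched, sequences containing lowercase bases (as in the original question) would need to be uppercased first.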