如何重写“sapply”命令来提高性能？

发布于 2024-10-22 01:57:05 字数 1213 浏览 2 评论 0原文

我有一个名为“d”的 data.frame，约 1,300,000 行和 4 列，另一个名为“gc”的 data.frame，约 12,000 行和 2 列（但请参阅下面的较小示例）。

d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )

这是“d”的样子：

   gene         val ind exp
1     a  1.38711902  i1  e1
2     b -0.25578496  i1  e1
3     c  0.49331256  i1  e1
4     a -1.38015272  i1  e2
5     b  1.46779219  i1  e2
6     c -0.84946320  i1  e2
7     a  0.01188061  i2  e1
8     b -0.13225808  i2  e1
9     c  0.16508404  i2  e1
10    a  0.70949804  i2  e2
11    b -0.64950167  i2  e2
12    c  0.12472479  i2  e2

这是“gc”：

  gene chr
1    a  c1
2    b  c2
3    c  c3

我想通过合并“gc”中与“d”第一列匹配的数据来向“d”添加第五列。目前我正在使用sapply。

d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )

但在真实数据上，它需要“非常长”的时间（我正在使用“system.time()”运行命令超过30分钟，但它仍然没有完成）。

你知道我如何以巧妙的方式重写这个吗？或者我应该考虑使用plyr，也许使用“并行”选项（我的计算机上有四个核心）？在这种情况下，最好的语法是什么？

提前致谢。

原文

I have a data.frame named "d" of ~1,300,000 lines and 4 columns and another data.frame named "gc" of ~12,000 lines and 2 columns (but see the smaller example below).

d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )

Here is how "d" looks like:

   gene         val ind exp
1     a  1.38711902  i1  e1
2     b -0.25578496  i1  e1
3     c  0.49331256  i1  e1
4     a -1.38015272  i1  e2
5     b  1.46779219  i1  e2
6     c -0.84946320  i1  e2
7     a  0.01188061  i2  e1
8     b -0.13225808  i2  e1
9     c  0.16508404  i2  e1
10    a  0.70949804  i2  e2
11    b -0.64950167  i2  e2
12    c  0.12472479  i2  e2

And here is "gc":

  gene chr
1    a  c1
2    b  c2
3    c  c3

I want to add a 5th column to "d" by incorporating data from "gc" that match with the 1st column of "d". For the moment I am using sapply.

d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )

But on the real data, it takes a "very long" time (I am running the command with "system.time()" since more than 30 minutes and it's still not finished).

Do you have any idea of how I could rewrite this in a clever way? Or should I consider using plyr, maybe with the "parallel" option (I have four cores on my computer)? In such a case, what would be the best syntax?

Thanks in advance.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

Spring初心 2024-10-29 01:57:05

我认为您可以使用该因子作为索引：

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

与以下内容相同：

 sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

但速度更快：

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
   5.03    0.00    5.02 
> 
> system.time(replicate(1000,gc[ d[,1], 2]))
   user  system elapsed 
   0.12    0.00    0.13

编辑：

稍微扩展一下我的评论。 gc 数据框需要每个 gene 级别按级别顺序排列一行才能正常工作：

 d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

但解决这个问题并不难：

levels(gc$gene) <- levels(d$gene) # Seems redundant as this is done right quite often automatically
gc <- gc[order(gc$gene),]


gc[ d[,1], 2]
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

I think you can just use the factor as index:

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

does the same as:

 sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

But is much faster:

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
   5.03    0.00    5.02 
> 
> system.time(replicate(1000,gc[ d[,1], 2]))
   user  system elapsed 
   0.12    0.00    0.13

Edit:

To expand a bit on my comment. The gc dataframe requires one row for each level of gene in the order of the levels for this to work:

 d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

But it is not hard to fix that:

levels(gc$gene) <- levels(d$gene) # Seems redundant as this is done right quite often automatically
gc <- gc[order(gc$gene),]


gc[ d[,1], 2]
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

回复收藏 0 原文

嗳卜坏 2024-10-29 01:57:05

另一种解决方案在时间上不会击败 Sasha 的方法，但更具有通用性和可读性，即简单地合并两个数据帧：

d <- merge(d, gc)

我的系统速度较慢，所以这是我的时间安排

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
  11.22    0.12   11.86 
> system.time(replicate(1000,gc[ d[,1], 2])) 
   user  system elapsed 
   0.34    0.00    0.35 
> system.time(replicate(1000, merge(d, gc, by="gene"))) 
   user  system elapsed 
   3.35    0.02    3.40

：是你可以有多个键，对不匹配的项目进行精细控制等。

An alternative solution that does not beat Sasha's approach timing-wise, but is more generalizable and readable, is to simply merge the two data frames:

d <- merge(d, gc)

I have a slower system, so here are my timings:

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
  11.22    0.12   11.86 
> system.time(replicate(1000,gc[ d[,1], 2])) 
   user  system elapsed 
   0.34    0.00    0.35 
> system.time(replicate(1000, merge(d, gc, by="gene"))) 
   user  system elapsed 
   3.35    0.02    3.40

The benefit is that you could have multiple keys, fine control over non-matching items, etc.

回复收藏 0 原文

~没有更多了~