How can I rewrite an "sapply" command to improve performance?
I have a data.frame named "d" with ~1,300,000 rows and 4 columns, and another data.frame named "gc" with ~12,000 rows and 2 columns (but see the smaller example below).
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )
Here is what "d" looks like:
   gene         val ind exp
1     a  1.38711902  i1  e1
2     b -0.25578496  i1  e1
3     c  0.49331256  i1  e1
4     a -1.38015272  i1  e2
5     b  1.46779219  i1  e2
6     c -0.84946320  i1  e2
7     a  0.01188061  i2  e1
8     b -0.13225808  i2  e1
9     c  0.16508404  i2  e1
10    a  0.70949804  i2  e2
11    b -0.64950167  i2  e2
12    c  0.12472479  i2  e2
And here is "gc":
  gene chr
1    a  c1
2    b  c2
3    c  c3
I want to add a fifth column to "d" by looking up, for each row, the "chr" value in "gc" whose "gene" matches the first column of "d". For the moment I am using sapply:
d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
But on the real data, it takes a "very long" time (I have been running the command under "system.time()" for more than 30 minutes and it still has not finished).
Do you have any idea of how I could rewrite this in a clever way? Or should I consider using plyr, maybe with the "parallel" option (I have four cores on my computer)? In such a case, what would be the best syntax?
Thanks in advance.
Comments (2)
I think you can just use the factor as index:
does the same as:
But is much faster:
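The code that accompanied these three lines is not shown here, so what follows is only a minimal sketch of the idea, not necessarily the exact commands originally posted. It assumes that "gene" is stored as a factor in both data frames (hence the explicit stringsAsFactors = TRUE, since recent R versions no longer convert strings to factors by default) and that "gc" has one row per level, in level order. The names "slow", "fast" and "d_big" are made up for the illustration.

```r
# Small example data from the question, with gene stored as a factor
set.seed(1)
d <- data.frame(
  gene = rep(c("a", "b", "c"), 4),
  val  = rnorm(12),
  ind  = rep(c("i1", "i2"), each = 6),
  exp  = rep(rep(c("e1", "e2"), each = 3), 2),
  stringsAsFactors = TRUE
)
gc <- data.frame(
  gene = c("a", "b", "c"),
  chr  = c("c1", "c2", "c3"),
  stringsAsFactors = TRUE
)

# Row-by-row lookup, as in the question
slow <- sapply(seq_len(nrow(d)), function(x) gc[gc$gene == d[x, 1], ]$chr)

# Vectorised lookup: a factor used as a row index is treated as its
# integer codes, so level k of d$gene picks row k of gc
fast <- gc[d$gene, "chr"]

# Both approaches give the same values
stopifnot(identical(as.character(slow), as.character(fast)))

# Rough timing comparison on a larger copy of the data (hypothetical size)
d_big <- d[rep(seq_len(nrow(d)), 500), ]
print(system.time(
  sapply(seq_len(nrow(d_big)), function(x) gc[gc$gene == d_big[x, 1], ]$chr)
))
print(system.time(gc[d_big$gene, "chr"]))
```

The indexed version does a single vectorised subset instead of one data frame comparison per row, which is where the speed-up comes from.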
Edit:
To expand a bit on my comment. The "gc" data frame requires one row for each level of "gene", in the order of the levels, for this to work:
But it is not hard to fix that:
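One possible way to handle that ordering requirement, sketched under the assumption that every level of "gene" in "d" appears exactly once in "gc": reorder "gc" with match() so that row k corresponds to level k, or skip the factor trick and look the values up by name directly. The shuffled "gc" and the names "gc_sorted" and "chr_by_match" are invented for this example.

```r
# gene is a factor with levels a, b, c
d <- data.frame(
  gene = factor(c("a", "b", "c", "a")),
  val  = c(0.1, 0.2, 0.3, 0.4)
)

# Lookup table deliberately NOT in level order
gc <- data.frame(
  gene = c("c", "a", "b"),
  chr  = c("c3", "c1", "c2")
)

# Option 1: reorder gc so that row k belongs to the k-th level of d$gene,
# after which the factor can safely be used as a row index
gc_sorted <- gc[match(levels(d$gene), gc$gene), ]
d$chr <- gc_sorted[d$gene, "chr"]

# Option 2: match() on the values themselves; no factor or ordering needed
chr_by_match <- gc$chr[match(d$gene, gc$gene)]

print(d)
```

Option 2 also works when "gene" is a plain character column, and it returns NA for genes that are missing from "gc" instead of silently picking the wrong row.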
An alternative solution that does not beat Sasha's approach timing-wise, but is more generalizable and readable, is to simply merge() the two data frames:
I have a slower system, so here are my timings:
The benefit is that you could have multiple keys, fine control over non-matching items, etc.
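As a rough sketch of the merge() approach (not the exact call or timings from this answer), assuming the shared key column is called "gene" in both data frames. The extra gene "z" is invented here to show how all.x = TRUE treats rows with no match.

```r
d <- data.frame(
  gene = c("a", "b", "c", "a", "z"),   # "z" has no entry in gc
  val  = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
gc <- data.frame(
  gene = c("a", "b", "c"),
  chr  = c("c1", "c2", "c3")
)

# Left join on gene: all.x = TRUE keeps every row of d and fills chr with NA
# where gc has no matching gene; by can also be a vector of several keys
m <- merge(d, gc, by = "gene", all.x = TRUE)
print(m)
```

Note that merge() sorts the result on the key column by default, so the rows of "m" are generally not in the same order as the rows of "d".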