Idiomatic R method for "left joining" two data frames
I have two data frames that both have a column containing a factor like the following:
> head(test.data)
      var0     var1       date  store
1 109.5678 109.5678 1990-03-30 Store1
2 109.3009 108.4261 1990-06-30 Store1
3 108.8262 106.2517 1990-09-30 Store1
4 108.2443 108.6417 1990-12-30 Store1
5 109.5678 109.5678 1991-03-30 Store1
6 109.3009 108.4261 1991-06-30 Store1
> summary(test.data)
      var0             var1                date            store
 Min.   : -9.72   Min.   : -2.297   Min.   :1990-03-30   Store1 : 8
 1st Qu.: 68.32   1st Qu.: 71.305   1st Qu.:1990-09-07   Store2 : 8
 Median :102.19   Median :101.192   Median :1991-02-13   Store3 : 8
 Mean   :101.09   Mean   :103.042   Mean   :1991-02-13   Store4 : 8
 3rd Qu.:151.22   3rd Qu.:151.940   3rd Qu.:1991-07-23   Store5 : 8
 Max.   :196.55   Max.   :201.099   Max.   :1991-12-30   Store6 : 8
                                                         (Other):48
>
> head(test.clusters)
   store cluster
1 Store1       A
2 Store2       C
3 Store3       A
4 Store4       B
5 Store5       D
6 Store6       A
>
> summary(test.clusters)
     store   cluster
 Store1 :1   A:5
 Store2 :1   B:4
 Store3 :1   C:2
 Store4 :1   D:1
 Store5 :1
 Store6 :1
 (Other):6
>
I want to add a column to test.data containing each row's cluster, based on their shared store. I am currently doing this using a doubly nested loop:
new_col <- rep(test.clusters$cluster[1], length(test.data$store))
for (i in seq(test.data$store)){
  for (j in seq(test.clusters$store)){
    if (test.data$store[i] == test.clusters$store[j]){
      new_col[i] <- test.clusters$cluster[j]
      break
    }
  }
}
test.data$cluster <- new_col
This is very verbose, grossly inefficient and frankly ugly. What is the idiomatic method for doing this in R?
Comments (3)
Use the join function from plyr. It uses match internally, so it should be fast (4-10x faster than merge).
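As a rough illustration of this suggestion, here is a minimal sketch of a left join with plyr. The toy data frames are made-up stand-ins for the question's test.data and test.clusters (only the column names store and cluster come from the question), and the sketch assumes the plyr package is installed.

```r
library(plyr)  # assumption: plyr is installed

# Toy stand-ins for the question's data frames; the values are invented.
test.data <- data.frame(
  var0  = c(109.6, 109.3, 55.2, 54.8, 12.1),
  store = factor(c("Store1", "Store1", "Store2", "Store2", "Store3"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# type = "left" keeps every row of test.data, in its original order.
# Stores with no entry in test.clusters (Store3 here) get cluster = NA.
joined <- join(test.data, test.clusters, by = "store", type = "left")
print(joined)
```

Because the row order of the left-hand data frame is kept, the result can replace test.data directly.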
Use the merge function.
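For reference, a minimal sketch of what a merge-based left join might look like. The toy data frames are made-up stand-ins for the question's test.data and test.clusters; only the column names store and cluster come from the question.

```r
# Toy stand-ins for the question's data frames; the values are invented.
test.data <- data.frame(
  var0  = c(109.6, 109.3, 55.2, 54.8, 12.1),
  store = factor(c("Store1", "Store1", "Store2", "Store2", "Store3"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# all.x = TRUE keeps every row of test.data, which makes this a left join.
# Stores with no entry in test.clusters (Store3 here) get cluster = NA.
merged <- merge(test.data, test.clusters, by = "store", all.x = TRUE)
print(merged)
```

Note that merge sorts the result on the key column by default, so the rows may come back in a different order than in test.data.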
I recommend match. It will be much faster than merge. It should look like this:
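The code from this answer did not survive on this page. A minimal sketch of a match-based lookup, assuming the column names from the question and using made-up toy data in place of test.data and test.clusters, could be:

```r
# Toy stand-ins for the question's data frames; the values are invented.
test.data <- data.frame(
  var0  = c(109.6, 109.3, 55.2, 54.8, 12.1),
  store = factor(c("Store1", "Store1", "Store2", "Store2", "Store3"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# For each row of test.data, find the position of its store in test.clusters.
# Stores that are absent from test.clusters yield NA.
idx <- match(test.data$store, test.clusters$store)

# Index the cluster column with those positions; NA positions give NA clusters.
test.data$cluster <- test.clusters$cluster[idx]
print(test.data)
```

This replaces the doubly nested loop with a single vectorized lookup and leaves the row order of test.data unchanged. If test.clusters listed a store more than once, match would use only the first occurrence.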