Idiomatic R method for a "left join" of two data frames

Posted on 2024-09-26 11:22:04

I have two data frames that both have a column containing a factor like the following:

> head(test.data)
      var0     var1       date  store
1 109.5678 109.5678 1990-03-30 Store1
2 109.3009 108.4261 1990-06-30 Store1
3 108.8262 106.2517 1990-09-30 Store1
4 108.2443 108.6417 1990-12-30 Store1
5 109.5678 109.5678 1991-03-30 Store1
6 109.3009 108.4261 1991-06-30 Store1
> summary(test.data)
      var0             var1              date                store   
 Min.   : -9.72   Min.   : -2.297   Min.   :1990-03-30   Store1 : 8  
 1st Qu.: 68.32   1st Qu.: 71.305   1st Qu.:1990-09-07   Store2 : 8  
 Median :102.19   Median :101.192   Median :1991-02-13   Store3 : 8  
 Mean   :101.09   Mean   :103.042   Mean   :1991-02-13   Store4 : 8  
 3rd Qu.:151.22   3rd Qu.:151.940   3rd Qu.:1991-07-23   Store5 : 8  
 Max.   :196.55   Max.   :201.099   Max.   :1991-12-30   Store6 : 8  
                                                         (Other):48  
>
> head(test.clusters)
   store cluster
1 Store1       A
2 Store2       C
3 Store3       A
4 Store4       B
5 Store5       D
6 Store6       A
>
> summary(test.clusters)
     store   cluster
 Store1 :1   A:5    
 Store2 :1   B:4    
 Store3 :1   C:2    
 Store4 :1   D:1    
 Store5 :1          
 Store6 :1          
 (Other):6          
>

I want to add a column to test.data containing each row's cluster, based on their shared store. I am currently doing this using a doubly nested loop:

new_col <- rep(test.clusters$cluster[1], length(test.data$store))
for (i in seq(test.data$store)){
  for (j in seq(test.clusters$store)){
    if (test.data$store[i] == test.clusters$store[j]){
      new_col[i] <- test.clusters$cluster[j]
      break
    }       
  }
}
test.data$cluster <- new_col

This is very verbose, grossly inefficient and frankly ugly. What is the idiomatic method for doing this in R?

对岸观火 2024-10-03 11:22:04

Use the join function from plyr. It uses match internally, so it should be fast (4-10 times faster than merge).
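
A minimal sketch of what this suggestion might look like, assuming the plyr package is installed. The two small data frames below are hypothetical stand-ins for the question's test.data and test.clusters, not the asker's real data.

```r
library(plyr)

# Hypothetical miniature versions of the question's data frames.
test.data <- data.frame(
  var0  = c(109.5678, 109.3009, 108.8262),
  store = factor(c("Store1", "Store2", "Store1"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# Left join: keep every row of test.data and attach the matching cluster.
joined <- join(test.data, test.clusters, by = "store", type = "left")
print(joined)
```

With type = "left" the result keeps the rows of the first data frame, which is what the nested loop in the question was trying to achieve.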

今天小雨转甜 2024-10-03 11:22:04

Use the merge function.
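
As a rough illustration of this answer, a left join with base R's merge could be written as below. The data frames are hypothetical miniatures of the ones in the question; Store3 is deliberately missing from test.clusters to show what happens to unmatched rows.

```r
# Hypothetical miniature versions of the question's data frames.
test.data <- data.frame(
  var0  = c(109.5678, 50.25, 108.8262),
  store = factor(c("Store2", "Store1", "Store3"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# all.x = TRUE keeps every row of test.data (a left join);
# rows with no matching store get NA in the cluster column.
merged <- merge(test.data, test.clusters, by = "store", all.x = TRUE)
print(merged)
```

Note that merge puts the key column first and, by default, sorts the result by the key, so the row order of test.data is not preserved.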

属性 2024-10-03 11:22:04

I recommend match. It will be much faster than merge.

It should look like this:

test.data$cluster <- test.clusters$cluster[match(test.data$store, test.clusters$store)]
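
To see why the one-liner works, it may help to look at the intermediate index vector. The sketch below uses hypothetical miniature data frames; Store3 is deliberately absent from test.clusters.

```r
# Hypothetical miniature versions of the question's data frames.
test.data <- data.frame(
  var0  = c(109.5678, 50.25, 108.8262),
  store = factor(c("Store2", "Store1", "Store3"))
)
test.clusters <- data.frame(
  store   = factor(c("Store1", "Store2")),
  cluster = factor(c("A", "C"))
)

# For each store in test.data, find the row position of that store
# in test.clusters; stores with no match give NA.
idx <- match(test.data$store, test.clusters$store)
print(idx)

# Indexing the cluster column by those positions lines the clusters
# up with the rows of test.data, without changing their order.
test.data$cluster <- test.clusters$cluster[idx]
print(test.data)
```

Unlike the loop in the question, which leaves unmatched rows holding the first cluster as a default, this approach marks them as NA.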

About the author: 久随
