使用 R 合并不同排序的表

发布于 2024-11-03 01:15:11 字数 392 浏览 0 评论 0 原文

我有两张大桌子。一个包含标识符(未排序),另一个包含标识符列表(包含第一个表中的所有标识符)以及一个变量的关联值。 我想向第一个表添加一列,其中包含第二个表中的关联值。有没有一种聪明的方法来继续使用 R 的实现功能?

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

结果应该是

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

提前感谢

I have 2 big tables. One with identifiers (unsorted), another with a list of identifiers (containing all which are in the first table) plus the associated values for one variable.
I want to add a column to my first table containing the associated values that are in the second table. Is there an smart way to proceed using implemented functions of R ?

i.e.

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

and the result should be

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

Thanks in advance

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

眼泪也成诗 2024-11-10 01:15:11

那么 id 列在两个表中都存在吗?您可以将它们合并在一起:merge(table1, table2, sort = FALSE)。有很多合并选项可供探索,可让您模拟不同类型的联接,类似于 SQL 中的内联接、左联接、右联接和外联接。我在此处添加了附加参数 sort 以保留 table1 的原始顺序。

如果表 1 中有 id,但表 2 中没有,并且您想要显示这些 NA,请添加 all.x = TRUE 作为参数。这相当于左连接。 all.y 是右连接,all = TRUE 相当于完全外连接。

可重现的例子:

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293

So the id column is in both tables? You can merge them together: merge(table1, table2, sort = FALSE). There are lots of options to explore for merge that let you emulate different types of joins, similar to inner, left, right, and outer joins in SQL. I added the additional parameter sort here to preserve the original order of table1.

If there are ids in table1 but not table 2 and you want to show NAs for those, add all.x = TRUE as a parameter. This is equivalent to a left join. all.y is a right join, and all = TRUE is equivalent to a full outer join.

Reproducible example:

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293
孤星 2024-11-10 01:15:11

这是一种 data.table 方法,以防数据很大且速度成为问题。更多信息请参考?data.table的帮助页面:

当 i 是一个 data.table 时,x(即外部 data.table)必须有一个
钥匙。 i(即内部 data.table)使用键连接到 x 并
返回 x 中匹配的行。执行等值连接
i 中的每一列到 x 的键中的每一列之间。这场比赛是一场
在编译后的 C 中进行二分查找,时间复杂度为 O(log n)。如果我的列数较少
比 x 的键大,那么 x 的许多行可能与 i 的每一行匹配。如果我有
比 x 的键更多的列,i 的列不参与
join 包含在结果中。如果我也有一把钥匙,那就是我的钥匙
用于匹配 x 的键列和二进制合并的列
执行两个表的比较。

请注意,我对 Chase 提供的示例数据进行了一些调整,以使 data.table 中的匹配的某些点更加明显:

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

强制性速度测试:

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

以及我通常的免责声明: 我仍在学习很多东西以及有关此软件包的信息,因此您可以在软件包主页找到更多信息。

Here is a data.table way of doing this, in case the data is big and speed is an issue. For more information, refer to the help page of ?data.table:

When i is a data.table, x (that is the outer data.table) must have a
key. i (that is the inner data.table) is joined to x using the key and
the rows in x that match are returned. An equi-join is performed
between each column in i to each column in x's key. The match is a
binary search in compiled C in O(log n) time. If i has less columns
than x's key then many rows of x may match to each row of i. If i has
more columns than x's key, the columns of i not involved in the
join are included in the result. If i also has a key, it is i's key
columns that are used to match to x's key columns and a binary merge
of the two tables is carried out.

Note that I adjusted the sample data provided by Chase a little to make certain points about the matching in data.table more obvious:

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

The obligatory speed test:

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

And my usual disclaimer: I'm still learning a lot about this package as well, so you find better information at the package homepage.

错々过的事 2024-11-10 01:15:11

我刚刚实现了一个解决此问题的函数(合并两个 data.frame 对象,同时保持两个对象之一的顺序),您可以在此处查看其使用的代码和示例:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

I just implemented a function that solves this issue (of merging two data.frame objects while keeping the order to be by one of the two objects), you can see the code and examples for its use here:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文