使用 R 合并不同排序的表

发布于 2024-11-03 01:15:11 字数 392 浏览 8 评论 0 原文

我有两张大桌子。一个包含标识符（未排序），另一个包含标识符列表（包含第一个表中的所有标识符）以及一个变量的关联值。我想向第一个表添加一列，其中包含第二个表中的关联值。有没有一种聪明的方法来继续使用 R 的实现功能？

即

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

结果应该是

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

提前感谢

原文

I have 2 big tables. One with identifiers (unsorted), another with a list of identifiers (containing all which are in the first table) plus the associated values for one variable.
I want to add a column to my first table containing the associated values that are in the second table. Is there an smart way to proceed using implemented functions of R ?

i.e.

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

and the result should be

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

Thanks in advance

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

眼泪也成诗 2024-11-10 01:15:11

那么 id 列在两个表中都存在吗？您可以将它们合并在一起：merge(table1, table2, sort = FALSE)。有很多合并选项可供探索，可让您模拟不同类型的联接，类似于 SQL 中的内联接、左联接、右联接和外联接。我在此处添加了附加参数 sort 以保留 table1 的原始顺序。

如果表 1 中有 id，但表 2 中没有，并且您想要显示这些 NA，请添加 all.x = TRUE 作为参数。这相当于左连接。 all.y 是右连接，all = TRUE 相当于完全外连接。

可重现的例子：

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293

So the id column is in both tables? You can merge them together: merge(table1, table2, sort = FALSE). There are lots of options to explore for merge that let you emulate different types of joins, similar to inner, left, right, and outer joins in SQL. I added the additional parameter sort here to preserve the original order of table1.

If there are ids in table1 but not table 2 and you want to show NAs for those, add all.x = TRUE as a parameter. This is equivalent to a left join. all.y is a right join, and all = TRUE is equivalent to a full outer join.

Reproducible example:

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293

回复收藏 0 原文

孤星 2024-11-10 01:15:11

这是一种 data.table 方法，以防数据很大且速度成为问题。更多信息请参考?data.table的帮助页面：

当 i 是一个 data.table 时，x（即外部 data.table）必须有一个
钥匙。 i（即内部 data.table）使用键连接到 x 并
返回 x 中匹配的行。执行等值连接
i 中的每一列到 x 的键中的每一列之间。这场比赛是一场
在编译后的 C 中进行二分查找，时间复杂度为 O(log n)。如果我的列数较少
比 x 的键大，那么 x 的许多行可能与 i 的每一行匹配。如果我有
比 x 的键更多的列，i 的列不参与
join 包含在结果中。如果我也有一把钥匙，那就是我的钥匙
用于匹配 x 的键列和二进制合并的列
执行两个表的比较。

请注意，我对 Chase 提供的示例数据进行了一些调整，以使 data.table 中的匹配的某些点更加明显：

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

强制性速度测试：

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

以及我通常的免责声明：我仍在学习很多东西以及有关此软件包的信息，因此您可以在软件包主页找到更多信息。

Here is a data.table way of doing this, in case the data is big and speed is an issue. For more information, refer to the help page of ?data.table:

When i is a data.table, x (that is the outer data.table) must have a
key. i (that is the inner data.table) is joined to x using the key and
the rows in x that match are returned. An equi-join is performed
between each column in i to each column in x's key. The match is a
binary search in compiled C in O(log n) time. If i has less columns
than x's key then many rows of x may match to each row of i. If i has
more columns than x's key, the columns of i not involved in the
join are included in the result. If i also has a key, it is i's key
columns that are used to match to x's key columns and a binary merge
of the two tables is carried out.

Note that I adjusted the sample data provided by Chase a little to make certain points about the matching in data.table more obvious:

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

The obligatory speed test:

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

And my usual disclaimer: I'm still learning a lot about this package as well, so you find better information at the package homepage.

回复收藏 0 原文