Merging two dataframes based on matching two exchangeable columns in each dataframe
I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match columns A and B of dataframe 1 against columns X and Y of dataframe 2. This is NOT a pairwise comparison: the order within a pair does not matter, so row 1 of dataframe 1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
When a match is found, I would like to add columns C, D, E, F and G of dataframe 1 to the matched row of dataframe 2, with NA where there is no match, as follows:
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i NA NA NA NA NA
1 4 j e e f f e
I only know how to match on a single column. Matching on two exchangeable columns and merging the two data frames based on the result is difficult for me. Please suggest a smart way of doing this.
For ease of discussion (thanks to Vincent and DWin, who commented on my previous question that I should test the code), here is the code for loading dataframes 1 and 2 into R:
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames them to V1 and V2).
This means you can use a straightforward merge to combine the data.
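The code for this answer did not survive on this page, so here is a minimal sketch of what a row-wise sort helper followed by merge could look like. The helper name sort_keys and the intermediate names k1 and k2 are made up for illustration and are not the answerer's own code.

```r
df1 <- data.frame(A = c(1,2,4), B = c(2,3,1), C = c('a','b','e'),
                  D = c('a','b','e'), E = c('a','b','f'),
                  F = c('a','c','f'), G = c('a','c','e'))
df2 <- data.frame(X = c(1,2,3,1), Y = c(2,1,4,4), Z = letters[7:10])

# Hypothetical helper: sort the key columns within each row, so that
# (2, 1) and (1, 2) both become V1 = 1, V2 = 2.
sort_keys <- function(df, cols) {
  sorted <- t(apply(df[, cols], 1, sort))
  data.frame(V1 = unname(sorted[, 1]), V2 = unname(sorted[, 2]))
}

k1 <- cbind(sort_keys(df1, c("A", "B")), df1[, c("C", "D", "E", "F", "G")])
k2 <- cbind(sort_keys(df2, c("X", "Y")), df2)

# Left join on the sorted keys; all.x = TRUE keeps unmatched rows of df2 as NA.
merged <- merge(k2, k1, by = c("V1", "V2"), all.x = TRUE)
result <- merged[, c("X", "Y", "Z", "C", "D", "E", "F", "G")]
print(result)
```

Keep in mind that merge() sorts the result by the key columns by default, so the rows will not necessarily come back in the original order of df2.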
This is slightly clunky and has some potential collision and order issues, but it works with your example
to produce
One approach would be to create an id key for matching that is order invariant.
This produces the output
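The original code is missing here, so this is only a sketch of one way an order-invariant id key could be built. It assumes the key is a string pasted from the smaller and larger value of each pair. The helper make_id and the "_" separator are illustrative choices, not taken from the answer.

```r
df1 <- data.frame(A = c(1,2,4), B = c(2,3,1), C = c('a','b','e'),
                  D = c('a','b','e'), E = c('a','b','f'),
                  F = c('a','c','f'), G = c('a','c','e'))
df2 <- data.frame(X = c(1,2,3,1), Y = c(2,1,4,4), Z = letters[7:10])

# Hypothetical helper: the same id for (a, b) and (b, a).
# The separator keeps pairs such as (1, 23) and (12, 3) apart.
make_id <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "_")

id1 <- make_id(df1$A, df1$B)
id2 <- make_id(df2$X, df2$Y)

# Position in df1 of each df2 row's id; NA where there is no match.
idx <- match(id2, id1)

# Indexing with NA yields an all-NA row, which is what we want here.
result <- cbind(df2, df1[idx, c("C", "D", "E", "F", "G")])
rownames(result) <- NULL
print(result)
```

Because match() is used instead of merge(), the rows stay in the order of df2. If df1 contained the same pair more than once, match() would only pick up the first occurrence.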
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
Produces:
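The code for this answer was lost as well. The wording ("slicing") suggests a package-based solution, but the sketch below sticks to base R so it runs without extra packages. The columns row_id and matched are helper columns invented for this example, and duplicates are reduced per row of df2 rather than per X-Y combination, so that repeated pairs in df2 would be kept.

```r
df1 <- data.frame(A = c(1,2,4), B = c(2,3,1), C = c('a','b','e'),
                  D = c('a','b','e'), E = c('a','b','f'),
                  F = c('a','c','f'), G = c('a','c','e'))
df2 <- data.frame(X = c(1,2,3,1), Y = c(2,1,4,4), Z = letters[7:10])

# Helper columns (assumptions of this sketch): a row id to restore order and
# drop duplicates, and a flag that is NA wherever a join found no match.
d2 <- cbind(df2, row_id = seq_len(nrow(df2)))
d1 <- cbind(df1, matched = TRUE)

# Join both ways: X == A, Y == B and then X == B, Y == A.
direct  <- merge(d2, d1, by.x = c("X", "Y"), by.y = c("A", "B"), all.x = TRUE)
swapped <- merge(d2, d1, by.x = c("X", "Y"), by.y = c("B", "A"), all.x = TRUE)
both <- rbind(direct, swapped)

# Within each df2 row put matched rows first, then keep one row per df2 row.
both <- both[order(both$row_id, is.na(both$matched)), ]
result <- both[!duplicated(both$row_id),
               c("X", "Y", "Z", "C", "D", "E", "F", "G")]
rownames(result) <- NULL
print(result)
```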
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, then you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns, but would be less performant.
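Since the code block for this answer is missing from the page, here is a reconstruction sketch of the pmin()/pmax() approach as described. The key column names K1 and K2 come from the answer; the intermediate names k1, k2 and merged are assumptions.

```r
df1 <- data.frame(A = c(1,2,4), B = c(2,3,1), C = c('a','b','e'),
                  D = c('a','b','e'), E = c('a','b','f'),
                  F = c('a','c','f'), G = c('a','c','e'))
df2 <- data.frame(X = c(1,2,3,1), Y = c(2,1,4,4), Z = letters[7:10])

# K1 is the smaller and K2 the larger value of each pair, computed
# element-wise, so the key no longer depends on the column order.
k1 <- cbind(K1 = pmin(df1$A, df1$B), K2 = pmax(df1$A, df1$B),
            df1[, c("C", "D", "E", "F", "G")])
k2 <- cbind(K1 = pmin(df2$X, df2$Y), K2 = pmax(df2$X, df2$Y), df2)

# Left join on the canonical keys, then drop the helper key columns.
merged <- merge(k2, k1, by = c("K1", "K2"), all.x = TRUE)
result <- merged[, c("X", "Y", "Z", "C", "D", "E", "F", "G")]
print(result)
```

As with any merge() call using the default sort = TRUE, the output is ordered by K1 and K2 rather than by the original row order of df2.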