如何加入(合并)数据帧(内部,外部,左,右)
给定两个数据帧:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
我该如何执行数据库样式,即>?也就是说,我如何得到:
- noreferrer“> ninen> innin> ninen join
of
df1
和df2
:
仅返回左表在右表中具有匹配键的行。 - an ofter Join > df2 :
从两个表中返回所有行,从右表中具有匹配键的左侧连接记录。 - a 左outer(或简单左JOIN) 和
df2
从左表返回所有行,以及从右表中匹配键的任何行。 - a
df1
df1 和 和 df2
从右表返回所有行,并从左表中返回带有匹配键的任何行。
额外的信用:
我如何执行SQL样式选择语句?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
通过使用
MERGE
函数及其可选参数:内在加入:
Merge(DF1,DF2)
将适用于这些示例是因为r自动通过通用变量名称加入帧,但是您很可能需要指定MERGE(DF1,DF2,by =“ CustomerId”)
以确保您仅在您想要的字段。如果匹配变量在不同的数据帧中具有不同的名称,则还可以使用by.x
和by.y
参数。外部加入:
合并(x = df1,y = df2,by by =“ customerId”,all = true)
左外部:
合并(x = df1,y = df2,by =“ customerid”,all.x = true)
右外部:< /em>
合并(x = df1,y = df2,by =“ customerId”,all.y = true)
交叉加入: < /strong>
合并(x = df1,y = df2,by = null)
就像内在的加入一样,您可能希望将“ customerId”明确地传递给r作为匹配变量。如果输入数据。帧会意外变化,稍后更易于阅读,则更安全。您可以通过提供
vector,例如,
by = c(“ customerId”,“ orderId”)
来合并多个列。如果要合并的列名不相同,则可以指定,例如,
by.x =“ customerid_in_df1”,by.y =“ customerid_in_df2”
其中customeid_in_df1
is第一个数据框架中的列的名称和customerid_in_df2
是第二个数据框中的列的名称。 (如果您需要在多个列上合并,这些也可以是向量。)By using the
merge
function and its optional parameters:Inner join:
merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specifymerge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use theby.x
andby.y
parameters if the matching variables have different names in the different data frames.Outer join:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer:
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer:
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join:
merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.You can merge on multiple columns by giving
by
a vector, e.g.,by = c("CustomerId", "OrderId")
.If the column names to merge on are not the same, you can specify, e.g.,
by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
whereCustomerId_in_df1
is the name of the column in the first data frame andCustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)您也可以使用Hadley Wickham的Awesome dplyr 包装。
突变连接:使用DF2
过滤连接中的匹配点在DF1中添加列:DF1中的行过滤器,请勿修改列
You can do joins as well using Hadley Wickham's awesome dplyr package.
Mutating joins: add columns to df1 using matches in df2
Filtering joins: filter out rows in df1, don't modify columns
我建议您查看 gabor grothendieck sqldf package 在SQL中表达这些操作。
我发现SQL语法比其等效物更简单,更自然(但这可能仅反映了我的RDBMS偏置)。
参见 Gabor的SQLDF GitHub 有关加入的更多信息。
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
有 data.table 内在联接的方法,这是非常时间和内存的效率(对于某些较大的数据。Frames):
MERGE
也适用于数据。表(因为它是通用的,并且调用Merge.data.table
)data.table在Stackoverflow上记录了:
如何进行数据。 br>
翻译SQL与外国钥匙上的sql连接到r数据。表语法
有效合并的较大数据合并的替代方案。
中
,还有另一个选项是
join
函数在 plyr package。 [注:2022年的注:Plyr现在已退休,已被 dplyr 。 ”中的加入操作。href =“ href =
内部
,<代码>左,<代码>右,完整
。来自
?JOIN
:与Merge
不同,[join> join
]保留X的顺序,无论使用哪种JOIN类型。There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
merge
also works on data.tables (as it is generic and callsmerge.data.table
)data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
Yet another option is the
join
function found in the plyr package. [Note from 2022: plyr is now retired and has been superseded by dplyr. Join operations in dplyr are described in this answer.]Options for
type
:inner
,left
,right
,full
.From
?join
: Unlikemerge
, [join
] preserves the order of x no matter what join type is used.在 R Wiki 。我会在这里偷几对夫妇:
合并方法,
因为您的键被命名为同样的简短方法,即在内部加入的短途方法是merge():
完整的内部联接(所有表中的所有记录)可以是用“ ALL”关键字创建:
DF1和DF2的左外连接:
DF1和DF2的右外连接:
您可以翻转'em,slap'em and ub'em down以获取其他两个外部连接:)
下标方法
使用下标方法在左侧的DF1左外连接将是:
外部连接的其他组合可以通过修改左外的联接下标示例来创建。 (是的,我知道这相当于说:“我将其作为读者的练习……”)
There are some good examples of doing this over at the R Wiki. I'll steal a couple here:
Merge Method
Since your keys are named the same the short way to do an inner join is merge():
a full inner join (all records from both tables) can be created with the "all" keyword:
a left outer join of df1 and df2:
a right outer join of df1 and df2:
you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)
Subscript Method
A left outer join with df1 on the left using a subscript method would be:
The other combination of outer joins can be created by mungling the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
更新数据。加入数据集的方法。请参阅下面的每种联接类型的示例。有两种方法,一种来自
[。data.table
传递第二个数据时。将作为第一个参数作为子集,另一种方法是使用Merge
函数,该功能可发送到快速Data.Table方法。在基准测试基础r,sqldf,dplyr和data.table.
下方。
基准测试无关/未索引数据集。
基准测试是在50m-1行数据集上执行的,在JOIN列上有50m-2的共同值,因此可以测试每个方案(内部,左,右,完整),并且连接仍然不重要。这是JOIN的类型,其压力可以加入算法。时间与
sqldf:0.4.11
,dplyr:0.7.8
,data.table:1.12.0
。请注意,还有其他类型的连接可以使用
data.table
:- 在JOIN 上更新 - 如果您想从另一个表查找到主表
- 在JOIN上汇总>
- 重叠加入JOIN - 如果您想通过ranges
合并
- 滚动JOIN JOIN - 如果您希望合并能够通过向前或向后滚动或向后滚动来与preading/laster的值匹配
- non-equi join - 如果您的联接条件是非平等
代码的复制:
Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods, one from
[.data.table
when passing second data.table as the first argument to subset, another way is to usemerge
function which dispatches to fast data.table method.Below benchmark tests base R, sqldf, dplyr and data.table.
Benchmark tests unkeyed/unindexed datasets.
Benchmark is performed on 50M-1 rows datasets, there are 50M-2 common values on join column so each scenario (inner, left, right, full) can be tested and join is still not trivial to perform. It is type of join which well stress join algorithms. Timings are as of
sqldf:0.4.11
,dplyr:0.7.8
,data.table:1.12.0
.Be aware there are other types of joins you can perform using
data.table
:- update on join - if you want to lookup values from another table to your main table
- aggregate on join - if you want to aggregate on key you are joining you do not have to materialize all join results
- overlapping join - if you want to merge by ranges
- rolling join - if you want merge to be able to match to values from preceeding/following rows by rolling them forward or backward
- non-equi join - if your join condition is non-equal
Code to reproduce:
2014年的新事物:
尤其是如果您对一般数据操作感兴趣(包括分类,过滤,子集,摘要等),则绝对应该看一下
dplyr
,它带有一个多样性功能的所有旨在专门针对数据框架和某些其他数据库类型的工作。 甚至提供了相当复杂的SQL接口,甚至可以将SQL代码直接转换为R的函数R。
x y中有匹配值,以及x和y
left_join的所有列(x,y,by = null,copy = false,...)
:返回x的所有行,以及所有行x和ysemi_join的列
是的,仅保留x的列。
anti_join(x,y,by = null,copy = false,...)
:从x返回所有行在y中没有匹配值的情况下,仅保留x的列
,全部在这里非常详细。
选择列可以通过
选择(DF,“列”)
完成。如果这对您来说还不够SQL,则有sql()
函数,您可以在其中输入SQL Code AS-IS,并且它将执行您指定的操作,就像您在一直以来(有关更多信息,请参阅 dplyr/databases小插图)。例如,如果正确应用,则sql(“从Hflights中选择 *”)
将从“ Hflights” dplyr Table(a“ tbl”)中选择所有列。New in 2014:
Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at
dplyr
, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.The four joining-related functions in the dplyr package are (to quote):
inner_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows fromx where there are matching values in y, and all columns from x and y
left_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x, and all columns from x and ysemi_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x where there are matching values iny, keeping just columns from x.
anti_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from xwhere there are not matching values in y, keeping just columns from x
It's all here in great detail.
Selecting columns can be done by
select(df,"column")
. If that's not SQL-ish enough for you, then there's thesql()
function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly,sql("SELECT * FROM hflights")
will select all the columns from the "hflights" dplyr table (a "tbl").dplyr自0.4以来实施了所有这些加入,包括
exouts_join
,但值得注意的是,在0.4之前的前几个发行版中,它使用不提供offer_join
,并且AS结果,有很多非常糟糕的黑客解决方法用户代码在相当长的一段时间以后(您仍然可以在SO中找到此类代码,Kaggle答案,GitHub,github。因此,此答案仍然有用。)与join相关的发行亮点:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.1.3 (4/2014)
每个哈德利在该问题上的评论解决方法:
dplyr since 0.4 implemented all those joins including
outer_join
, but it was worth noting that for the first few releases prior to 0.4 it used not to offerouter_join
, and as a result there was a lot of really bad hacky workaround user code floating around for quite a while afterwards (you can still find such code in SO, Kaggle answers, github from that period. Hence this answer still serves a useful purpose.)Join-related release highlights:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3 (10/2014)
v0.2 (5/2014)
v0.1.3 (4/2014)
Workarounds per hadley's comments in that issue:
对于左将
0 ..*:0..1
基数或右连接的情况可以直接从木匠(0..1
表)将单方面列将其放置在joinee(0 ..*
表)上,从而将避免创建全新的数据表。这就需要将joinee的密钥列匹配到木匠中,并索引+订购木匠的行以进行分配。如果键是单列,那么我们可以使用一个调用
match()
进行匹配。我将在此答案中介绍这种情况。这是一个基于OP的示例,除了我在
df2
中添加了一个额外的行,并具有7个ID,以测试Joiner中未匹配键的情况。这是有效的df1
左JOINdf2
:在上面的我硬编码一个假设,即关键列是两个输入表的第一列。我认为,通常,这不是一个不合理的假设,因为,如果您有一个数据。使用键列的框架,如果未设置为Data.frame的第一列,那将很奇怪。一开始。您总是可以重新排序列以使其这样做。该假设的一个有利结果是,尽管我认为它只是用另一个假设替换一个假设,但密钥列的名称不必被硬编码。简洁是整数索引和速度的另一个优点。在下面的基准测试中,我将更改实施方式,以使用字符串名称索引来匹配竞争实现。
我认为,如果您有几张桌子要在一张大桌子上加入,这是一个特别合适的解决方案。反复重建每个合并的整个桌子将是不必要和效率低下的。
另一方面,如果出于任何原因需要大约在此操作中保持不变,则无法使用该解决方案,因为它会直接修改折磨。尽管在这种情况下,您可以简单地制作副本并在副本上执行就地分配。
附带说明,我简要研究了多列键的可能匹配解决方案。不幸的是,我发现的唯一匹配解决方案是:
匹配(互动(DF1 $ a,df1 $ b),交互(DF2 $ a,df2 $ b))
或与paste()
的相同想法。外部(df1 $ a,df2 $ a,`==`)&amp;外部(df1 $ b,df2 $ b,`==`)
。merge()
和基于等效的包装合并功能,该功能总是分配新表以返回合并结果,因此不适合基于原位分配的解决方案。例如,请参见不同数据帧上的列,结果获得其他列,匹配两个带有另外两个列的列, /a>,以及这个问题的骗子,我最初提出了现场解决方案,在r 中将两个数据帧与不同数量的行相结合。
基准测试
我决定进行自己的基准测试,以查看本地分配方法与此问题中提供的其他解决方案的比较。
测试代码:
这是我前面证明的OP的基准:
在这里,我在随机输入数据上基准测试,尝试不同的量表和两个输入表之间的密钥重叠模式的不同模式。该基准仍仅限于单列整数键的情况。同样,为了确保就地解决方案适用于同一表的左右连接,所有随机测试数据均使用
0..1:0..1
cardinality。这是通过对第一个数据的密钥列进行采样来实现的。在生成第二个data.frame的密钥列时,框架。我编写了一些代码来创建上述结果的日志图。我为每个重叠百分比生成了一个单独的图。这有点混乱,但是我喜欢拥有所有解决方案类型,并在同一地块中表示加入类型。
我使用样条插值显示了每个解决方案/连接类型组合的平滑曲线,并用单个PCH符号绘制。联接类型由PCH符号捕获,使用一个点,用于左右的内部,左右角括号,以及一颗钻石。解决方案类型由传说中所示的颜色捕获。
这是第二个大型基准,在钥匙的数量和类型方面更为重型列以及基数。对于此基准测试,我使用三个关键列:一个字符,一个整数和一个逻辑,没有对基数的限制(即,
0 ..*:0 ..*:*
)。 (通常不建议通过浮点比较并发症定义具有双重或复杂值的密钥列,并且基本上没有人使用原始类型,更不用说键列了,因此我还没有在键中包含这些类型为了信息的缘故,我最初尝试通过包含Posixct键列使用四个密钥列,但是Posixct类型与sqldf.indexed
解决方案效果不佳由于浮点比较异常,因此我将其删除。)所得图,使用上面给出的相同的绘图代码:
href =“ https://i.sstatic.net/ytero.png” rel =“ noreferrer”>
For the case of a left join with a
0..*:0..1
cardinality or a right join with a0..1:0..*
cardinality it is possible to assign in-place the unilateral columns from the joiner (the0..1
table) directly onto the joinee (the0..*
table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.If the key is a single column, then we can use a single call to
match()
to do the matching. This is the case I'll cover in this answer.Here's an example based on the OP, except I've added an extra row to
df2
with an id of 7 to test the case of a non-matching key in the joiner. This is effectivelydf1
left joindf2
:In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.
I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.
On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.
As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:
match(interaction(df1$a,df1$b),interaction(df2$a,df2$b))
, or the same idea withpaste()
.outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`)
.merge()
and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.
Benchmarking
I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.
Testing code:
Here's a benchmark of the example based on the OP that I demonstrated earlier:
Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses
0..1:0..1
cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.
I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.
Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is,
0..*:0..*
). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with thesqldf.indexed
solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)The resulting plots, using the same plotting code given above:
在加入两个数据帧时,每行约100万行,一个带2列,另一个具有〜20的数据,我出人意料地找到了
合并(...,all.x = true,all.y = true)< /code>更快,然后
dplyr :: full_join()
。这是在dplyr v0.4合并中需要约17秒,full_join需要〜65秒。
不过,有些食物是因为我通常默认要dplyr进行操纵任务。
In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found
merge(..., all.x = TRUE, all.y = TRUE)
to be faster thendplyr::full_join()
. This is with dplyr v0.4Merge takes ~17 seconds, full_join takes ~65 seconds.
Some food for though, since I generally default to dplyr for manipulation tasks.
Merge
函数我们可以选择左表或右表的变量,就像我们熟悉SQL中的Select语句一样(例如:选择A。* ...或从中选择B。*)。 ....)我们必须添加额外的代码,该代码将从新连接的表中子集。
sql: -
在a.customerid = b.customerid
上选择a。
r: -
merge(df1,df2,by.x =“ customerId”,by.y =“ customerId”)[,names(df1)]
< /li>
同样的方式
选择b。
r: -
merge(df1,df2,by.x =“ customerId”,by.y =
“ customerId”)[,名称(df2)]
merge
function we can select the variable of left table or right table, same way like we all familiar with select statement in SQL (EX : Select a.* ...or Select b.* from .....)We have to add extra code which will subset from the newly joined table .
SQL :-
select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
Same way
SQL :-
select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
更新加入。另外一个重要的SQL风格加入是“ “更新”>更新join “其中一列中的列中的列使用另一个表更新(或创建)表。
修改OP的示例表...
假设我们想将客户的状态从
CUST
添加到购买表,sales
,忽略年度列。使用基本R,我们可以识别匹配行,然后将复制值复制到:可以看到,
Match
从客户表中选择第一个匹配行。更新与多列连接。当我们仅加入单列并对第一场比赛感到满意时,上面的方法效果很好。假设我们希望在客户表中的测量年份与销售年相匹配。
正如 @bgoldst的答案所提到的那样,
与
互动的匹配
可能是这种情况的选项。更直接地,一个人可以使用数据。表:滚动更新加入。 另外,我们可能要采用最新的状态,以下是:在以下情况下,在以下情况下,在以下情况下:
三个示例都专注于创建/添加新列。请参阅相关的R FAQ 以获取更新/修改现有列的示例。
Update join. One other important SQL-style join is an "update join" where columns in one table are updated (or created) using another table.
Modifying the OP's example tables...
Suppose we want to add the customer's state from
cust
to the purchases table,sales
, ignoring the year column. With base R, we can identify matching rows and then copy values over:As can be seen here,
match
selects the first matching row from the customer table.Update join with multiple columns. The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.
As @bgoldst's answer mentions,
match
withinteraction
might be an option for this case. More straightforwardly, one could use data.table:Rolling update join. Alternately, we may want to take the last state the customer was found in:
The three examples above all focus on creating/adding a new column. See the related R FAQ for an example of updating/modifying an existing column.
对于所有列上的内部联接,您也可以从 data.table package或
Intersect
intersect intersect intersect 从 dplyr < dplyr << /em> - 包装作为Merge
的替代方案,而无需通过 -columns指定。这将使两个数据范围之间的行相等:
示例数据:
For an inner join on all columns, you could also use
fintersect
from the data.table-package orintersect
from the dplyr-package as an alternative tomerge
without specifying theby
-columns. This will give the rows that are equal between two dataframes:Example data:
collapse
2.0提供了另一个加入框架,并使用JOIN
。 它比任何其他选项都要快。collapse
2.0 provides another join framework withjoin
. It is noticeably faster than any other option.