如何加入(合并)数据帧(内部,外部,左,右)
给定两个数据帧:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
我该如何执行数据库样式,即>?也就是说,我如何得到:
- noreferrer“> ninen> innin> ninen join
of
df1
和df2
:
仅返回左表在右表中具有匹配键的行。 - an ofter Join > df2 :
从两个表中返回所有行,从右表中具有匹配键的左侧连接记录。 - a 左outer(或简单左JOIN)(或简单左JOIN) 和
df2
从左表返回所有行,以及从右表中匹配键的任何行。 - a
df1
df1 和 和 df2
从右表返回所有行,并从左表中返回带有匹配键的任何行。
额外的信用:
我如何执行SQL样式选择语句?
Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database style, i.e., sql style, joins? That is, how do I get:
- An inner join of
df1
anddf2
:
Return only the rows in which the left table have matching keys in the right table. - An outer join of
df1
anddf2
:
Returns all rows from both tables, join records from the left which have matching keys in the right table. - A left outer join (or simply left join) of
df1
anddf2
Return all rows from the left table, and any rows with matching keys from the right table. - A right outer join of
df1
anddf2
Return all rows from the right table, and any rows with matching keys from the left table.
Extra credit:
How can I do a SQL style select statement?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
通过使用
merge
函数及其可选参数:内连接:
merge(df1, df2)
适用于这些示例是因为 R 自动通过公共变量名称连接框架,但您很可能希望指定merge(df1, df2, by = "CustomerId")
以确保您仅匹配您想要的领域。如果匹配变量在不同的数据框中具有不同的名称,您还可以使用by.x
和by.y
参数。外连接:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
左外:
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
右外部:
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
交叉联接:
merge(x = df1, y = df2, by = NULL)
就像内部联接一样,您可能希望显式地将“CustomerId”传递给 R作为匹配变量。我认为最好明确地声明要合并的标识符;如果输入 data.frames 意外更改,则更安全,并且以后更易于阅读。您可以通过给
by
一个向量来合并多列,例如by = c("CustomerId", "OrderId")
。如果要合并的列名称不同,您可以指定,例如
by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
,其中CustomerId_in_df1
是第一个数据框中的列名称,CustomerId_in_df2
是第二个数据框中的列名称。 (如果您需要合并多列,这些也可以是向量。)By using the
merge
function and its optional parameters:Inner join:
merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specifymerge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use theby.x
andby.y
parameters if the matching variables have different names in the different data frames.Outer join:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer:
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer:
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join:
merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.You can merge on multiple columns by giving
by
a vector, e.g.,by = c("CustomerId", "OrderId")
.If the column names to merge on are not the same, you can specify, e.g.,
by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
whereCustomerId_in_df1
is the name of the column in the first data frame andCustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)您也可以使用Hadley Wickham的Awesome dplyr 包装。
突变连接:使用DF2
过滤连接中的匹配点在DF1中添加列:DF1中的行过滤器,请勿修改列
You can do joins as well using Hadley Wickham's awesome dplyr package.
Mutating joins: add columns to df1 using matches in df2
Filtering joins: filter out rows in df1, don't modify columns
我建议您查看 Gabor Grothendieck 的 sqldf 包,它可以让您用SQL来表达这些操作。
我发现 SQL 语法比其 R 语法更简单、更自然(但这可能只是反映了我对 RDBMS 的偏见)。
有关联接的更多信息,请参阅 Gabor 的 sqldf GitHub。
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
有一种用于内部联接的 data.table 方法,该方法非常节省时间和内存(对于某些较大的 data.frame 来说是必需的):
merge
也适用于数据。表(因为它是通用的并调用merge.data.table
)data.table记录在stackoverflow上:
如何执行 data.table 合并操作< br>
将外键上的 SQL 连接转换为 R 数据。表语法
合并较大数据帧的有效替代方案 R
如何进行基本左移R 中与 data.table 的外连接?
还有一个选项是 plyr 包。 [2022 年注意事项:plyr 现已退役,并已被 dplyr 取代。 此答案中描述了 dplyr 中的联接操作。]
type
的选项:内部
、左
、右
、全
。来自
?join
:与merge
不同,[join
]无论使用什么连接类型都会保留x的顺序。There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
merge
also works on data.tables (as it is generic and callsmerge.data.table
)data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
Yet another option is the
join
function found in the plyr package. [Note from 2022: plyr is now retired and has been superseded by dplyr. Join operations in dplyr are described in this answer.]Options for
type
:inner
,left
,right
,full
.From
?join
: Unlikemerge
, [join
] preserves the order of x no matter what join type is used.有一些很好的例子可以完成此操作R 维基。我将在这里窃取一些:
合并方法
由于您的键名称相同,因此执行内部联接的简短方法是 merge():
完整的内部联接(两个表中的所有记录)可以是使用“all”关键字创建:
df1 和 df2 的左外连接:
df1 和 df2 的右外连接:
您可以翻转它们、拍打它们并摩擦它们以获得您要求的其他两个外连接关于:)
下标方法
使用下标方法在左侧使用 df1 的左外连接将是:
外连接的其他组合可以通过混合左外连接下标示例来创建。 (是的,我知道这相当于说“我把它留给读者作为练习......”)
There are some good examples of doing this over at the R Wiki. I'll steal a couple here:
Merge Method
Since your keys are named the same the short way to do an inner join is merge():
a full inner join (all records from both tables) can be created with the "all" keyword:
a left outer join of df1 and df2:
a right outer join of df1 and df2:
you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)
Subscript Method
A left outer join with df1 on the left using a subscript method would be:
The other combination of outer joins can be created by mungling the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
更新用于连接数据集的 data.table 方法。请参阅以下每种连接类型的示例。有两种方法,一种来自
[.data.table
,将第二个 data.table 作为第一个参数传递给子集,另一种方法是使用merge
函数,该函数分派到快速数据.表方法。以下基准测试基于 R、sqldf、dplyr 和 data.table。
基准测试未加密/未索引的数据集。
基准测试是在 50M-1 行数据集上执行的,连接列上有 50M-2 个公共值,因此可以测试每个场景(内部、左、右、完整),并且连接执行起来仍然不简单。它是强调连接算法的连接类型。计时截至
sqldf:0.4.11
、dplyr:0.7.8
、data.table:1.12.0
。请注意,您还可以使用
data.table
执行其他类型的联接:- 连接更新 - 如果您想将另一个表中的值查找到主表
- 加入时聚合 - 如果您想聚合您正在加入的键,则不必实现所有连接结果
- 重叠连接 - 如果您想按范围合并
- 滚动连接 - 如果您希望合并能够通过向前或向后滚动来匹配前一行/后一行中的值
- 非等值连接 - 如果您的连接条件为非等值
重现代码:
Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods, one from
[.data.table
when passing second data.table as the first argument to subset, another way is to usemerge
function which dispatches to fast data.table method.Below benchmark tests base R, sqldf, dplyr and data.table.
Benchmark tests unkeyed/unindexed datasets.
Benchmark is performed on 50M-1 rows datasets, there are 50M-2 common values on join column so each scenario (inner, left, right, full) can be tested and join is still not trivial to perform. It is type of join which well stress join algorithms. Timings are as of
sqldf:0.4.11
,dplyr:0.7.8
,data.table:1.12.0
.Be aware there are other types of joins you can perform using
data.table
:- update on join - if you want to lookup values from another table to your main table
- aggregate on join - if you want to aggregate on key you are joining you do not have to materialize all join results
- overlapping join - if you want to merge by ranges
- rolling join - if you want merge to be able to match to values from preceeding/following rows by rolling them forward or backward
- non-equi join - if your join condition is non-equal
Code to reproduce:
2014年的新事物:
尤其是如果您对一般数据操作感兴趣(包括分类,过滤,子集,摘要等),则绝对应该看一下
dplyr
,它带有一个多样性功能的所有旨在专门针对数据框架和某些其他数据库类型的工作。 甚至提供了相当复杂的SQL接口,甚至可以将SQL代码直接转换为R的函数R。
x y中有匹配值,以及x和y
left_join的所有列(x,y,by = null,copy = false,...)
:返回x的所有行,以及所有行x和ysemi_join的列
是的,仅保留x的列。
anti_join(x,y,by = null,copy = false,...)
:从x返回所有行在y中没有匹配值的情况下,仅保留x的列
,全部在这里非常详细。
选择列可以通过
选择(DF,“列”)
完成。如果这对您来说还不够SQL,则有sql()
函数,您可以在其中输入SQL Code AS-IS,并且它将执行您指定的操作,就像您在一直以来(有关更多信息,请参阅 dplyr/databases小插图)。例如,如果正确应用,则sql(“从Hflights中选择 *”)
将从“ Hflights” dplyr Table(a“ tbl”)中选择所有列。New in 2014:
Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at
dplyr
, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.The four joining-related functions in the dplyr package are (to quote):
inner_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows fromx where there are matching values in y, and all columns from x and y
left_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x, and all columns from x and ysemi_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x where there are matching values iny, keeping just columns from x.
anti_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from xwhere there are not matching values in y, keeping just columns from x
It's all here in great detail.
Selecting columns can be done by
select(df,"column")
. If that's not SQL-ish enough for you, then there's thesql()
function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly,sql("SELECT * FROM hflights")
will select all the columns from the "hflights" dplyr table (a "tbl").dplyr 自 0.4 起实现了所有这些连接,包括
outer_join
,但值得注意的是,对于 0.4 之前的前几个版本,它过去不提供outer_join
,并且结果,之后有很多非常糟糕的 hacky 解决方法用户代码在相当长的一段时间内流传(你仍然可以在那个时期的 SO、Kaggle 答案、github 中找到这样的代码。因此,这个答案仍然有用)与加入相关的发布亮点:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3< /strong> (10/2014)
v0.2 ( 5/2014)
v0.1.3 (4/2014)
根据 hadley 在该问题中的评论,解决方法是:
dplyr since 0.4 implemented all those joins including
outer_join
, but it was worth noting that for the first few releases prior to 0.4 it used not to offerouter_join
, and as a result there was a lot of really bad hacky workaround user code floating around for quite a while afterwards (you can still find such code in SO, Kaggle answers, github from that period. Hence this answer still serves a useful purpose.)Join-related release highlights:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3 (10/2014)
v0.2 (5/2014)
v0.1.3 (4/2014)
Workarounds per hadley's comments in that issue:
对于左将
0 ..*:0..1
基数或右连接的情况可以直接从木匠(0..1
表)将单方面列将其放置在joinee(0 ..*
表)上,从而将避免创建全新的数据表。这就需要将joinee的密钥列匹配到木匠中,并索引+订购木匠的行以进行分配。如果键是单列,那么我们可以使用一个调用
match()
进行匹配。我将在此答案中介绍这种情况。这是一个基于OP的示例,除了我在
df2
中添加了一个额外的行,并具有7个ID,以测试Joiner中未匹配键的情况。这是有效的df1
左JOINdf2
:在上面的我硬编码一个假设,即关键列是两个输入表的第一列。我认为,通常,这不是一个不合理的假设,因为,如果您有一个数据。使用键列的框架,如果未设置为Data.frame的第一列,那将很奇怪。一开始。您总是可以重新排序列以使其这样做。该假设的一个有利结果是,尽管我认为它只是用另一个假设替换一个假设,但密钥列的名称不必被硬编码。简洁是整数索引和速度的另一个优点。在下面的基准测试中,我将更改实施方式,以使用字符串名称索引来匹配竞争实现。
我认为,如果您有几张桌子要在一张大桌子上加入,这是一个特别合适的解决方案。反复重建每个合并的整个桌子将是不必要和效率低下的。
另一方面,如果出于任何原因需要大约在此操作中保持不变,则无法使用该解决方案,因为它会直接修改折磨。尽管在这种情况下,您可以简单地制作副本并在副本上执行就地分配。
附带说明,我简要研究了多列键的可能匹配解决方案。不幸的是,我发现的唯一匹配解决方案是:
匹配(互动(DF1 $ a,df1 $ b),交互(DF2 $ a,df2 $ b))
或与paste()
的相同想法。外部(df1 $ a,df2 $ a,`==`)&amp;外部(df1 $ b,df2 $ b,`==`)
。merge()
和基于等效的包装合并功能,该功能总是分配新表以返回合并结果,因此不适合基于原位分配的解决方案。例如,请参见不同数据帧上的列,结果获得其他列,匹配两个带有另外两个列的列, /a>,以及这个问题的骗子,我最初提出了现场解决方案,在r 中将两个数据帧与不同数量的行相结合。
基准测试
我决定进行自己的基准测试,以查看本地分配方法与此问题中提供的其他解决方案的比较。
测试代码:
这是我前面证明的OP的基准:
在这里,我在随机输入数据上基准测试,尝试不同的量表和两个输入表之间的密钥重叠模式的不同模式。该基准仍仅限于单列整数键的情况。同样,为了确保就地解决方案适用于同一表的左右连接,所有随机测试数据均使用
0..1:0..1
cardinality。这是通过对第一个数据的密钥列进行采样来实现的。在生成第二个data.frame的密钥列时,框架。我编写了一些代码来创建上述结果的日志图。我为每个重叠百分比生成了一个单独的图。这有点混乱,但是我喜欢拥有所有解决方案类型,并在同一地块中表示加入类型。
我使用样条插值显示了每个解决方案/连接类型组合的平滑曲线,并用单个PCH符号绘制。联接类型由PCH符号捕获,使用一个点,用于左右的内部,左右角括号,以及一颗钻石。解决方案类型由传说中所示的颜色捕获。
这是第二个大型基准,在钥匙的数量和类型方面更为重型列以及基数。对于此基准测试,我使用三个关键列:一个字符,一个整数和一个逻辑,没有对基数的限制(即,
0 ..*:0 ..*:*
)。 (通常不建议通过浮点比较并发症定义具有双重或复杂值的密钥列,并且基本上没有人使用原始类型,更不用说键列了,因此我还没有将这些类型包括在键中为了信息的缘故,我最初尝试通过包含Posixct键列使用四个密钥列,但是Posixct类型与sqldf.indexed
解决方案效果不佳由于浮点比较异常,因此我将其删除。)所得图,使用上面给出的相同的绘图代码:
href =“ https://i.sstatic.net/ytero.png” rel =“ noreferrer”>
For the case of a left join with a
0..*:0..1
cardinality or a right join with a0..1:0..*
cardinality it is possible to assign in-place the unilateral columns from the joiner (the0..1
table) directly onto the joinee (the0..*
table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.If the key is a single column, then we can use a single call to
match()
to do the matching. This is the case I'll cover in this answer.Here's an example based on the OP, except I've added an extra row to
df2
with an id of 7 to test the case of a non-matching key in the joiner. This is effectivelydf1
left joindf2
:In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.
I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.
On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.
As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:
match(interaction(df1$a,df1$b),interaction(df2$a,df2$b))
, or the same idea withpaste()
.outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`)
.merge()
and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.
Benchmarking
I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.
Testing code:
Here's a benchmark of the example based on the OP that I demonstrated earlier:
Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses
0..1:0..1
cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.
I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.
Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is,
0..*:0..*
). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with thesqldf.indexed
solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)The resulting plots, using the same plotting code given above:
在连接两个数据框,每个数据框约有 100 万行,一个有 2 列,另一个有约 20 列时,我惊讶地发现
merge(..., all.x = TRUE, all.y = TRUE)< /code> 比 dplyr::full_join() 更快。这是 dplyr v0.4 的情况,
合并需要约 17 秒,full_join 需要约 65 秒。
不过,有些食物,因为我通常默认使用 dplyr 来执行操作任务。
In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found
merge(..., all.x = TRUE, all.y = TRUE)
to be faster thendplyr::full_join()
. This is with dplyr v0.4Merge takes ~17 seconds, full_join takes ~65 seconds.
Some food for though, since I generally default to dplyr for manipulation tasks.
merge
函数我们可以选择左表或右表的变量,就像我们熟悉SQL中的select语句一样(例如:Select a.* ...或Select b.* from . ....)我们必须添加额外的代码,这些代码将从新加入的表中提取子集。
SQL :-
从 df1 a 内连接 df2 b 中选择 a.* a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
< /里>
同样的方式
SQL :-
select b.* from df1 a 内联 df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
merge
function we can select the variable of left table or right table, same way like we all familiar with select statement in SQL (EX : Select a.* ...or Select b.* from .....)We have to add extra code which will subset from the newly joined table .
SQL :-
select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
Same way
SQL :-
select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
更新加入。另外一个重要的SQL风格加入是“ “更新”>更新join “其中一列中的列中的列使用另一个表更新(或创建)表。
修改OP的示例表...
假设我们想将客户的状态从
CUST
添加到购买表,sales
,忽略年度列。使用基本R,我们可以识别匹配行,然后将复制值复制到:可以看到,
Match
从客户表中选择第一个匹配行。更新与多列连接。当我们仅加入单列并对第一场比赛感到满意时,上面的方法效果很好。假设我们希望在客户表中的测量年份与销售年相匹配。
正如 @bgoldst的答案所提到的那样,
与
可能是这种情况的选项。更直接地,一个人可以使用数据。表:互动
匹配滚动更新加入。 另外,我们可能要采用最新的状态,以下是:在以下情况下,在以下情况下,在以下情况下:
三个示例都专注于创建/添加新列。请参阅相关的R FAQ 以获取更新/修改现有列的示例。
Update join. One other important SQL-style join is an "update join" where columns in one table are updated (or created) using another table.
Modifying the OP's example tables...
Suppose we want to add the customer's state from
cust
to the purchases table,sales
, ignoring the year column. With base R, we can identify matching rows and then copy values over:As can be seen here,
match
selects the first matching row from the customer table.Update join with multiple columns. The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.
As @bgoldst's answer mentions,
match
withinteraction
might be an option for this case. More straightforwardly, one could use data.table:Rolling update join. Alternately, we may want to take the last state the customer was found in:
The three examples above all focus on creating/adding a new column. See the related R FAQ for an example of updating/modifying an existing column.
对于所有列的内部联接,您还可以使用 data.table 包中的
fintersect
或 dplyr< 中的intersect
/em>-package 作为merge
的替代方案,无需指定by
-列。这将给出两个数据帧之间相等的行:示例数据:
For an inner join on all columns, you could also use
fintersect
from the data.table-package orintersect
from the dplyr-package as an alternative tomerge
without specifying theby
-columns. This will give the rows that are equal between two dataframes:Example data:
collapse
2.0 提供了另一个连接框架join
。 它明显比任何其他选项更快。collapse
2.0 provides another join framework withjoin
. It is noticeably faster than any other option.