如何连接(合并)数据框(内部、外部、左、右)
给定两个数据框:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
我怎样才能做到数据库风格,即sql风格,连接?也就是说,我如何获得:
-
的 内部联接 df1 和 df2:
仅返回左表在右表中具有匹配键的行。 -
df1
和的外连接 >df2
:
返回两个表中的所有行,连接左侧表中在右侧表中具有匹配键的记录。 -
df1 的左外连接(或简称左连接)
和df2
返回左表中的所有行以及右表中具有匹配键的任何行。 -
df1
和 右外部联接 和 <代码>df2
返回右表中的所有行以及左表中具有匹配键的任何行。
额外加分:
如何执行 SQL 样式的 select 语句?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
通过使用
merge
函数及其可选参数:内连接:
merge(df1, df2)
适用于这些示例是因为 R 自动通过公共变量名称连接框架,但您很可能希望指定merge(df1, df2, by = "CustomerId")
以确保您仅匹配您想要的领域。如果匹配变量在不同的数据框中具有不同的名称,您还可以使用by.x
和by.y
参数。外连接:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
左外层:
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
右外层:< /em>
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
交叉联接:< /strong>
merge(x = df1, y = df2, by = NULL)
就像内部联接一样,您可能希望将“CustomerId”显式传递给 R 作为匹配变量.我认为最好是明确声明要合并的标识符;如果输入 data.frames 意外更改,则更安全,并且以后更易于阅读。您可以通过给
by
一个向量来合并多列,例如by = c("CustomerId", "OrderId")
。如果要合并的列名称不同,您可以指定,例如
by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
,其中CustomerId_in_df1
是第一个数据框中的列名称,CustomerId_in_df2
是第二个数据框中的列名称。 (如果您需要合并多列,这些也可以是向量。)By using the
merge
function and its optional parameters:Inner join:
merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specifymerge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use theby.x
andby.y
parameters if the matching variables have different names in the different data frames.Outer join:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer:
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer:
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join:
merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.You can merge on multiple columns by giving
by
a vector, e.g.,by = c("CustomerId", "OrderId")
.If the column names to merge on are not the same, you can specify, e.g.,
by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
whereCustomerId_in_df1
is the name of the column in the first data frame andCustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)您还可以使用 Hadley Wickham 出色的 dplyr 包进行连接。
变异连接:使用 df2 中的匹配将列添加到 df1
过滤连接:过滤掉 df1 中的行,不修改列
You can do joins as well using Hadley Wickham's awesome dplyr package.
Mutating joins: add columns to df1 using matches in df2
Filtering joins: filter out rows in df1, don't modify columns
我建议您查看 Gabor Grothendieck 的 sqldf 包,它可以让您用SQL来表达这些操作。
我发现 SQL 语法比其 R 语法更简单、更自然(但这可能只是反映了我对 RDBMS 的偏见)。
有关联接的更多信息,请参阅 Gabor 的 sqldf GitHub。
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
有一种用于内部联接的 data.table 方法,该方法非常节省时间和内存(对于某些较大的 data.frame 来说是必需的):
merge
也适用于数据。表(因为它是通用的并调用merge.data.table
)data.table记录在stackoverflow上:
如何执行 data.table 合并操作< br>
将外键上的 SQL 连接转换为 R 数据。表语法
合并较大数据帧的有效替代方案 R
如何进行基本左移R 中与 data.table 的外连接?
还有一个选项是 plyr 包。 [2022 年注意事项:plyr 现已退役,并已被 dplyr 取代。 此答案中描述了 dplyr 中的联接操作。]
type
的选项:内部
、左
、右
、全
。来自
?join
:与merge
不同,[join
]无论使用什么连接类型都会保留x的顺序。There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
merge
also works on data.tables (as it is generic and callsmerge.data.table
)data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
Yet another option is the
join
function found in the plyr package. [Note from 2022: plyr is now retired and has been superseded by dplyr. Join operations in dplyr are described in this answer.]Options for
type
:inner
,left
,right
,full
.From
?join
: Unlikemerge
, [join
] preserves the order of x no matter what join type is used.有一些很好的例子可以完成此操作R 维基。我将在这里窃取一些:
合并方法
由于您的键名称相同,因此执行内部联接的简短方法是 merge():
完整的内部联接(两个表中的所有记录)可以是使用“all”关键字创建:
df1 和 df2 的左外连接:
df1 和 df2 的右外连接:
您可以翻转它们、拍打它们并摩擦它们以获得您询问的其他两个外连接:)
下标方法
使用下标方法将 df1 放在左侧的左外连接将是:
可以通过混合左外连接下标示例来创建外连接的其他组合。 (是的,我知道这相当于说“我把它留给读者作为练习......”)
There are some good examples of doing this over at the R Wiki. I'll steal a couple here:
Merge Method
Since your keys are named the same the short way to do an inner join is merge():
a full inner join (all records from both tables) can be created with the "all" keyword:
a left outer join of df1 and df2:
a right outer join of df1 and df2:
you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)
Subscript Method
A left outer join with df1 on the left using a subscript method would be:
The other combination of outer joins can be created by mungling the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
更新用于连接数据集的 data.table 方法。请参阅以下每种连接类型的示例。有两种方法,一种来自
[.data.table
,将第二个 data.table 作为第一个参数传递给子集,另一种方法是使用merge
函数,该函数分派到快速数据.表方法。以下基准测试基于 R、sqldf、dplyr 和 data.table。
基准测试未加密/未索引的数据集。
基准测试是在 50M-1 行数据集上执行的,连接列上有 50M-2 个公共值,因此可以测试每个场景(内部、左、右、完整),并且连接执行起来仍然不简单。它是强调连接算法的连接类型。计时截至
sqldf:0.4.11
、dplyr:0.7.8
、data.table:1.12.0
。请注意,您还可以使用
data.table
执行其他类型的联接:- 连接更新 - 如果您想将另一个表中的值查找到主表
- 加入时聚合 - 如果您想聚合您正在加入的键,则不必实现所有连接结果
- 重叠连接 - 如果您想按范围合并
- 滚动连接 - 如果您希望合并能够通过向前或向后滚动来匹配前一行/后一行中的值
- 非等值连接 - 如果您的连接条件为非等值
重现代码:
Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods, one from
[.data.table
when passing second data.table as the first argument to subset, another way is to usemerge
function which dispatches to fast data.table method.Below benchmark tests base R, sqldf, dplyr and data.table.
Benchmark tests unkeyed/unindexed datasets.
Benchmark is performed on 50M-1 rows datasets, there are 50M-2 common values on join column so each scenario (inner, left, right, full) can be tested and join is still not trivial to perform. It is type of join which well stress join algorithms. Timings are as of
sqldf:0.4.11
,dplyr:0.7.8
,data.table:1.12.0
.Be aware there are other types of joins you can perform using
data.table
:- update on join - if you want to lookup values from another table to your main table
- aggregate on join - if you want to aggregate on key you are joining you do not have to materialize all join results
- overlapping join - if you want to merge by ranges
- rolling join - if you want merge to be able to match to values from preceeding/following rows by rolling them forward or backward
- non-equi join - if your join condition is non-equal
Code to reproduce:
2014 年新增功能:
特别是如果您还对一般数据操作(包括排序、过滤、取子集、汇总等)感兴趣,那么您绝对应该看看
dplyr
,它提供了多种功能全部功能旨在促进您专门处理数据框架和某些其他数据库类型的工作。它甚至提供了相当复杂的 SQL 接口,甚至还有一个将(大多数)SQL 代码直接转换为 R 的函数。dplyr 包中的四个与连接相关的函数是(引用):
x 中 y 中存在匹配值,以及 x 和 y 中的所有列
left_join(x, y, by = NULL, copy = FALSE, ...)
:返回 x 中的所有行,以及所有x 和 y 中的列semi_join(x, y, by = NULL, copy = FALSE, ...)
:返回 x 中存在匹配值的所有行y,仅保留 x 中的列。
anti_join(x, y, by = NULL, copy = FALSE, ...)
:返回 x 中的所有行如果 y 中没有匹配的值,则仅保留 x 中的列
这都是 这里非常详细。
选择列可以通过
select(df,"column")
来完成。如果这对您来说还不够 SQL-ish,那么可以使用sql()
函数,您可以按原样输入 SQL 代码,它会执行您指定的操作,就像您在其中编写的那样R 一直(有关更多信息,请参阅 dplyr/databases小插图)。例如,如果应用正确,sql("SELECT * FROM hflights")
将从“hflights”dplyr 表(“tbl”)中选择所有列。New in 2014:
Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at
dplyr
, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.The four joining-related functions in the dplyr package are (to quote):
inner_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows fromx where there are matching values in y, and all columns from x and y
left_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x, and all columns from x and ysemi_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from x where there are matching values iny, keeping just columns from x.
anti_join(x, y, by = NULL, copy = FALSE, ...)
: return all rows from xwhere there are not matching values in y, keeping just columns from x
It's all here in great detail.
Selecting columns can be done by
select(df,"column")
. If that's not SQL-ish enough for you, then there's thesql()
function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly,sql("SELECT * FROM hflights")
will select all the columns from the "hflights" dplyr table (a "tbl").dplyr 自 0.4 起实现了所有这些连接,包括
outer_join
,但值得注意的是,对于 0.4 之前的前几个版本,它过去不提供outer_join
,并且结果,之后相当长一段时间里出现了很多非常糟糕的 hacky 解决方法用户代码(你仍然可以在那个时期的 SO、Kaggle 答案、github 中找到这样的代码。因此这个答案仍然有一个有用的目的。)加入相关发布亮点:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3< /strong> (10/2014)
v0.2 (5/2014)
v0.1.3 (4/2014)
根据 hadley 在该问题中的评论,解决方法
dplyr since 0.4 implemented all those joins including
outer_join
, but it was worth noting that for the first few releases prior to 0.4 it used not to offerouter_join
, and as a result there was a lot of really bad hacky workaround user code floating around for quite a while afterwards (you can still find such code in SO, Kaggle answers, github from that period. Hence this answer still serves a useful purpose.)Join-related release highlights:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3 (10/2014)
v0.2 (5/2014)
v0.1.3 (4/2014)
Workarounds per hadley's comments in that issue:
对于具有
0..*:0..1
基数的左连接或具有0..1:0..*
基数的右连接的情况可以将连接器(0..1
表)中的单边列直接分配到连接器(0..*
表)上,从而避免创建全新的数据表。这需要将连接器中的键列匹配到连接器中,并相应地对连接器的行进行索引和排序以进行分配。如果键是单列,那么我们可以使用对
match()
进行匹配。这就是我将在这个答案中介绍的情况。这是一个基于 OP 的示例,只不过我向 df2 添加了一个额外的行,其 id 为 7,以测试连接器中不匹配键的情况。这实际上是 df1 左连接 df2:
在上面我硬编码了一个假设,即键列是两个输入表的第一列。我认为,一般来说,这并不是一个不合理的假设,因为,如果您有一个带有关键列的 data.frame,如果它没有被设置为 data.frame 的第一列,那会很奇怪一开始。您始终可以对列重新排序以实现此目的。这种假设的一个有利结果是键列的名称不必进行硬编码,尽管我认为它只是用一个假设替换另一个假设。简洁是整数索引的另一个优点以及速度。在下面的基准测试中,我将更改实现以使用字符串名称索引来匹配竞争实现。
我认为如果您有多个表想要针对单个大表进行左连接,那么这是一个特别合适的解决方案。为每次合并重复重建整个表是不必要的并且效率低下。
另一方面,如果您出于某种原因需要加入者通过此操作保持不变,则不能使用此解决方案,因为它直接修改加入者。尽管在这种情况下,您可以简单地制作副本并在副本上执行就地分配。
作为旁注,我简要研究了多列键的可能匹配解决方案。不幸的是,我找到的唯一匹配的解决方案是:
match(interaction(df1$a,df1$b),interaction(df2$a,df2$b))
,或与paste()
相同的想法。outer(df1$a,df2$a,`==`) &外部(df1$b,df2$b,`==`)
。merge()
和等效的基于包的合并函数,它们总是分配一个新表来返回合并结果,因此不适合基于就地分配的解决方案。例如,请参阅匹配多个不同数据帧上的列并获取其他列作为结果,匹配两个列与其他两列,多列匹配< /a>,以及这个问题的欺骗,我最初提出了就地解决方案,在 R 中合并两个行数不同的数据框。
基准测试
我决定做我自己的基准测试,看看就地分配方法与这个问题中提供的其他解决方案相比如何。
测试代码:
这是基于我之前演示的 OP 的示例基准测试:
这里我对随机输入数据进行基准测试,尝试两个输入表之间不同的尺度和不同的键重叠模式。该基准测试仍然仅限于单列整数键的情况。此外,为了确保就地解决方案适用于同一表的左连接和右连接,所有随机测试数据都使用
0..1:0..1
基数。这是通过在生成第二个data.frame的关键列时对第一个data.frame的关键列进行采样而不替换来实现的。我编写了一些代码来创建上述结果的对数图。我为每个重叠百分比生成了一个单独的图。它有点混乱,但我喜欢在同一个图中表示所有解决方案类型和连接类型。
我使用样条插值来显示每个解决方案/连接类型组合的平滑曲线,并使用单独的 pch 符号绘制。连接类型由 pch 符号捕获,使用点表示内部,使用左尖括号和右尖括号表示左和右,使用菱形表示完整。解决方案类型由颜色捕获,如图例所示。
这是第二个大规模基准测试,在键的数量和类型方面更加繁重列,以及基数。对于此基准测试,我使用三个关键列:一个字符、一个整数和一个逻辑列,基数没有限制(即
0..*:0..*
)。 (一般来说,由于浮点比较的复杂性,不建议使用双精度或复数值定义键列,而且基本上没有人使用原始类型,更不用说键列了,所以我没有将这些类型包含在键中另外,为了提供信息,我最初尝试通过包含 POSIXct 键列来使用四个键列,但由于某种原因,POSIXct 类型与sqldf.indexed
解决方案不能很好地配合。由于浮点比较异常,所以我将其删除。)使用上面给出的相同绘图代码生成的图:
For the case of a left join with a
0..*:0..1
cardinality or a right join with a0..1:0..*
cardinality it is possible to assign in-place the unilateral columns from the joiner (the0..1
table) directly onto the joinee (the0..*
table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.If the key is a single column, then we can use a single call to
match()
to do the matching. This is the case I'll cover in this answer.Here's an example based on the OP, except I've added an extra row to
df2
with an id of 7 to test the case of a non-matching key in the joiner. This is effectivelydf1
left joindf2
:In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.
I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.
On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.
As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:
match(interaction(df1$a,df1$b),interaction(df2$a,df2$b))
, or the same idea withpaste()
.outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`)
.merge()
and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.
Benchmarking
I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.
Testing code:
Here's a benchmark of the example based on the OP that I demonstrated earlier:
Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses
0..1:0..1
cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.
I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.
Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is,
0..*:0..*
). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with thesqldf.indexed
solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)The resulting plots, using the same plotting code given above:
在连接两个数据框,每个数据框约有 100 万行,一个有 2 列,另一个有约 20 列时,我惊讶地发现
merge(..., all.x = TRUE, all.y = TRUE)< /code> 比 dplyr::full_join() 更快。这是 dplyr v0.4
合并需要约 17 秒,full_join 需要约 65 秒。
不过,有些食物,因为我通常默认使用 dplyr 来执行操作任务。
In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found
merge(..., all.x = TRUE, all.y = TRUE)
to be faster thendplyr::full_join()
. This is with dplyr v0.4Merge takes ~17 seconds, full_join takes ~65 seconds.
Some food for though, since I generally default to dplyr for manipulation tasks.
merge
函数我们可以选择左表或右表的变量,就像我们熟悉SQL中的select语句一样(例如:Select a.* ...或Select b.* from . ....)我们必须添加额外的代码,这些代码将从新加入的表中提取子集。
SQL :-
从 df1 a 内连接 df2 b 中选择 a.* a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
< /里>
同样的方式
SQL :-
select b.* from df1 a inside join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
merge
function we can select the variable of left table or right table, same way like we all familiar with select statement in SQL (EX : Select a.* ...or Select b.* from .....)We have to add extra code which will subset from the newly joined table .
SQL :-
select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
Same way
SQL :-
select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :-
merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
更新联接。 另一种重要的 SQL 样式联接是“更新联接”,其中一个列中的列使用另一个表更新(或创建)表。
修改 OP 的示例表...
假设我们要将客户的状态从
cust
添加到购买表sales
,忽略年份列。使用基本 R,我们可以识别匹配行,然后复制值:如此处所示,
match
从客户表中选择第一个匹配行。更新多列连接。当我们仅连接单列并对第一个匹配项感到满意时,上述方法效果很好。假设我们希望客户表中的测量年份与销售年份相匹配。
正如 @bgoldst 的回答所提到的,
match
与interaction
可能是这种情况的一个选项。更直接地,可以使用 data.table:滚动更新联接。 或者,我们可能希望获取客户发现的最后状态:
上面的三个示例都侧重于创建/添加新列。有关更新/修改现有列的示例,请参阅相关的 R 常见问题解答。
Update join. One other important SQL-style join is an "update join" where columns in one table are updated (or created) using another table.
Modifying the OP's example tables...
Suppose we want to add the customer's state from
cust
to the purchases table,sales
, ignoring the year column. With base R, we can identify matching rows and then copy values over:As can be seen here,
match
selects the first matching row from the customer table.Update join with multiple columns. The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.
As @bgoldst's answer mentions,
match
withinteraction
might be an option for this case. More straightforwardly, one could use data.table:Rolling update join. Alternately, we may want to take the last state the customer was found in:
The three examples above all focus on creating/adding a new column. See the related R FAQ for an example of updating/modifying an existing column.
对于所有列的内部联接,您还可以使用 data.table 包中的
fintersect
或 dplyr< 中的intersect
/em>-package 作为merge
的替代方案,无需指定by
-列。这将给出两个数据帧之间相等的行:示例数据:
For an inner join on all columns, you could also use
fintersect
from the data.table-package orintersect
from the dplyr-package as an alternative tomerge
without specifying theby
-columns. This will give the rows that are equal between two dataframes:Example data:
collapse
2.0 提供了另一个连接框架join
。 它明显比任何其他选项更快。collapse
2.0 provides another join framework withjoin
. It is noticeably faster than any other option.