当两个数据框具有不同的列集时,按行合并两个数据框(rbind)
是否可以行绑定两个不具有相同列集的数据框?我希望保留绑定后不匹配的列。
Is it possible to row bind two data frames that don't have the same set of columns? I am hoping to retain the columns that do not match after the bind.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(15)
如果 df1 中的列是 df2 中的列的子集(按列名称):
If the columns in df1 is a subset of those in df2 (by column names):
您也可以只提取常见的列名称。
You could also just pull out the common column names.
我编写了一个函数来执行此操作,因为我喜欢我的代码告诉我是否出现问题。该函数将明确告诉您哪些列名不匹配以及类型是否不匹配。然后它会尽力组合 data.frames。限制是您一次只能组合两个 data.frame。
I wrote a function to do this because I like my code to tell me if something is wrong. This function will explicitly tell you which column names don't match and if you have a type mismatch. Then it will do its best to combine the data.frames anyway. The limitation is that you can only combine two data.frames at a time.
gtools/smartbind 不喜欢使用日期,可能是因为它是 as.vectoring。所以这是我的解决方案......
gtools/smartbind didnt like working with Dates, probably because it was as.vectoring. So here's my solution...
只是为了文档。您可以按以下形式尝试
Stack
库及其函数Stack
:我还有这样的印象:对于大型数据集,它比其他方法更快。
Just for the documentation. You can try the
Stack
library and its functionStack
in the following form:I have also the impression that it is faster than other methods for large data sets.
也许我完全误解了你的问题,但是“我希望保留绑定后不匹配的列”让我认为你正在寻找
左连接
或右连接 类似于 SQL 查询。 R 具有
merge
函数,可让您指定左连接、右连接或内连接,类似于 SQL 中的表连接。关于这个主题已经有一个很好的问题和答案:如何连接(合并)数据框(内部、外部、左、右)?
Maybe I completely misread your question, but the "I am hoping to retain the columns that do not match after the bind" makes me think you are looking for a
left join
orright join
similar to an SQL query. R has themerge
function that lets you specify left, right, or inner joins similar to joining tables in SQL.There is already a great question and answer on this topic here: How to join (merge) data frames (inner, outer, left, right)?
您还可以使用
sjmisc::add_rows()
,它使用 dplyr::bind_rows(),但与bind_rows()
不同,add_rows()
保留属性,因此对于 标记数据。请参阅以下带有标记数据集的示例。
frq()
函数会打印带有值标签的频率表(如果数据已标记)。You could also use
sjmisc::add_rows()
, which usesdplyr::bind_rows()
, but unlikebind_rows()
,add_rows()
preserves attributes and hence is useful for labelled data.See following example with a labelled dataset. The
frq()
-function prints frequency tables with value labels, if the data is labelled.您可以将它们插入到原始数据库 (db1) 的末尾,添加第二个数据库的行数。 db2 中未包含的列将显示 NA 值。
db1[nrow(db1)+1:nrow(db1)+nrow(db2), 名称(db2)] <- db2
You can insert them at the end of your original database (db1) adding the number of rows of your second database. The columns that are not included in db2 will show NA values.
db1[nrow(db1)+1:nrow(db1)+nrow(db2), names(db2)] <- db2
不幸的是,最好的答案
data.table::rbindlist(x, fill=TRUE)
对我不起作用。相反,它破坏了我的数据,我只是在分析过程中才发现这一点,当时合并前有价值的一些行在合并后失去了价值。由于列数不同,使用
merge
或rbind
的其他解决方案也不起作用。所以我必须开发自己的特殊解决方案。它的两个短函数带有基数!没有其他需要。
第一个问题是我们无法将
data.frame
与不存在的列进行子集化。但如果我们解决这个问题,我们只需对这些data.frame
进行子集化,然后对结果进行rbind
即可。这是另一个 StackOverflow 答案中的清理解决方案。
一旦我们有了这个,剩下的就很容易了!
现在,让我们测试一下:
Unfortunately, the best answer
data.table::rbindlist(x, fill=TRUE)
didn't work for me. Instead, it corrupted my data, which I found out only during analysis when some rows that had value before merging lost their value after merging.Other solutions using
merge
orrbind
also didn't work due to a differing number of columns.So I had to develop my own special solution. Its two short functions with base! Nothing else required.
The first issue is that we can't subset
data.frame
with non-existing columns. But if we solve that, we can just subset thosedata.frame
s andrbind
the result.This is a cleaned solution from another StackOverflow answer.
Once we have that, the rest is easy!
Now, lets test it:
plyr
包中的rbind.fill
可能就是您正在寻找的内容。rbind.fill
from the packageplyr
might be what you are looking for.最近的解决方案是使用 dplyr 的 bind_rows 函数,我认为它比 smartbind 更有效。
A more recent solution is to use
dplyr
'sbind_rows
function which I assume is more efficient thansmartbind
.大多数基本 R 答案都解决了只有一个 data.frame 具有附加列或生成的 data.frame 将具有列交集的情况。由于OP写道我希望保留绑定后不匹配的列,因此使用基本R方法来解决此问题的答案可能值得发布。
下面,我介绍两种基本的 R 方法:一种改变原始 data.frames,另一种不改变。此外,我提供了一种将非破坏性方法推广到两个以上 data.frames 的方法。
首先,让我们获取一些示例数据。
两个数据框,更改原始数据
为了保留 rbind 中两个 data.frame 中的所有列(并允许该函数正常工作而不会导致错误),您可以将 NA 列添加到每个 data.frame 中,并填充适当的缺失名称使用
setdiff
。现在,rbind-em
请注意,前两行更改了原始 data.frames df1 和 df2,向两者添加了完整的列集。
两个data.frame,不要改变原来的
要保持原始 data.frames 完整,首先循环遍历不同的名称,返回一个 NA 的命名向量,该向量使用
c
连接到带有 data.frame 的列表中。然后,data.frame
将结果转换为rbind
的适当 data.frame。很多data.frames,不要改变原来的
如果您有两个以上的 data.frames,您可以执行以下操作。
也许看不到原始 data.frames 的行名称会更好一些?然后这样做。
Most of the base R answers address the situation where only one data.frame has additional columns or that the resulting data.frame would have the intersection of the columns. Since the OP writes I am hoping to retain the columns that do not match after the bind, an answer using base R methods to address this issue is probably worth posting.
Below, I present two base R methods: One that alters the original data.frames, and one that doesn't. Additionally, I offer a method that generalizes the non-destructive method to more than two data.frames.
First, let's get some sample data.
Two data.frames, alter originals
In order to retain all columns from both data.frames in an
rbind
(and allow the function to work without resulting in an error), you add NA columns to each data.frame with the appropriate missing names filled in usingsetdiff
.Now,
rbind
-emNote that the first two lines alter the original data.frames, df1 and df2, adding the full set of columns to both.
Two data.frames, do not alter originals
To leave the original data.frames intact, first loop through the names that differ, return a named vector of NAs that are concatenated into a list with the data.frame using
c
. Then,data.frame
converts the result into an appropriate data.frame for therbind
.Many data.frames, do not alter originals
In the instance that you have more than two data.frames, you could do the following.
Maybe a bit nicer to not see the row names of original data.frames? Then do this.
data.table
的替代方案:rbind
也可以在data.table
中工作,只要对象转换为data.table
对象,因此在这种情况下也可以工作。当您有几个 data.tables 并且不想构建列表时,这可能是更好的选择。
An alternative with
data.table
:rbind
will also work indata.table
as long as the objects are converted todata.table
objects, sowill also work in this situation. This can be preferable when you have a couple of data.tables and don't want to construct a list.
您可以使用
gtools
包中的smartbind
。例子:
You can use
smartbind
from thegtools
package.Example: