Conditional merge/replacement in R
I have two data frames:
df1
x1 x2
1 a
2 b
3 c
4 d
and
df2
x1 x2
2 zz
3 qq
I want to replace some of the values in df1$x2 with values in df2$x2 based on the conditional match between df1$x1 and df2$x1 to produce:
df1
x1 x2
1 a
2 zz
3 qq
4 d
Use match(), assuming values in df1 are unique. If the values aren't unique, use a row-wise for-loop over df2 instead.
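The code from this answer did not survive on this page, so the following is only a minimal sketch of what the two cases could look like in base R. The data frame dup and the loop variable i are illustrative names, not taken from the original answer.

```r
# Example data from the question; stringsAsFactors = FALSE keeps x2 as character
df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Unique keys: locate the row of df1 for each key in df2, then overwrite x2 there.
# This assumes every key in df2 also occurs in df1; otherwise match() returns NA
# and the assignment fails.
unique_case <- df1
unique_case$x2[match(df2$x1, unique_case$x1)] <- df2$x2

# Non-unique keys (hypothetical data): loop over the rows of df2 and overwrite
# every row of dup that carries the same key
dup <- data.frame(x1 = c(1, 2, 2, 3, 4),
                  x2 = c("a", "b", "b2", "c", "d"),
                  stringsAsFactors = FALSE)
for (i in seq_len(nrow(df2))) {
  dup$x2[dup$x1 == df2$x1[i]] <- df2$x2[i]
}
```

The match() version is a single vectorized assignment, while the loop does one pass over df1 per row of df2.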
We can use {powerjoin} and handle the conflicting columns with coalesce_yx.
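The original snippet is missing here. A minimal sketch, assuming the powerjoin package is installed and that a left join on x1 is what the answer intended:

```r
library(powerjoin)

df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Both inputs have a column x2; coalesce_yx resolves the conflict by taking
# the value from df2 (y) where it exists and falling back to df1 (x) otherwise
res <- power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
```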
The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.
You could use a data.table "update join" to modify in place, which will be quite fast. Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr. Either of these will scale much better than the row-wise for-loop.
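The two snippets from this answer are missing on this page. The sketch below shows one way each idea could be written, assuming data.table and dplyr are installed. The names dt1, dt2, and res are illustrative, and the dplyr version is an assumption about what "SQL-inspired" meant (matched rows from an inner join, stacked on the unmatched rows from an anti join).

```r
library(data.table)
library(dplyr)

df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# data.table update join: for rows of dt1 whose x1 matches dt2, overwrite x2
# by reference with the value from dt2 (the i. prefix refers to dt2's column)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt1[dt2, on = "x1", x2 := i.x2]

# dplyr: take the replaced rows from an inner join, then append the rows of
# df1 that have no match in df2. Row order of df1 is not preserved.
matched   <- select(inner_join(df1, df2, by = "x1"), x1, x2 = x2.y)
unmatched <- anti_join(df1, df2, by = "x1")
res <- bind_rows(matched, unmatched)
```

The data.table version changes dt1 in place without copying it, which is where the speed on large tables comes from.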
I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors, there would also be the option of coercion to "character". There is a strategy that avoids that constraint and which also allows for the possibility that there may be indices in df2 that are not in df1, which I believe would invalidate Joris Meys' but not Aaron's solutions posted so far. It requires that the levels be expanded to include the union of the levels of both factor variables, and then also the need to drop the non-matching entries (NA values) in match(df1$x1, df2$x1).
(Note that recent versions of R do not have stringsAsFactors set to TRUE in the data.frame function defaults, unlike most of the history of R.)
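The code for the factor-based strategy is not preserved here. A minimal sketch of how it might look, assuming df2 carries an extra key (5) that is absent from df1, with idx and hit as illustrative names:

```r
# Factor columns, with one key in df2 (5) that does not occur in df1
df1 <- data.frame(x1 = 1:4, x2 = factor(c("a", "b", "c", "d")))
df2 <- data.frame(x1 = c(2, 3, 5), x2 = factor(c("zz", "qq", "xx")))

# Expand the levels of df1$x2 so that the incoming values are valid levels
levels(df1$x2) <- union(levels(df1$x2), levels(df2$x2))

# Position in df2 of each key of df1; NA where df1 has no counterpart in df2
idx <- match(df1$x1, df2$x1)
hit <- !is.na(idx)

# Replace only the matched rows, dropping the NA positions
df1$x2[hit] <- as.character(df2$x2[idx[hit]])
```

Levels that end up unused (here "b", "c", and "xx") can be removed afterwards with droplevels() if they are not wanted.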
You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.
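The snippet for matching in the other direction is missing on this page. One way it could be written in base R, as a sketch with character columns (idx is an illustrative name):

```r
df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Look up each key of df1 in df2 (rather than df2 in df1); NA means no match
idx <- match(df1$x1, df2$x1)

# Keep the old value where there is no match, otherwise take the value from df2
df1$x2 <- ifelse(is.na(idx), df1$x2, df2$x2[idx])
```

Matching in this direction handles repeated keys in df1 and tolerates keys in df2 that never appear in df1.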
It can be done with dplyr.
New here, but the following dplyr approach seems to work as well. It is similar to, but slightly different from, one of the answers above.
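Neither dplyr snippet survived on this page, so the exact code these answers used is unknown. A common pattern that fits the description is a left join followed by coalesce(). The sketch below is an assumption, not the original code.

```r
library(dplyr)

df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# The left join keeps every row of df1; the clashing x2 columns get the
# default suffixes: x2.x (from df1) and x2.y (from df2, NA where unmatched)
joined <- left_join(df1, df2, by = "x1")

# Prefer the value from df2 and fall back to the original value from df1
res <- select(mutate(joined, x2 = coalesce(x2.y, x2.x)), x1, x2)
```

Unlike the inner/anti join variant, this keeps the row order of df1.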
As of dplyr 1.0.0 there is a function specifically for this:
See https://stackoverflow.com/a/65254214/2738526
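The function name did not survive on this page. It is presumably rows_update(), which was added in dplyr 1.0.0. A minimal sketch under that assumption:

```r
library(dplyr)

df1 <- data.frame(x1 = 1:4, x2 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Overwrite the values in df1 with those from df2 for rows whose key x1 matches.
# By default every key in df2 must already exist in df1, and the keys in df2
# must be unique.
res <- rows_update(df1, df2, by = "x1")
```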