Conditional merge/replacement in R

Posted on 2024-11-09 13:58:27

I have two data frames:

df1
x1  x2
1   a
2   b
3   c
4   d

and

df2
x1  x2
2   zz
3   qq

I want to replace some of the values in df1$x2 with values from df2$x2, based on a conditional match between df1$x1 and df2$x1, to produce:

df1
x1  x2
1   a
2   zz
3   qq
4   d

Comments (8)

樱娆 2024-11-16 13:58:27

Use match(), assuming the values in df1$x1 are unique.

df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)

df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d

If the values aren't unique, use:

for (id in seq_len(nrow(df2))) {
  # update every row of df1 whose key equals the id-th key of df2
  df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
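
To illustrate why uniqueness matters, here is a small sketch with a duplicated key in df1. The data and the names first_only and all_rows are made up for illustration. match() returns only the position of the first match, so the one-line version updates a single row per key, while the loop updates every matching row.

```r
# Hypothetical data: the key 2 appears twice in df1
df1 <- data.frame(x1 = c(1, 2, 2, 3), x2 = c("a", "b", "b2", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# One-line match(): only the first row with x1 == 2 is replaced
first_only <- df1
first_only$x2[match(df2$x1, first_only$x1)] <- df2$x2

# Loop: every row with a matching key is replaced
all_rows <- df1
for (id in seq_len(nrow(df2))) {
  all_rows$x2[all_rows$x1 %in% df2$x1[id]] <- df2$x2[id]
}

print(first_only$x2)
print(all_rows$x2)
```

In first_only the row holding "b2" is left untouched, which is the situation the loop is meant to handle.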
月竹挽风 2024-11-16 13:58:27

We can use {powerjoin} and handle the conflicting columns with coalesce_yx:

library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#>   x1 x2
#> 1  1  a
#> 2  2 zz
#> 3  3 qq
#> 4  4  d
记忆里有你的影子 2024-11-16 13:58:27

The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.

You could use a data.table "update join" to modify in place, which will be quite fast:

library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]

Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:

library(dplyr)
union_all(
  inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
  anti_join(  df1, df2["x1"] )  # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order

Either of these will scale much better than the row-wise for-loop.

烟沫凡尘 2024-11-16 13:58:27

I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For readers whose columns are already factors, there is also the option of coercing to "character". There is a strategy that avoids that constraint and that also allows for indices in df2 that are not in df1, which I believe would invalidate Joris Meys' solution but not Aaron's:

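# stringsAsFactors=TRUE is set explicitly so that x2 is a factor in any R version
df1 <- data.frame(x1=1:4,x2=letters[1:4], stringsAsFactors=TRUE)
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx"), stringsAsFactors=TRUE)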

It requires that the levels be expanded to include the union of both factors' levels, and that the non-matching entries (the NA values) be dropped from match(df2$x1, df1$x1):

 df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
 df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
 df1
#-----------
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d

(Note that since R 4.0.0 the data.frame function no longer defaults to stringsAsFactors = TRUE, unlike in most of R's earlier history, so factor columns must now be requested explicitly.)
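
The coercion-to-character option mentioned at the start of this answer could be sketched roughly as below. This is only an illustration, not the answer's own method: the names new_x2, idx and hit are assumptions, and the resulting factor keeps only the levels actually in use.

```r
# Factor columns, requested explicitly; df2 has a key (5) that df1 lacks
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = TRUE)
df2 <- data.frame(x1 = c(2, 3, 5), x2 = c("zz", "qq", "xx"),
                  stringsAsFactors = TRUE)

new_x2 <- as.character(df1$x2)   # work on plain strings
idx <- match(df1$x1, df2$x1)     # NA where df1's key is absent from df2
hit <- !is.na(idx)
new_x2[hit] <- as.character(df2$x2)[idx[hit]]

df1$x2 <- factor(new_x2)         # convert back to a factor
print(df1)
```

Because the match runs from df1 to df2, the extra key 5 in df2 is simply never looked up, so no NA handling on the assignment side is needed.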

源来凯始玺欢你 2024-11-16 13:58:27

You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.

df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]

> df1
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
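
As a rough sketch of what this matching direction buys you, the same steps can be wrapped in a small helper and run on data with a duplicated key in df1 and an unmatched key in df2. The function name replace_by_key and its arguments are hypothetical, and it tests the match index rather than the swapped values for NA, which differs slightly from the code above.

```r
# Hypothetical helper: look up each key of `target` in `updates`
replace_by_key <- function(target, updates, key = "x1", value = "x2") {
  idx <- match(target[[key]], updates[[key]])  # NA where there is no update
  hit <- !is.na(idx)
  target[[value]][hit] <- updates[[value]][idx[hit]]
  target
}

# Key 2 is duplicated in df1; keys 3 and 5 in df2 have no partner in df1
df1 <- data.frame(x1 = c(1, 2, 2, 4), x2 = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1 = c(2, 3, 5), x2 = c("zz", "qq", "xx"),
                  stringsAsFactors = FALSE)

res <- replace_by_key(df1, df2)
print(res)
```

Both rows with x1 == 2 receive "zz", and the unmatched keys in df2 are ignored without any loop.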
梦中楼上月下 2024-11-16 13:58:27

It can be done with dplyr.

library(dplyr)

full_join(df1,df2,by = c("x1" = "x1")) %>% 
  transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))

  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
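
For readers without dplyr, a similar join-then-coalesce idea can be sketched in base R. Note that this swaps in merge() with all.x = TRUE, which is a left join rather than the full join used above, and that the suffixes ".old" and ".new" and the names m and out are chosen here purely for illustration.

```r
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Left join on x1; the clashing x2 columns become x2.old and x2.new
m <- merge(df1, df2, by = "x1", all.x = TRUE, suffixes = c(".old", ".new"))

# Prefer the new value where one exists, otherwise keep the old one
out <- data.frame(
  x1 = m$x1,
  x2 = ifelse(is.na(m$x2.new), m$x2.old, m$x2.new),
  stringsAsFactors = FALSE
)
print(out)
```

merge() sorts the result by the key by default, so an arbitrary original row order in df1 is not preserved.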
娇柔作态 2024-11-16 13:58:27

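New here, but the following dplyr approach seems to work as well. It is similar to, but slightly different from, one of the answers above. Note that the rows taken from df2 are appended at the end, so the original row order of df1 is not preserved.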

df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
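
The same idea can be sketched in base R, with an extra step to restore the key order. The names kept and combined are illustrative. As with the dplyr version, any row of df2 whose key is missing from df1 would be appended rather than ignored.

```r
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

# Rows of df1 with no partner in df2 (the anti-join step)
kept <- df1[!(df1$x1 %in% df2$x1), ]

# Append the replacement rows, then sort by key and reset the row names
combined <- rbind(kept, df2)
combined <- combined[order(combined$x1), ]
rownames(combined) <- NULL
print(combined)
```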
星光不落少年眉 2024-11-16 13:58:27

As of dplyr 1.0.0 there is a function specifically for this:

library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)


rows_update(df1, df2, by = "x1")

See https://stackoverflow.com/a/65254214/2738526
