从Julia的另一个DF中替换DF中的列

发布于 2025-02-13 12:31:59 字数 1733 浏览 1 评论 0原文

假设我有两个dataframes df1和df2，目的

df1 = DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a          32
   3 │ a           3
   4 │ b          22
   5 │ b           5
   6 │ b           4
   7 │ c           6
   8 │ c           7
   9 │ c           8
  10 │ d           4
  11 │ d           3

df2 = DataFrame(id=["a", "a", "b", "b", "b", "c", "c", "c"],
    var=[1, 1, 2, 2, 2, 6, 6, 6])

8×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ b           2
   4 │ b           2
   5 │ b           2
   6 │ c           6
   7 │ c           6
   8 │ c           6

是为每个ID var列列>使用var的值，来自df2的每个相应的ID，仅适用df2和df1，

因此所需的结果看起来像：

DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

尝试以下内容，但它们不起作用

for d1 in groupby(df1, :id)
    replace!(d1.var .= [d2.var for d1 in groupby(df2, :id)])
end

#or

[[d1.var = d2.var for d2 in groupby(df2, :id)] for d1 in groupby(df1, :id)]

将感谢任何帮助。谢谢！

原文

Suppose I have two DataFrames df1 and df2 as follows

df1 = DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a          32
   3 │ a           3
   4 │ b          22
   5 │ b           5
   6 │ b           4
   7 │ c           6
   8 │ c           7
   9 │ c           8
  10 │ d           4
  11 │ d           3

df2 = DataFrame(id=["a", "a", "b", "b", "b", "c", "c", "c"],
    var=[1, 1, 2, 2, 2, 6, 6, 6])

8×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ b           2
   4 │ b           2
   5 │ b           2
   6 │ c           6
   7 │ c           6
   8 │ c           6

The objective is to replace the var column in df1 for each id with the value of var from df2 for each corresponding id, only for those id which exist in both df2 and df1

So the desired outcome will look like:

DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

Tried the following but they don't work

for d1 in groupby(df1, :id)
    replace!(d1.var .= [d2.var for d1 in groupby(df2, :id)])
end

#or

[[d1.var = d2.var for d2 in groupby(df2, :id)] for d1 in groupby(df1, :id)]

Will appreciate any help. Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ら栖息 2025-02-20 12:31:59

我会这样做：

julia> leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
11×3 DataFrame
 Row │ id      var    var_1
     │ String  Int64  Int64?
─────┼────────────────────────
   1 │ a           1        1
   2 │ a          32        1
   3 │ a           3        1
   4 │ b          22        2
   5 │ b           5        2
   6 │ b           4        2
   7 │ c           6        6
   8 │ c           7        6
   9 │ c           8        6
  10 │ d           4  missing
  11 │ d           3  missing

julia> select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

请注意，您的数据的问题是df2对于相同的唯一：id具有多个行，所以我运行unique在加入之前，请注意（还请注意，df1和df2中的每个组值数量不相同）。

还有其他方法可以这样做（如果您想看到它们，请评论），但是它们将涉及更多代码（通过小组等进行迭代）。我提出的解决方案依赖于DataFrames.jl API的一部分的函数。

编辑

性能比较（首次运行包括汇编时间）：

julia> repeat!(df1, 10^6);

julia> function sol1(df1, df2)
           leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
           select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
       end
sol1 (generic function with 1 method)

julia> function sol2(df1, df2)
           D = Dict(df2[!, :id] .=> df2[!, :var])
           for (i, v) in enumerate(df1[!, :id])
               if v in keys(D)
                   df1[!, :var][i] = D[v]
               end
           end
       end
sol2 (generic function with 1 method)

julia> function sol3(var1, var2, id1, id2)
           D = Dict(id2 .=> var2)
           for (i, v) in enumerate(id1)
               if v in keys(D)
                   var = D[v]
               end
           end
       end
sol3 (generic function with 3 methods)

julia> x = copy(df1);

julia> @time sol1(x, df2);
  3.103564 seconds (4.97 M allocations: 685.190 MiB, 4.84% gc time, 80.37% compilation time)

julia> x = copy(df1);

julia> @time sol1(x, df2);
  0.660479 seconds (585 allocations: 409.727 MiB, 23.24% gc time)

julia> x = copy(df1);

julia> @time sol2(x, df2);
  3.462888 seconds (55.36 M allocations: 1.495 GiB, 6.22% gc time, 3.58% compilation time)

julia> x = copy(df1);

julia> @time sol2(x, df2);
  3.382529 seconds (55.00 M allocations: 1.475 GiB, 7.24% gc time)

julia> x = copy(df1);

julia> @time sol3(x.var, df2.var, x.id, df2.id);
  0.380297 seconds (142.77 k allocations: 7.360 MiB, 9.58% compilation time)

julia> x = copy(df1);

julia> @time sol3(x.var, df2.var, x.id, df2.id);
  0.331760 seconds (5 allocations: 720 bytes)

I would do it like this:

julia> leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
11×3 DataFrame
 Row │ id      var    var_1
     │ String  Int64  Int64?
─────┼────────────────────────
   1 │ a           1        1
   2 │ a          32        1
   3 │ a           3        1
   4 │ b          22        2
   5 │ b           5        2
   6 │ b           4        2
   7 │ c           6        6
   8 │ c           7        6
   9 │ c           8        6
  10 │ d           4  missing
  11 │ d           3  missing

julia> select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

Note that the problem with your data is that df2 has multiple rows for the same unique :id so I run unique on it before joining (also note that number of values per group in df1 and df2 is not the same).

There are other ways to do it (please comment if you would like to see them), but they would involve more code (iteration through groups etc.). The solution I propose relies on the functions that are part of DataFrames.jl API.

EDIT

Performance comparison (first run includes compilation time):

julia> repeat!(df1, 10^6);

julia> function sol1(df1, df2)
           leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
           select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
       end
sol1 (generic function with 1 method)

julia> function sol2(df1, df2)
           D = Dict(df2[!, :id] .=> df2[!, :var])
           for (i, v) in enumerate(df1[!, :id])
               if v in keys(D)
                   df1[!, :var][i] = D[v]
               end
           end
       end
sol2 (generic function with 1 method)

julia> function sol3(var1, var2, id1, id2)
           D = Dict(id2 .=> var2)
           for (i, v) in enumerate(id1)
               if v in keys(D)
                   var = D[v]
               end
           end
       end
sol3 (generic function with 3 methods)

julia> x = copy(df1);

julia> @time sol1(x, df2);
  3.103564 seconds (4.97 M allocations: 685.190 MiB, 4.84% gc time, 80.37% compilation time)

julia> x = copy(df1);

julia> @time sol1(x, df2);
  0.660479 seconds (585 allocations: 409.727 MiB, 23.24% gc time)

julia> x = copy(df1);

julia> @time sol2(x, df2);
  3.462888 seconds (55.36 M allocations: 1.495 GiB, 6.22% gc time, 3.58% compilation time)

julia> x = copy(df1);

julia> @time sol2(x, df2);
  3.382529 seconds (55.00 M allocations: 1.475 GiB, 7.24% gc time)

julia> x = copy(df1);

julia> @time sol3(x.var, df2.var, x.id, df2.id);
  0.380297 seconds (142.77 k allocations: 7.360 MiB, 9.58% compilation time)

julia> x = copy(df1);

julia> @time sol3(x.var, df2.var, x.id, df2.id);
  0.331760 seconds (5 allocations: 720 bytes)

回复收藏 0 原文

笑红尘 2025-02-20 12:31:59

我相信 @bogumiłkamiński答案是如果您决定使用dataframes.jl API提供的内容，则需要解决。但是，如果您想尝试更长的时间，但更快的速度，您可能想尝试以下内容：

D = Dict(df2[!, :id] .=> df2[!, :var])

for (i, v) in enumerate(df1[!, :id])
    if v in keys(D)
        df1[!, :var][i] = D[v]
    end 
end

Julia > df1
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

I believe @BogumiłKamiński answer is the way to go if you decide to use what DataFrames.jl API has to offer. If you want to try something longer but way faster, however, you might want to try the following:

D = Dict(df2[!, :id] .=> df2[!, :var])

for (i, v) in enumerate(df1[!, :id])
    if v in keys(D)
        df1[!, :var][i] = D[v]
    end 
end

Julia > df1
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

回复收藏 0 原文

~没有更多了~