Replace missing values with the group mean

Posted on 2025-02-06 19:42:18

In the iris data frame I replaced some values with missing.
I would like to replace the missing values in the column petal_length with the mean petal_length of the corresponding species. The code below works (the means before and after replacing the values are equal), but I suspect there is a more efficient way that does not loop over every row when only some rows have missing values. Creating a dictionary is probably also unnecessary in a more optimised solution. Any suggestions for optimising it?

using CSV
using DataFrames
using Random
using Statistics
using StatsBase

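# Fetch the iris data and allow missing values in every column
download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

# Set 100 randomly chosen cells in the four numeric columns to missing
Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

# Set the species of 10 randomly sampled rows to missing
Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

# Mean petal_length per species (missing values skipped), stored in a lookup dictionary
mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

# Replace each missing petal_length with the mean of that row's species
for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end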
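
The claim that the means are equal before and after the replacement can be checked directly. The following is only a minimal sketch on a small made-up table (the values and the helper name group_mean are assumptions for illustration, not part of the iris data), using the same loop-and-dictionary approach as above:

```julia
using DataFrames
using Statistics

# Toy stand-in for iris; the values are made up for illustration.
df = DataFrame(
    species = ["a", "a", "a", "b", "b", "b"],
    petal_length = [1.0, missing, 3.0, 4.0, 6.0, missing],
)

# Hypothetical helper: mean of a column, ignoring missing values.
group_mean(x) = mean(skipmissing(x))

# Per-group means before imputation.
before = combine(groupby(df, :species), :petal_length => group_mean => :mean)

# Impute with a dictionary lookup, as in the question.
means = Dict(before.species .=> before.mean)
for row in eachrow(df)
    if ismissing(row.petal_length)
        row.petal_length = means[row.species]
    end
end

# Per-group means after imputation.
after = combine(groupby(df, :species), :petal_length => group_mean => :mean)

# True when the group means agree up to floating-point rounding.
means_unchanged = all(isapprox.(before.mean, after.mean))
```

Filling a group's gaps with that group's own mean leaves the group mean unchanged, which is why this check is expected to pass.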


Comments (2)

明月夜 2025-02-13 19:42:18

Is this what you want (using DataFramesMeta.jl)?

julia> @chain iris begin
           groupby(:species)
           @transform(:petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))
       end

(an in-place variant would use @transform!)
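
For comparison, the same idea can probably be written with plain DataFrames.jl, without the DataFramesMeta.jl macros. This is a sketch on made-up toy data, and the helper name fill_with_mean is an assumption introduced here for illustration:

```julia
using DataFrames
using Statistics

# Toy stand-in for iris; the values are made up for illustration.
df = DataFrame(
    species = ["a", "a", "a", "b", "b", "b"],
    petal_length = [1.0, missing, 3.0, 4.0, 6.0, missing],
)

# Hypothetical helper: replace missing entries of a vector
# with the mean of its non-missing entries.
fill_with_mean(x) = coalesce.(x, mean(skipmissing(x)))

# Apply the helper to each species group, in place;
# the rows of df keep their original order.
transform!(groupby(df, :species), :petal_length => fill_with_mean => :petal_length)
```

Because the function receives the whole petal_length vector of one group, the mean is computed once per group rather than once per row, and no dictionary is needed.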

辞慾 2025-02-13 19:42:18

I think I found a more optimised way by using groupby.

for group in groupby(iris, :species)
    group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end