Replacing missing values with the group mean

Posted 2025-02-06 19:42:18

In an iris dataframe I replaced some values with missing.
I would like to replace the missing values in the column petal_length with the mean petal_length per species. The code below works (the means before and after replacing values are equal), but I suspect there is a more efficient way that does not loop over every row when values are missing in only some rows. Creating a dictionary is probably also unnecessary in a more optimised solution. Any suggestions for optimising?

using CSV
using DataFrames
using Random
using Statistics
using StatsBase

download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end

明月夜 2025-02-13 19:42:18

Is this what you want (using DataFramesMeta.jl)?

julia> @chain iris begin
           groupby(:species)
           @transform(:petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))
       end

(an in-place variant would use @transform!)
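
For readers who would rather not add DataFramesMeta.jl, the same idea can probably be written with plain DataFrames.jl. This is a minimal sketch and not part of the answer above; the toy table `df` is made up for illustration and stands in for `iris`.

```julia
using DataFrames
using Statistics

# Hypothetical toy data standing in for iris (not from the original post)
df = DataFrame(species = ["a", "a", "a", "b", "b"],
               petal_length = [1.0, missing, 3.0, 10.0, missing])

# For each species group, replace missing petal_length values with the
# group's mean of the non-missing values; transform! updates df in place
transform!(groupby(df, :species),
           :petal_length => (x -> coalesce.(x, mean(skipmissing(x)))) => :petal_length)
```

Note that `groupby` treats `missing` as its own group by default, so in the question's data the rows whose species is missing would be filled with the mean of that group.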

辞慾 2025-02-13 19:42:18

I think I found a more optimised way by using groupby.

for group in groupby(iris, :species)
    group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end
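
A quick way to sanity-check the loop, along the lines of the "means before and after are equal" test mentioned in the question. This is only a sketch on a made-up toy table `df` (an assumption for illustration, not the iris data): it records the per-species means, runs the loop, and compares.

```julia
using DataFrames
using Statistics

# Hypothetical toy data standing in for iris (not from the original post)
df = DataFrame(species = ["a", "a", "a", "b", "b"],
               petal_length = [1.0, missing, 3.0, 10.0, missing])

# Per-species means of the non-missing values before filling
before = combine(groupby(df, :species), :petal_length => mean∘skipmissing => :mean)

# Each group is a view into df, so assigning into it modifies df itself
for group in groupby(df, :species)
    group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end

# After filling there are no missing values left, so a plain mean suffices
after = combine(groupby(df, :species), :petal_length => mean => :mean)
```

Filling with the group mean leaves each group's mean unchanged, so `before.mean` and `after.mean` should agree up to floating-point rounding.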

About the author: 空心空情空意
