Replacing missing values with the group mean
In an `iris` dataframe I replaced some values with `missing`. I would like to replace the `missing` values in the column `petal_length` with the mean `petal_length` per `species`. The code below works (the means before and after replacing the values are equal), but I suspect there is a more efficient way that does not loop through every row when only some rows have missing values. Creating a dictionary is probably also unnecessary in a more optimised solution. Any suggestions for optimising?
using CSV
using DataFrames
using Random
using Statistics
using StatsBase

# Download the iris data and load it into a DataFrame
download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

# Set randomly chosen cells in the four numeric columns to missing
Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

# Set the species of up to 10 randomly sampled rows to missing
Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

# Mean petal_length per species (rows with a missing species form their own group)
mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

# Fill each missing petal_length with the mean of that row's species
for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end
Comments (2)
Is this what you want (using DataFramesMeta.jl)? An in-place variant would use `@transform!`.

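The code that accompanied this suggestion is not part of this copy, so the following is only a minimal sketch of how a DataFramesMeta.jl approach could look. It assumes a small hand-made table (the name `df` and its values are hypothetical stand-ins for the iris data) and fills each missing `petal_length` with the mean of its `species` group via `coalesce`.

```julia
using DataFrames, DataFramesMeta, Statistics

# Hypothetical toy data standing in for the iris table
df = DataFrame(species = ["a", "a", "a", "b", "b", "b"],
               petal_length = [1.0, missing, 3.0, 4.0, 6.0, missing])

# Non-mutating variant: within each species group, coalesce keeps observed
# values and substitutes the group mean where the value is missing
filled = @transform(groupby(df, :species),
                    :petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))

# In-place variant: the same expression with @transform! updates the parent table
df2 = copy(df)
@transform!(groupby(df2, :species),
            :petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))
```

Neither variant needs a lookup dictionary or an explicit row loop, because the mean is computed and applied once per group.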
I think I found a more optimised way by using `groupby`.
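The code for this `groupby` approach is likewise not included here. One way it might be written with plain DataFrames.jl is sketched below, again on a hypothetical toy table; the helper name `fill_with_mean` is made up for illustration and is not taken from the original post.

```julia
using DataFrames, Statistics

# Hypothetical toy data standing in for the iris table
df = DataFrame(species = ["a", "a", "a", "b", "b", "b"],
               petal_length = [1.0, missing, 3.0, 4.0, 6.0, missing])

# Replace missing entries of a vector with the mean of its non-missing entries
fill_with_mean(x) = coalesce.(x, mean(skipmissing(x)))

# Apply the helper once per species group; transform keeps the original row order
filled = transform(groupby(df, :species),
                   :petal_length => fill_with_mean => :petal_length)
```

Here `transform` returns a new table; `transform!` with the same arguments would update `df` in place instead.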