Filling missing values for each group in a column using the MissForest algorithm in Python
I have time series data for about 4000 patients that contains missing values, and I want to impute the NaN values with the MissForest algorithm in Python, separately for each patient.
The data looks like this:
| HR | Resp | P_ID |
|---|---|---|
| 72.0 | 18.0 | 1 |
| NaN | 15.0 | 1 |
| 80.0 | NaN | 1 |
| NaN | 16.0 | 1 |
| 79.5 | NaN | 1 |
| NaN | 19.0 | 2 |
| 79.5 | 22.5 | 2 |
| NaN | NaN | 2 |
| NaN | 16.0 | 2 |
| 85.0 | NaN | 3 |
| NaN | 14.5 | 3 |
| 76.4 | NaN | 3 |
| NaN | NaN | 4 |
| 80.5 | 19.5 | 4 |
| 75.3 | 18.0 | 4 |
| NaN | 21.5 | 4 |
Now I want to impute the NaN values within each patient's data, grouped by P_ID: first impute the rows with P_ID = 1, then P_ID = 2, and so on, rather than imputing over the whole column. The code I am using imputes the NaN values over the whole column across all patients, instead of handling one patient at a time.
imputer = MissForest(max_iter=12, n_jobs=-1)
X_imputed = imputer.fit_transform(df)
df1 = pd.DataFrame(X_imputed)
df1.head()
I did mean imputation within each patient using the following code, but I can't figure out how to do the same with MissForest.
for i in ['HR','Resp']:
    df[i] = df[i].fillna(df.groupby('P_ID')[i].transform('mean'))
One solution is to make 4000 dataframes, one per patient, impute each of them with MissForest, and then combine them. That would be tedious, so I would like a solution that loops over the entire dataframe. Kindly help. Thanks.
Comments (1)
You can go through the P_IDs one at a time and apply MissForest only to the values filtered for each one.
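A per-patient loop of this kind might be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's code: `impute_per_group` and `make_rf_imputer` are hypothetical names, and scikit-learn's `IterativeImputer` with a `RandomForestRegressor` is swapped in as a MissForest-like stand-in so the example does not depend on a particular MissForest package. Any object with a `fit_transform` method, such as the `MissForest(max_iter=12, n_jobs=-1)` from the question, could presumably be returned from the factory instead.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn and must be enabled first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def make_rf_imputer():
    # Assumption: a random-forest-based iterative imputer standing in for MissForest.
    return IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5,
        random_state=0,
    )


def impute_per_group(df, group_col, value_cols, make_imputer):
    """Fit a fresh imputer on each group's rows and write the result back."""
    out = df.copy()
    for _, idx in df.groupby(group_col).groups.items():
        block = df.loc[idx, value_cols]
        if not block.isna().any().any():
            continue  # nothing to impute for this group
        if block.isna().all().any():
            # A column with no observed value in this group cannot be
            # imputed from the group alone, so the group is left as-is.
            continue
        out.loc[idx, value_cols] = make_imputer().fit_transform(block.to_numpy())
    return out


# The sample data from the question.
df = pd.DataFrame({
    "HR": [72.0, np.nan, 80.0, np.nan, 79.5,
           np.nan, 79.5, np.nan, np.nan,
           85.0, np.nan, 76.4,
           np.nan, 80.5, 75.3, np.nan],
    "Resp": [18.0, 15.0, np.nan, 16.0, np.nan,
             19.0, 22.5, np.nan, 16.0,
             np.nan, 14.5, np.nan,
             np.nan, 19.5, 18.0, 21.5],
    "P_ID": [1] * 5 + [2] * 4 + [3] * 3 + [4] * 4,
})

result = impute_per_group(df, "P_ID", ["HR", "Resp"], make_rf_imputer)
print(result)
```

Because a new imputer is fitted for every `P_ID`, values from one patient never influence another patient's imputations. With only a handful of rows per patient, as in the sample, the forest has very little to learn from, so the per-patient estimates should be treated with caution.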