Python: how to use sklearn's KNNImputer with groupby to impute data (fill in missing values)
I have a large dataset and need to impute its missing values. However, the imputation has to be done per group (using groupby), so that missing values are filled based on another column's value.
Let's say I have this table:
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Marry', 'Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer', 'teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78, 27, 32, 78],
        'weight': [160, 209, 130, 164, 206, 132],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022', '6-13-2022', '6-13-2022', '6-13-2022']
        }
df = pd.DataFrame(data)
df
I add data for the next day, but as null values, and I end up with this table:
|   |name  |job      |age|weight|date      |
|---|------|---------|---|------|----------|
|0  |Alex  |teacher  |27 |160   |6-12-2022 |
|1  |Ben   |doctor   |32 |209   |6-12-2022 |
|2  |Marry |engineer |78 |130   |6-12-2022 |
|3  |Alex  |teacher  |27 |164   |6-13-2022 |
|4  |Ben   |doctor   |32 |206   |6-13-2022 |
|5  |Marry |engineer |78 |132   |6-13-2022 |
|6  |Alex  |teacher  |NaN|NaN   |6-14-2022 |
|7  |Ben   |doctor   |NaN|NaN   |6-14-2022 |
|8  |Marry |engineer |NaN|NaN   |6-14-2022 |
Now, if I want to fill in the "age" and "weight" columns using KNNImputer or any other imputer, how can I do that while grouping by "name"?
As I mentioned, this dataframe is only a sample; I am dealing with a much larger dataset.
I appreciate any help.
Thanks
Comments (1)
@mnm. The problem you posted is not straightforward to solve. The main difficulty is that, in the rows to be imputed, all numerical values are NaN and all remaining values are strings. The latter is rather inconvenient for KNN because there is no intuitive distance between strings.
I will (1) use a one-hot encoding for the job variable, and (2) label the consecutive days by 1, 2, and 3. Overall, this results in the following training and test data:
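The original training and test tables are not shown here, so the following is only a minimal sketch of how such an encoding could be built with pandas. The column names `job_*` and `day`, and the split into `train` and `test` by whether `age` is present, are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample table from the question, including the three rows with missing values.
data = {
    "name": ["Alex", "Ben", "Marry"] * 3,
    "job": ["teacher", "doctor", "engineer"] * 3,
    "age": [27, 32, 78, 27, 32, 78, np.nan, np.nan, np.nan],
    "weight": [160, 209, 130, 164, 206, 132, np.nan, np.nan, np.nan],
    "date": ["6-12-2022"] * 3 + ["6-13-2022"] * 3 + ["6-14-2022"] * 3,
}
df = pd.DataFrame(data)

# (1) One-hot encode the job column (one 0/1 column per job).
job_dummies = pd.get_dummies(df["job"], prefix="job", dtype=float)

# (2) Label consecutive days 1, 2, 3 by counting days from the earliest date.
dates = pd.to_datetime(df["date"], format="%m-%d-%Y")
day = ((dates - dates.min()).dt.days + 1).rename("day")

encoded = pd.concat([job_dummies, day, df[["age", "weight"]]], axis=1)

# Rows with observed values act as training data, rows with NaN as test data.
train = encoded[encoded["age"].notna()]
test = encoded[encoded["age"].isna()]
```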
Clearly, this is not the only way to recode the job and date variables into numerical values. For example, you can easily ensure that consecutive days are only 1/365 apart. This essentially gives less weight to date differences when running the nearest neighbors algorithm.
The code below applies KNN to insert a single missing value into the table. To impute all missing observations:
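As a sketch only (not the code originally posted with this answer), imputing all missing observations at once could look like this with scikit-learn's `KNNImputer`. It assumes the one-hot job encoding and the 1/365 day spacing described above; `n_neighbors=2` and the column names `job_*` and `day` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = {
    "name": ["Alex", "Ben", "Marry"] * 3,
    "job": ["teacher", "doctor", "engineer"] * 3,
    "age": [27, 32, 78, 27, 32, 78, np.nan, np.nan, np.nan],
    "weight": [160, 209, 130, 164, 206, 132, np.nan, np.nan, np.nan],
    "date": ["6-12-2022"] * 3 + ["6-13-2022"] * 3 + ["6-14-2022"] * 3,
}
df = pd.DataFrame(data)

# Numeric feature matrix: one-hot job columns plus a day index.
features = pd.get_dummies(df["job"], prefix="job", dtype=float)
dates = pd.to_datetime(df["date"], format="%m-%d-%Y")
# Assumption: consecutive days are 1/365 apart, so that a different job
# counts for far more than a different date when measuring distance.
features["day"] = (dates - dates.min()).dt.days / 365.0
features = features.join(df[["age", "weight"]])

# KNNImputer measures distance on the non-missing columns only, so the
# rows with NaN age/weight are matched on job and day.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

df["age"] = imputed["age"]
df["weight"] = imputed["weight"]
```

With this scaling, the two nearest neighbors of each incomplete row are the two earlier rows with the same job, so each missing value becomes the average of that person's two observed values. With the days labeled 1, 2, 3 instead, a row from another person on the previous day can end up closer than the same person's row from two days earlier, which is why the scaling matters here.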
I hope this helps...
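Since the question asks specifically about grouping by "name", a possible alternative is to fit a separate `KNNImputer` inside each group, so that a person's missing values are filled only from that person's own rows. This is a rough sketch under my own assumptions: the helper `impute_group`, the `day` column and `n_neighbors=2` are hypothetical, and every group must have at least one observed value per column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = {
    "name": ["Alex", "Ben", "Marry"] * 3,
    "job": ["teacher", "doctor", "engineer"] * 3,
    "age": [27, 32, 78, 27, 32, 78, np.nan, np.nan, np.nan],
    "weight": [160, 209, 130, 164, 206, 132, np.nan, np.nan, np.nan],
    "date": ["6-12-2022"] * 3 + ["6-13-2022"] * 3 + ["6-14-2022"] * 3,
}
df = pd.DataFrame(data)

# Numeric day index so that KNN can measure how far apart two rows are in time.
dates = pd.to_datetime(df["date"], format="%m-%d-%Y")
df["day"] = (dates - dates.min()).dt.days

num_cols = ["day", "age", "weight"]


def impute_group(group: pd.DataFrame) -> pd.DataFrame:
    """Fit a KNNImputer on one person's rows only (hypothetical helper)."""
    # Caveat: a column that is entirely NaN within a group is dropped by
    # KNNImputer by default, which would break the column labels below.
    imputer = KNNImputer(n_neighbors=2)
    filled = imputer.fit_transform(group[num_cols])
    return pd.DataFrame(filled, columns=num_cols, index=group.index)


# Impute each name separately, then restore the original row order.
parts = [impute_group(group) for _, group in df.groupby("name")]
filled = pd.concat(parts).sort_index()

df["age"] = filled["age"]
df["weight"] = filled["weight"]
```

On large data this fits one small imputer per name, which avoids the string-encoding problem altogether but can be slow when there are very many groups.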