Python，如何使用Sklearn中的Knnimputer并使用Groupby插入数据（填充缺失值）

发布于 2025-02-13 18:37:10 字数 1142 浏览 1 评论 0原文

我有大数据，我需要进行插补才能填写缺失的值。但是我需要使用GroupBy进行插补，以根据另一列的值填充缺失值。假设我有这张表，

data = {'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78,27, 32, 78],
        'weight': [160, 209, 130,164, 206, 132],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
        }

df = pd.DataFrame(data) df

第二天我添加了数据，但是作为零值，我最终得到了该表：

    |name   |job        |age|weight |date
|---|-------|-----------|---|-------|--------
|0  |Alex   |teacher    |27 |160    |6-12-2022
|1  |Ben    |doctor     |32 |209    |6-12-2022
|2  |Marry  |engineer   |78 |130    |6-12-2022
|3  |Alex   |teacher    |27 |164    |6-13-2022
|4  |Ben    |doctor     |32 |206    |6-13-2022
|5  |Marry  |engineer   |78 |132    |6-13-2022
|6  |Alex   |teacher    |NaN|NaN    |6-14-2022
|7  |Ben    |doctor     |NaN|NaN    |6-14-2022
|8  |Marry  |engineer   |NaN|NaN    |6-14-2022

现在，如果我想使用knnimputer或任何其他概括者填写“年龄”和“重量”列的值，当我分组“名称”时，我该怎么做？正如我提到的，数据框是一个示例，我正在处理大数据任何帮助

感谢

原文

I have big data and I need to do imputation to fill out missing values. but I need to do imputation using groupby to fill missing values based on another column's value.
let's say I have this table

data = {'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78,27, 32, 78],
        'weight': [160, 209, 130,164, 206, 132],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
        }

df = pd.DataFrame(data) df

I add data for the next day, but as null values, and I end up with this table:

    |name   |job        |age|weight |date
|---|-------|-----------|---|-------|--------
|0  |Alex   |teacher    |27 |160    |6-12-2022
|1  |Ben    |doctor     |32 |209    |6-12-2022
|2  |Marry  |engineer   |78 |130    |6-12-2022
|3  |Alex   |teacher    |27 |164    |6-13-2022
|4  |Ben    |doctor     |32 |206    |6-13-2022
|5  |Marry  |engineer   |78 |132    |6-13-2022
|6  |Alex   |teacher    |NaN|NaN    |6-14-2022
|7  |Ben    |doctor     |NaN|NaN    |6-14-2022
|8  |Marry  |engineer   |NaN|NaN    |6-14-2022

now if I want to fill out values for "age" and "weight" columns using KNNImputer or any other imputer, how can I do that while I groupby "name"?
As I mentioned the dataframe is a sample and I am dealing with big data
Appreciate any helps

Thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

缘字诀 2025-02-20 18:37:10

@mnm。您发布的问题并非直接解决。主要的开始问题是所有数值都是NAN，其余所有值都是字符串。后者对于KNN来说是不便的，因为弦之间没有直观的距离。

我将（1）使用单次编码来进行工作变量，（2）连续将几天标记为1、2和3。总的来说，这会导致以下培训和测试数据：

TRAINING
    name  teacher  doctor  engineer  age  weight  date
0   Alex        1       0         0   27     160     1
1    Ben        0       1         0   32     209     1
2  Marry        0       0         1   78     130     1
3   Alex        1       0         0   27     164     2
4    Ben        0       1         0   32     206     2
5  Marry        0       0         1   78     132     2



TEST
    name  teacher  doctor  engineer  age  weight  date
0   Alex        1       0         0  NaN     NaN     3
1    Ben        0       1         0  NaN     NaN     3
2  Marry        0       0         1  NaN     NaN     3

显然，这不是将作业和date变量重新版本的唯一方法。例如，您可以轻松确保连续几天相距仅1/365。当运行最近的邻居算法时，这本质上会减少重量差异。

下面的代码应用KNN将单个缺失值插入表中。将所有丢失的观察值算：

将“最近的邻居”下面的代码转换为函数。
循环通过数据循环测试数据，识别NAN并使用此功能插入。

我希望这会有所帮助...

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

############
### DATA ###
############
TrainingData = { 'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
                'teacher': [1,0,0,1,0,0],
                'doctor': [0,1,0,0,1,0],
                'engineer': [0,0,1,0,0,1],
                'age': [27, 32, 78,27, 32, 78],
                'weight': [160, 209, 130,164, 206, 132],
                'date': [1,1,1,2,2,2]}

TestData = {'name': ['Alex', 'Ben', 'Marry'],
            'teacher': [1,0,0],
            'doctor': [0,1,0],
            'engineer': [0,0,1],
            'age': [np.NaN,np.NaN,np.NaN],
            'weight': [np.NaN,np.NaN,np.NaN],
            'data': [3,3,3]}
# Convert to pandas dataframe
dfTraining = pd.DataFrame(TrainingData)
dfTest = pd.DataFrame(TestData)
# Print
print(dfTraining)
print(dfTest)


#########################
### NEAREST NEIGHBORS ###
#########################
Variable2Predict = 'weight'
Person2Predict = 'Marry'

#### Step 1: Determine training data
# Select rows of beloning to Person2Predict
dfSubSelectedByName = dfTraining.loc[dfTraining['name'] == Person2Predict]
# Assign dependent variable as the variable to predict
Y = dfSubSelectedByName[Variable2Predict].to_numpy()
# Nearest neighbor computation should be based on all columns in the data except 'name' and the variable to predict
X = dfSubSelectedByName[dfSubSelectedByName.columns[~dfSubSelectedByName.columns.isin(['name', Variable2Predict])]].to_numpy()
print(X)

### Step 2: Nearest neighbor prediction
K = 1
# Read row from pandas dataframe based on name
Row2Predict = dfTest.loc[dfTest['name'] == Person2Predict]
# Select all variable except for name and dependent variable
Xpred = Row2Predict[Row2Predict.columns[~Row2Predict.columns.isin(['name', Variable2Predict])]].to_numpy()

# Step 3: K-NN based on the observed variables
PresentVariables = ~np.isnan(Xpred)
NumberOfPresentVariables = np.sum(PresentVariables)
NearestNeighbor = KNeighborsRegressor(n_neighbors=K).fit(X[:, PresentVariables[0]], Y)
Yhat = NearestNeighbor.predict( np.reshape( Xpred[PresentVariables], (1,NumberOfPresentVariables) ) )
print(Yhat)

@mnm. The problem you posted is not straightforward to solve. The main starting problem is the fact that all numerical values are NaN and all remaining values are strings. The latter is rather inconvenient for KNN because there is no intuitive distance between strings.

I will (1) use a one-hot encoding for the job variable, and (2) label the consecutive days by 1, 2, and 3. Overall, this results in the following training and test data:

TRAINING
    name  teacher  doctor  engineer  age  weight  date
0   Alex        1       0         0   27     160     1
1    Ben        0       1         0   32     209     1
2  Marry        0       0         1   78     130     1
3   Alex        1       0         0   27     164     2
4    Ben        0       1         0   32     206     2
5  Marry        0       0         1   78     132     2



TEST
    name  teacher  doctor  engineer  age  weight  date
0   Alex        1       0         0  NaN     NaN     3
1    Ben        0       1         0  NaN     NaN     3
2  Marry        0       0         1  NaN     NaN     3

Clearly, this is not the only way to recode the job and date variable into numerical values. For example, you can easily ensure that consecutive days are only 1/365 apart. This essentially gives less weight to date differences when running the nearest neighbors algorithm.

The code below applies KNN to insert a single missing value into the table. To impute all missing observations:

Transform the code underneath "NEAREST NEIGHBORS" into a function.
Loop through the data the test data, identify a NaN, and impute using this function.

I hope this helps...

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

############
### DATA ###
############
TrainingData = { 'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
                'teacher': [1,0,0,1,0,0],
                'doctor': [0,1,0,0,1,0],
                'engineer': [0,0,1,0,0,1],
                'age': [27, 32, 78,27, 32, 78],
                'weight': [160, 209, 130,164, 206, 132],
                'date': [1,1,1,2,2,2]}

TestData = {'name': ['Alex', 'Ben', 'Marry'],
            'teacher': [1,0,0],
            'doctor': [0,1,0],
            'engineer': [0,0,1],
            'age': [np.NaN,np.NaN,np.NaN],
            'weight': [np.NaN,np.NaN,np.NaN],
            'data': [3,3,3]}
# Convert to pandas dataframe
dfTraining = pd.DataFrame(TrainingData)
dfTest = pd.DataFrame(TestData)
# Print
print(dfTraining)
print(dfTest)


#########################
### NEAREST NEIGHBORS ###
#########################
Variable2Predict = 'weight'
Person2Predict = 'Marry'

#### Step 1: Determine training data
# Select rows of beloning to Person2Predict
dfSubSelectedByName = dfTraining.loc[dfTraining['name'] == Person2Predict]
# Assign dependent variable as the variable to predict
Y = dfSubSelectedByName[Variable2Predict].to_numpy()
# Nearest neighbor computation should be based on all columns in the data except 'name' and the variable to predict
X = dfSubSelectedByName[dfSubSelectedByName.columns[~dfSubSelectedByName.columns.isin(['name', Variable2Predict])]].to_numpy()
print(X)

### Step 2: Nearest neighbor prediction
K = 1
# Read row from pandas dataframe based on name
Row2Predict = dfTest.loc[dfTest['name'] == Person2Predict]
# Select all variable except for name and dependent variable
Xpred = Row2Predict[Row2Predict.columns[~Row2Predict.columns.isin(['name', Variable2Predict])]].to_numpy()

# Step 3: K-NN based on the observed variables
PresentVariables = ~np.isnan(Xpred)
NumberOfPresentVariables = np.sum(PresentVariables)
NearestNeighbor = KNeighborsRegressor(n_neighbors=K).fit(X[:, PresentVariables[0]], Y)
Yhat = NearestNeighbor.predict( np.reshape( Xpred[PresentVariables], (1,NumberOfPresentVariables) ) )
print(Yhat)

回复收藏 0 原文

~没有更多了~