Random Forest classifier is not learning over generations

Published 2025-01-27 20:01:43

I am trying to use Random Forest to classify heart disease using GASearchCV from sklearn-genetic.

This is the original dataset; I have pre-processed it (using LabelEncoder, BinaryEncoder, OneHotEncoder and MaxAbsScaler). I have omitted the pre-processing code for brevity (a sketch of one plausible pipeline follows the column list below); the resulting 38 feature columns are:

'BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime', 'Smoking_0', 'Smoking_1', 'AlcoholDrinking_0', 'AlcoholDrinking_1', 'Stroke_0', 'Stroke_1', 'DiffWalking_0', 'DiffWalking_1', 'Sex_0', 'Sex_1', 'PhysicalActivity_0', 'PhysicalActivity_1', 'Asthma_0', 'Asthma_1', 'SkinCancer_0', 'SkinCancer_1', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'KidneyDisease_No', 'KidneyDisease_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)'

'HeartDisease' is the target.
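
For reference, here is a minimal sketch of one plausible pre-processing pipeline that would yield these 38 columns. The actual code was omitted from the question, so the per-column encoder choices are assumptions inferred from the column names, and pd.get_dummies stands in for OneHotEncoder:

# HYPOTHETICAL reconstruction -- the question omitted the real pre-processing code.
import pandas as pd
from category_encoders import BinaryEncoder
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler

dataset = pd.read_csv('data/heart_2020_cleaned.csv')

# Ordinal-looking column -> integer codes (LabelEncoder)
dataset['AgeCategory'] = LabelEncoder().fit_transform(dataset['AgeCategory'])

# Two-category columns -> *_0 / *_1 columns (BinaryEncoder naming convention)
binary_cols = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex',
               'PhysicalActivity', 'Asthma', 'SkinCancer']
dataset = BinaryEncoder(cols=binary_cols).fit_transform(dataset)

# Multi-category columns -> one-hot columns that keep the category names
dataset = pd.get_dummies(dataset, columns=['Race', 'KidneyDisease',
                                           'GenHealth', 'Diabetic'])

# Scale the numeric columns to [-1, 1] by their maximum absolute value
numeric_cols = ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime']
dataset[numeric_cols] = MaxAbsScaler().fit_transform(dataset[numeric_cols])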

CODE:

# Imports required to run this snippet (not shown in the original post;
# GASearchCV and the space classes come from the sklearn-genetic-opt package)
import pandas as pd
from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer

# Loading the dataset
dataset = pd.read_csv('data/heart_2020_cleaned.csv')


# Slicing the dataset to first 10000 rows to ease computations
dataset = dataset.iloc[:10000]

# Separating target from features
features = dataset.drop(columns='HeartDisease')     # X
target = dataset['HeartDisease']                    # y

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=42, stratify=target)

# Undersample the training data
sampler = RandomUnderSampler(random_state=42)
balanced_features, balanced_target = sampler.fit_resample(
    X_train, y_train)
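# NOTE: balanced_features / balanced_target are not used anywhere below;
# the search is fitted on the unbalanced X_train / y_train.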

# Classification starts here
clf = RandomForestClassifier(n_jobs=5)

start = time()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

# Using GASearchCV to search for best parameters
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4,)

# Train and optimise the estimator
evolved_estimator.fit(X_train, y_train)

end = time()
result = end - start
print('%.3f seconds' % result)

# Best parameters found
print(evolved_estimator.best_params_)

# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)

OUTPUT:

gen nevals  fitness     fitness_std fitness_max fitness_min
0   10      0.901343    1.11022e-16 0.901343    0.901343   
1   20      0.901343    1.11022e-16 0.901343    0.901343   
2   20      0.901343    1.11022e-16 0.901343    0.901343   
3   20      0.901343    1.11022e-16 0.901343    0.901343   
4   20      0.901343    1.11022e-16 0.901343    0.901343   
5   20      0.901343    1.11022e-16 0.901343    0.901343   
6   20      0.901343    1.11022e-16 0.901343    0.901343   
7   20      0.901343    1.11022e-16 0.901343    0.901343   
8   20      0.901343    1.11022e-16 0.901343    0.901343   
9   20      0.901343    1.11022e-16 0.901343    0.901343   
10  20      0.901343    1.11022e-16 0.901343    0.901343   
11  20      0.901343    1.11022e-16 0.901343    0.901343   
12  20      0.901343    1.11022e-16 0.901343    0.901343   
13  20      0.901343    1.11022e-16 0.901343    0.901343   
14  20      0.901343    1.11022e-16 0.901343    0.901343   
15  20      0.901343    1.11022e-16 0.901343    0.901343   
16  20      0.901343    1.11022e-16 0.901343    0.901343   
17  20      0.901343    1.11022e-16 0.901343    0.901343   
18  20      0.901343    1.11022e-16 0.901343    0.901343   
19  20      0.901343    1.11022e-16 0.901343    0.901343   
20  20      0.901343    1.11022e-16 0.901343    0.901343   
21  20      0.901343    1.11022e-16 0.901343    0.901343   
22  20      0.901343    1.11022e-16 0.901343    0.901343   
24  20      0.901343    1.11022e-16 0.901343    0.901343   

sklearn-genetic-opt closed prematurely. Will use the current best model.
INFO: Stopping the algorithm
235.563 seconds
{'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}
0.9015151515151515
Stats achieved in each generation:  {'gen': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24], 'fitness': [0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636], 'fitness_std': [1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16], 'fitness_max': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638], 'fitness_min': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638]}
Best k solutions:  {0: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}, 1: {'min_weight_fraction_leaf': 0.22091581038404914, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 26, 'n_estimators': 142}, 2: {'min_weight_fraction_leaf': 0.09793187151751966, 'bootstrap': True, 'max_depth': 3, 'max_leaf_nodes': 28, 'n_estimators': 177}, 3: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}}

ISSUE: The classifier does not seem to be learning over generations. What is the cause?

COMMENTS (1)

且行且努力 2025-02-03 20:01:43

There are a couple of things you can try to improve it:

  • If there is not too much data imbalance, remove the balancing; a random forest should be able to handle it better, but make sure to change the metric to something like the F1 score.
  • From the log, it looks like GASearchCV got stuck at a single point. This can change if you increase the population size (10 might be too low); you can also increase the mutation_probability parameter so it looks for more diverse solutions. See the sketch after this list.
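
As a rough illustration of both points, here is a minimal sketch, assuming the same clf, cv, param_grid, X_train and y_train as in the question; the concrete values for population_size and mutation_probability are illustrative, not tuned:

# Sketch only: reuses clf, cv, param_grid, X_train, y_train from the question.
from sklearn.metrics import f1_score, make_scorer
from sklearn_genetic import GASearchCV

# The target labels are the strings 'Yes'/'No', so the F1 scorer needs pos_label.
f1_scorer = make_scorer(f1_score, pos_label='Yes')

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring=f1_scorer,          # imbalance-aware metric instead of accuracy
                               population_size=30,         # larger population (10 may be too low)
                               generations=35,
                               mutation_probability=0.2,   # higher mutation rate to diversify candidates
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True)
evolved_estimator.fit(X_train, y_train)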

I hope it helps
