
Posted on 2025-01-27 20:01:43


I am trying to use a Random Forest to classify heart disease, tuning it with GASearchCV from sklearn-genetic-opt.

This is the original dataset; I have pre-processed it (using LabelEncoder, BinaryEncoder, OneHotEncoder and MaxAbsScaler). I have omitted the pre-processing code for brevity, but the resulting 38 feature columns are:

'BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime', 'Smoking_0', 'Smoking_1', 'AlcoholDrinking_0', 'AlcoholDrinking_1', 'Stroke_0', 'Stroke_1', 'DiffWalking_0', 'DiffWalking_1', 'Sex_0', 'Sex_1', 'PhysicalActivity_0', 'PhysicalActivity_1', 'Asthma_0', 'Asthma_1', 'SkinCancer_0', 'SkinCancer_1', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'KidneyDisease_No', 'KidneyDisease_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)'

'HeartDisease' is the target.


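# Imports needed by the snippet below
from time import time

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous, Integer

# Loading the dataset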
dataset = pd.read_csv('data/heart_2020_cleaned.csv')

# Slicing the dataset to first 10000 rows to ease computations
dataset = dataset.iloc[:10000]

# Separating target from features
features = dataset.drop(columns='HeartDisease')     # X
target = dataset['HeartDisease']                    # y

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=42, stratify=target)

# Undersample the training data
sampler = RandomUnderSampler(random_state=42)
balanced_features, balanced_target = sampler.fit_resample(
    X_train, y_train)

# Classification starts here
clf = RandomForestClassifier(n_jobs=5)

start = time()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

# Using GASearchCV to search for best parameters
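# NOTE: the call was cut off after "estimator=clf," in the post as published.
# cv and param_grid are defined above; population_size=10 and keep_top_k=4 are
# inferred from the log below (10 evaluations in generation 0, four "best k"
# solutions). Any other arguments that were passed are unknown.
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               param_grid=param_grid,
                               population_size=10,
                               keep_top_k=4)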

# Train and optimise the estimator
evolved_estimator.fit(X_train, y_train)

end = time()
result = end - start
print('%.3f seconds' % result)

# Best parameters found
print(evolved_estimator.best_params_)

# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)


gen nevals  fitness     fitness_std fitness_max fitness_min
0   10      0.901343    1.11022e-16 0.901343    0.901343   
1   20      0.901343    1.11022e-16 0.901343    0.901343   
2   20      0.901343    1.11022e-16 0.901343    0.901343   
3   20      0.901343    1.11022e-16 0.901343    0.901343   
4   20      0.901343    1.11022e-16 0.901343    0.901343   
5   20      0.901343    1.11022e-16 0.901343    0.901343   
6   20      0.901343    1.11022e-16 0.901343    0.901343   
7   20      0.901343    1.11022e-16 0.901343    0.901343   
8   20      0.901343    1.11022e-16 0.901343    0.901343   
9   20      0.901343    1.11022e-16 0.901343    0.901343   
10  20      0.901343    1.11022e-16 0.901343    0.901343   
11  20      0.901343    1.11022e-16 0.901343    0.901343   
12  20      0.901343    1.11022e-16 0.901343    0.901343   
13  20      0.901343    1.11022e-16 0.901343    0.901343   
14  20      0.901343    1.11022e-16 0.901343    0.901343   
15  20      0.901343    1.11022e-16 0.901343    0.901343   
16  20      0.901343    1.11022e-16 0.901343    0.901343   
17  20      0.901343    1.11022e-16 0.901343    0.901343   
18  20      0.901343    1.11022e-16 0.901343    0.901343   
19  20      0.901343    1.11022e-16 0.901343    0.901343   
20  20      0.901343    1.11022e-16 0.901343    0.901343   
21  20      0.901343    1.11022e-16 0.901343    0.901343   
22  20      0.901343    1.11022e-16 0.901343    0.901343   
24  20      0.901343    1.11022e-16 0.901343    0.901343   

sklearn-genetic-opt closed prematurely. Will use the current best model.
INFO: Stopping the algorithm
235.563 seconds
{'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}
Stats achieved in each generation:  {'gen': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24], 'fitness': [0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636], 'fitness_std': [1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16], 'fitness_max': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638], 'fitness_min': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 
0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638]}
Best k solutions:  {0: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}, 1: {'min_weight_fraction_leaf': 0.22091581038404914, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 26, 'n_estimators': 142}, 2: {'min_weight_fraction_leaf': 0.09793187151751966, 'bootstrap': True, 'max_depth': 3, 'max_leaf_nodes': 28, 'n_estimators': 177}, 3: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}}

ISSUE: The classifier does not seem to be learning over generations. What is the cause?



且行且努力 2025-02-03 20:01:43



There are a couple of things you can try to improve it:

  • If the data is not too imbalanced, remove the balancing step; a random forest should be able to handle it better. Make sure, though, to change the metric to something like the F1 score.
  • Judging by the log, GASearchCV appears to be stuck at a single point. Increasing the population size can change this (10 might be too low); you can also increase the mutation_probability parameter so that it looks for more diverse solutions.
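As a rough sketch of both suggestions, the settings below switch the metric to F1 and widen the search. The specific values and the helper name build_search are illustrative assumptions, not tuned recommendations, and scoring="f1" assumes the target is encoded as 0/1 with 1 meaning heart disease. The two probabilities are kept at a sum of at most 1, which the DEAP routine behind the default eaMuPlusLambda algorithm requires.

```python
# Hypothetical search settings; every value here is illustrative only.
SEARCH_KWARGS = {
    "scoring": "f1",              # assumes a 0/1 target with 1 as the positive class
    "population_size": 30,        # larger than the 10 used in the question
    "generations": 20,
    "mutation_probability": 0.3,  # higher, to encourage more diverse candidates
    "crossover_probability": 0.6, # crossover + mutation stays <= 1
    "keep_top_k": 4,
}


def build_search(estimator, cv, param_grid):
    """Return a GASearchCV configured with SEARCH_KWARGS (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without the library.
    from sklearn_genetic import GASearchCV

    return GASearchCV(estimator=estimator, cv=cv, param_grid=param_grid,
                      **SEARCH_KWARGS)
```

With the names from the question, build_search(clf, cv, param_grid).fit(X_train, y_train) would take the place of the original search.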

I hope it helps
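To see why the metric matters, here is a small, dependency-free illustration using made-up labels (90 negatives and 10 positives, not the real data). A fitness that is identical in every generation with a standard deviation near zero would be consistent with every candidate forest predicting only the majority class, although the post does not confirm that this is what happened. In that situation accuracy looks good while F1 exposes the problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 score for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up, imbalanced labels: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))  # 0.9
print(f1(y_true, always_negative))        # 0.0
```

If the accuracy on the real test set matches the share of the majority class, the search is probably scoring many different parameter sets identically, and a metric such as F1 gives it something to improve.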
