Random Forest classifier is not learning over generations
I am trying to use Random Forest to classify heart disease using GASearchCV from sklearn-genetic.
This is the original dataset; I have pre-processed it (using LabelEncoder, BinaryEncoder, OneHotEncoder and MaxAbsScaler). I have omitted the pre-processing code for brevity, but the resulting 38 feature columns are:
'BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime', 'Smoking_0', 'Smoking_1', 'AlcoholDrinking_0', 'AlcoholDrinking_1', 'Stroke_0', 'Stroke_1', 'DiffWalking_0', 'DiffWalking_1', 'Sex_0', 'Sex_1', 'PhysicalActivity_0', 'PhysicalActivity_1', 'Asthma_0', 'Asthma_1', 'SkinCancer_0', 'SkinCancer_1', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'KidneyDisease_No', 'KidneyDisease_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)'
'HeartDisease' is the target.
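Since the pre-processing code was omitted, here is a minimal sketch of one way it could look. This is an assumption, not the original code: pandas.get_dummies stands in for the BinaryEncoder/OneHotEncoder steps, and the binary Yes/No columns are label-encoded first so the indicator columns come out as e.g. Smoking_0/Smoking_1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler

dataset = pd.read_csv('data/heart_2020_cleaned.csv').iloc[:10000]

# Ordinal label encoding for the age brackets (kept as one numeric column)
dataset['AgeCategory'] = LabelEncoder().fit_transform(dataset['AgeCategory'])

# Encode Yes/No columns to 0/1 first so get_dummies yields Smoking_0, Smoking_1, ...
binary = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex',
          'PhysicalActivity', 'Asthma', 'SkinCancer']
for col in binary:
    dataset[col] = LabelEncoder().fit_transform(dataset[col])

# Indicator columns for the remaining categoricals (Race_White, GenHealth_Poor, ...)
dataset = pd.get_dummies(dataset, columns=binary + ['Race', 'KidneyDisease',
                                                    'GenHealth', 'Diabetic'])

# Scale the numeric columns to [-1, 1]
numeric = ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime']
dataset[numeric] = MaxAbsScaler().fit_transform(dataset[numeric])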
CODE:
# Imports (added for completeness; requires scikit-learn, imbalanced-learn
# and sklearn-genetic-opt)
from time import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from imblearn.under_sampling import RandomUnderSampler
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous, Integer

# Loading the dataset
dataset = pd.read_csv('data/heart_2020_cleaned.csv')
# Slicing the dataset to first 10000 rows to ease computations
dataset = dataset.iloc[:10000]
# Separating target from features
features = dataset.drop(columns='HeartDisease')  # X
target = dataset['HeartDisease']  # y
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42, stratify=target)
# Undersample the training data (note: the resampled data is not used below;
# the search is fitted on the original, imbalanced X_train/y_train)
sampler = RandomUnderSampler(random_state=42)
balanced_features, balanced_target = sampler.fit_resample(X_train, y_train)
# Classification starts here
clf = RandomForestClassifier(n_jobs=5)
start = time()
param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}
cv = StratifiedKFold(n_splits=3, shuffle=True)
# Using GASearchCV to search for best parameters
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)
# Train and optimise the estimator
evolved_estimator.fit(X_train, y_train)
end = time()
result = end - start
print('%.3f seconds' % result)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))
# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)
OUTPUT:
gen nevals fitness fitness_std fitness_max fitness_min
0 10 0.901343 1.11022e-16 0.901343 0.901343
1 20 0.901343 1.11022e-16 0.901343 0.901343
2 20 0.901343 1.11022e-16 0.901343 0.901343
3 20 0.901343 1.11022e-16 0.901343 0.901343
4 20 0.901343 1.11022e-16 0.901343 0.901343
5 20 0.901343 1.11022e-16 0.901343 0.901343
6 20 0.901343 1.11022e-16 0.901343 0.901343
7 20 0.901343 1.11022e-16 0.901343 0.901343
8 20 0.901343 1.11022e-16 0.901343 0.901343
9 20 0.901343 1.11022e-16 0.901343 0.901343
10 20 0.901343 1.11022e-16 0.901343 0.901343
11 20 0.901343 1.11022e-16 0.901343 0.901343
12 20 0.901343 1.11022e-16 0.901343 0.901343
13 20 0.901343 1.11022e-16 0.901343 0.901343
14 20 0.901343 1.11022e-16 0.901343 0.901343
15 20 0.901343 1.11022e-16 0.901343 0.901343
16 20 0.901343 1.11022e-16 0.901343 0.901343
17 20 0.901343 1.11022e-16 0.901343 0.901343
18 20 0.901343 1.11022e-16 0.901343 0.901343
19 20 0.901343 1.11022e-16 0.901343 0.901343
20 20 0.901343 1.11022e-16 0.901343 0.901343
21 20 0.901343 1.11022e-16 0.901343 0.901343
22 20 0.901343 1.11022e-16 0.901343 0.901343
24 20 0.901343 1.11022e-16 0.901343 0.901343
sklearn-genetic-opt closed prematurely. Will use the current best model.
INFO: Stopping the algorithm
235.563 seconds
{'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}
0.9015151515151515
Stats achieved in each generation: {'gen': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24], 'fitness': [0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636], 'fitness_std': [1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16], 'fitness_max': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638], 'fitness_min': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638]}
Best k solutions: {0: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}, 1: {'min_weight_fraction_leaf': 0.22091581038404914, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 26, 'n_estimators': 142}, 2: {'min_weight_fraction_leaf': 0.09793187151751966, 'bootstrap': True, 'max_depth': 3, 'max_leaf_nodes': 28, 'n_estimators': 177}, 3: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}}
ISSUE: The classifier does not seem to be learning over generations. What is the cause?
ANSWER:
There are a couple of things you can try to improve it:
It looks like GASearchCV is stuck at a single solution; this can be addressed by increasing the population size (10 is quite small), as in the sketch below.
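A minimal sketch of that change, assuming the clf, cv and param_grid objects from the question; population_size, mutation_probability and crossover_probability are GASearchCV constructor arguments in sklearn-genetic-opt, and the values below are illustrative, not tuned recommendations:

# Re-run the search with a larger, more diverse population
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=50,        # was 10: more candidates per generation
                               generations=35,
                               mutation_probability=0.2,  # inject extra variation between generations
                               crossover_probability=0.8,
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)
evolved_estimator.fit(X_train, y_train)

If fitness actually starts to vary between generations, evolved_estimator.history (which you already print) will show it.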
I hope it helps!