Why do I get different results from the RandomizedSearchCV estimator and a new estimator created with the best parameters?
I have experienced unexpected behaviour with the estimator from RandomizedSearchCV:
I am searching for the best parameters for a random forest. When I determine the accuracy with the resulting best estimator, I get different results compared to training a new random forest with the best parameters from the randomized search. Why is that?
Here is a code example for the RandomizedSearchCV (with just very few iterations):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

n_estimators = np.linspace(start=100, stop=2500, num=11, dtype=int)
max_features = ['sqrt', None, 0.2, 0.4]
max_depth = [10, 20, 50, 75, 100, 125, 150]
min_samples_split = [2, 5, 8, 11]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
criterion = ['gini', 'entropy']

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}

rf_base = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf_base, param_distributions=random_grid,
                               n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf_random.fit(training_features, training_labels)
print('The best estimator: ', rf_random.best_estimator_)
print('The best score: ', rf_random.best_score_)
print('Training Accuracy: ', rf_random.score(training_features, training_labels))
print('Test Accuracy: ', rf_random.score(test_features, test_labels))
This returns, for example, the parameters n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False, and a test accuracy of 0.8417.
But when I train a new model with these parameters, I get, for example, a test accuracy of 0.8339. The code looks like this:
from sklearn.ensemble import RandomForestClassifier

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf = RandomForestClassifier()
# note: bootstrap must be the boolean False, not the string 'False'
rf.set_params(n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
              max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False)
rf.fit(training_features, training_labels)
print('Training Accuracy: ', rf.score(training_features, training_labels))
print('Test Accuracy: ', rf.score(test_features, test_labels))
The solution is to set random_state to the same value in both cases (it was missing for the new estimator). A RandomForestClassifier without a fixed random_state draws different bootstrap samples and feature subsets on every fit, so two runs with identical hyperparameters can produce slightly different accuracies.
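A minimal sketch of this effect, using a synthetic dataset in place of the asker's gd.get_data (which is not available here) and a smaller forest for speed: two forests built with the same hyperparameters agree exactly only when they share a random_state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters in the style of the question (smaller n_estimators for speed)
params = dict(n_estimators=100, max_depth=20, max_features=0.2,
              min_samples_split=11, min_samples_leaf=2, criterion='entropy')

# Same seed -> identical forests -> identical scores
a = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
b = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
assert a.score(X_test, y_test) == b.score(X_test, y_test)
```

Without the shared random_state=42, the two scores typically differ by a small amount, which is exactly the 0.8417 vs. 0.8339 gap seen in the question.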