Why do I get different results from the RandomizedSearchCV estimator and a new estimator created with the best parameters?
I have experienced unexpected behaviour with the estimator from RandomizedSearchCV:
I am searching for the best parameters for a random forest. When I determine the accuracy with the resulting best estimator, I get different results compared to training a new random forest with the best parameters from the randomized search. Why is that?
Here is a code example for the RandomizedSearchCV (with just very few iterations):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

n_estimators = np.linspace(start=100, stop=2500, num=11, dtype=int)
max_features = ['sqrt', None, 0.2, 0.4]
max_depth = [10, 20, 50, 75, 100, 125, 150]
min_samples_split = [2, 5, 8, 11]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
criterion = ['gini', 'entropy']

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}

rf_base = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf_base, param_distributions=random_grid,
                               n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf_random.fit(training_features, training_labels)
print('The best estimator: ', rf_random.best_estimator_)
print('The best score: ', rf_random.best_score_)
print('Training Accuracy: ', rf_random.score(training_features, training_labels))
print('Test Accuracy: ', rf_random.score(test_features, test_labels))
This returns, for example, the parameters n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False, and a test accuracy of 0.8417.
But when I train a new model with these parameters, I get, for example, a test accuracy of 0.8339. The code looks like this:
from sklearn.ensemble import RandomForestClassifier

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf = RandomForestClassifier()
# note: bootstrap must be the boolean False, not the string 'False'
rf.set_params(n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
              max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False)
rf.fit(training_features, training_labels)
print('Training Accuracy: ', rf.score(training_features, training_labels))
print('Test Accuracy: ', rf.score(test_features, test_labels))
The solution is to set random_state to the same value in both cases (it was missing for the new estimator). A RandomForestClassifier without a fixed random_state draws different bootstrap samples and feature subsets on every fit, so two runs with identical hyperparameters can produce slightly different accuracies.
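A minimal sketch of this effect, using a synthetic dataset in place of the asker's gd.get_data (which is not available here) and a smaller forest for speed: two forests built with the same hyperparameters agree exactly only when they share a random_state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters in the style of the question (smaller n_estimators for speed)
params = dict(n_estimators=100, max_depth=20, max_features=0.2,
              min_samples_split=11, min_samples_leaf=2, criterion='entropy')

# Same seed -> identical forests -> identical scores
a = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
b = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
assert a.score(X_test, y_test) == b.score(X_test, y_test)
```

Without the shared random_state=42, the two scores typically differ by a small amount, which is exactly the 0.8417 vs. 0.8339 gap seen in the question.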