Why do I get different results from the randomized-search estimator and a new estimator created with the best parameters?

Published 2025-02-04 11:21:14


I have experienced unexpected behaviour with the estimator returned by RandomizedSearchCV:

I am searching for the best parameters for a random forest. When I determine the accuracy with the resulting best estimator, I get different results compared to training a new random forest with the best parameters from the randomized search. Why is that?

Here is a code example for the RandomizedSearchCV (with just a few iterations):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

n_estimators = np.linspace(start=100, stop=2500, num=11, dtype=int)
max_features = ['sqrt', None, 0.2, 0.4]
max_depth = [10, 20, 50, 75, 100, 125, 150]
min_samples_split = [2, 5, 8, 11]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
criterion = ['gini', 'entropy']
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}
rf_base = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf_base, param_distributions=random_grid,
                               n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf_random.fit(training_features, training_labels)
print("The best estimator: ", rf_random.best_estimator_)
print("The best score: ", rf_random.best_score_)

print('Training Accuracy:', rf_random.score(training_features, training_labels))
print('Test Accuracy:', rf_random.score(test_features, test_labels))

This returns, for example, n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False, and a test accuracy of 0.8417.

But when I train a new model with these parameters, I get, for example, a test accuracy of 0.8339. The code looks like this:

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf = RandomForestClassifier()
rf.set_params(n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
              max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False)

rf.fit(training_features, training_labels)
print('Training Accuracy:', rf.score(training_features, training_labels))
print('Test Accuracy:', rf.score(test_features, test_labels))


So要识趣 2025-02-11 11:21:14


The solution is to set random_state to the same value in both cases (it was missing for the new estimator). Without a fixed seed, each RandomForestClassifier draws different bootstrap samples and feature subsets, so two fits with identical hyperparameters can still produce slightly different forests and accuracies.
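A minimal sketch of this effect, using synthetic data rather than the asker's gd.get_data: two forests with the same hyperparameters reproduce each other only when they also share the same random_state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (the asker's gd.get_data is not available here).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Same hyperparameters, and crucially the same random_state, for both models.
params = dict(n_estimators=50, max_depth=5, random_state=42)

rf_a = RandomForestClassifier(**params).fit(X, y)
rf_b = RandomForestClassifier(**params).fit(X, y)

# With a shared random_state the two forests are identical, so their
# predictions and scores match exactly.
assert (rf_a.predict(X) == rf_b.predict(X)).all()
assert rf_a.score(X, y) == rf_b.score(X, y)
```

Omitting random_state from `params` (or using different seeds) removes this guarantee, which is exactly the discrepancy between best_estimator_ and the retrained model in the question.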
