Random Forest with Grid Search performs worse than a simple Random Forest
I am training a model using a simple Random Forest, and then another model on the exact same dataset using Random Forest with Grid Search. Supposedly, since Grid Search looks for the best combination of values, the performance of the latter should be higher, but the opposite is happening.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest
clf = RandomForestClassifier()
model = clf.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model metrics
results = classifmodel_Metrics('rf', model, y_test, y_pred)
list_of_results.append(results)

# GridSearchCV
clf = RandomForestClassifier()
parameter_grid = {'n_estimators': [50, 100, 150, 250, 500, 1000, 1500, 2000, 2500, 3000],
                  'max_depth': [1, 2, 3, 4, 5, 6]}
gridSearch = GridSearchCV(clf, parameter_grid, cv=5, n_jobs=1, verbose=5)
gridSearchResults = gridSearch.fit(X, y)
print(gridSearchResults.best_estimator_)

clf = gridSearchResults.best_estimator_
model = clf.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model metrics
results = classifmodel_Metrics('rfopt', model, y_test, y_pred)
list_of_results.append(results)
print(list_of_results)
Does anyone know why this is happening? Is something wrong with my code, or is it something that can sporadically happen?
The function I use to calculate my model performance is below, with F1 being the value I use as a reference (the higher the F1, the better the model):
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def classifmodel_Metrics(modelName, model, actual, predicted):
    classes = list(np.unique(np.concatenate((actual, predicted))))
    confMtx = confusion_matrix(actual, predicted)
    print("Confusion Matrix")
    print(confMtx)

    report = classification_report(actual, predicted, output_dict=True)
    precision = report["macro avg"]["precision"]
    recall = report["macro avg"]["recall"]
    f1 = report["macro avg"]["f1-score"]  # macro-averaged F1 (per-class F1 averaged over classes)

    res = pd.Series({
        "ModelName": modelName,
        "Model": model,
        "accuracy": round(accuracy_score(actual, predicted), 3),
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3)
    })

    if len(classes) == 2:
        print("\naccuracy: {0:.2%}".format(round(accuracy_score(actual, predicted), 3)))
        print("\nprecision: {0:.2%}".format(precision))
        print("\nrecall: {0:.2%}".format(recall))
        print("\nf1: {0:.2%}".format(f1))
    else:
        print("\n", classification_report(actual, predicted))

    return res
1 Answer
It's likely that your parameter grid is just not capturing enough depth for your model to learn well.
You have:
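parameter_grid = {'n_estimators': [50, 100, 150, 250, 500, 1000, 1500, 2000, 2500, 3000],
                  'max_depth': [1, 2, 3, 4, 5, 6]}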
where you are limiting your model to at most 64 leaves (2^6) by capping max_depth at 6. In contrast, the scikit-learn default (max_depth=None) grows each tree as many layers as necessary, until each of your n samples can end up in its own leaf.
To improve performance I would use more depth options. It's also highly unlikely that you need so many trees in your forest (diminishing returns). Try something like this instead:
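For example, a grid along these lines (the specific values here are illustrative, not the original answer's exact suggestion) covers a much wider range of depths with far fewer tree counts:

# Illustrative grid: fewer n_estimators values, wider max_depth range (values are assumptions)
parameter_grid = {'n_estimators': [100, 250, 500],
                  'max_depth': [2, 4, 8, 16, 32, None]}
gridSearch = GridSearchCV(RandomForestClassifier(), parameter_grid, cv=5, n_jobs=1, verbose=5)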
If your best model is topping out the max depth, you can try increasing the limit you are testing.
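A minimal sketch of that check, assuming the gridSearchResults and parameter_grid objects from above:

# If the selected depth equals the largest finite depth tested, the grid may be too restrictive
best_depth = gridSearchResults.best_params_['max_depth']
finite_depths = [d for d in parameter_grid['max_depth'] if d is not None]
if best_depth is not None and best_depth == max(finite_depths):
    print("Best max_depth hit the upper limit of the grid; try testing larger values.")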