Random Forest with grid search performs worse than a plain Random Forest

Posted on 2025-01-23 19:33:42

I am training a model using a simple Random Forest, and then another model on the exact same dataset using a Random Forest with grid search. Supposedly, since grid search looks for the best combination of hyperparameter values, the performance of the latter model should be higher, but the opposite is happening.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

#Random Forest
clf=RandomForestClassifier()
model=clf.fit(X_train, y_train)
y_pred=model.predict(X_test)

#Model metrics
results=classifmodel_Metrics('rf',model, y_test, y_pred)
list_of_results.append(results)

#GridSearchCV

clf=RandomForestClassifier()
parameter_grid={'n_estimators':[50,100,150,250,500,1000,1500,2000,2500,3000],
                'max_depth':[1,2,3,4,5,6]}
gridSearch=GridSearchCV(clf,parameter_grid,cv=5,n_jobs=1,verbose=5)
gridSearchResults=gridSearch.fit(X,y)

print(gridSearchResults.best_estimator_)
clf=gridSearchResults.best_estimator_
model=clf.fit(X_train, y_train)
y_pred=model.predict(X_test)

#Model metrics
results=classifmodel_Metrics('rfopt',model,y_test,y_pred)
list_of_results.append(results)

print(list_of_results)

Does anyone know why this is happening? Is something wrong with my code, or is it something that can sporadically happen?
The function I use to calculate my model's performance is below; F1 is the value I use as the reference (the higher the F1, the better the model).


import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

def classifmodel_Metrics(modelName, model, actual, predicted):

    classes = list(np.unique(np.concatenate((actual,predicted))))

    confMtx = confusion_matrix(actual,predicted)

    print("Confusion Matrix")
    print(confMtx)

    report = classification_report(actual,predicted,output_dict = True)

    precision = report["macro avg"]["precision"]
    recall = report["macro avg"]["recall"]
    f1 = report["macro avg"]["f1-score"] # Harmonic mean of precision and recall

    res = pd.Series({
    "ModelName":modelName,
    "Model":model,
    "accuracy":round(accuracy_score(predicted,actual),3),
    "precision": round(precision,3),
    "recall": round(recall,3),
    "f1": round(f1,3)
    })

    if len(classes) == 2:
        print("\naccuracy: {0:.2%}".format(round(accuracy_score(predicted,actual),3)))
        print("\nprecision: {0:.2%}".format(precision))
        print("\nrecall: {0:.2%}".format(recall))
        print("\nf1: {0:.2%}".format(f1))
    else:
        print("\n",classification_report(actual,predicted))

    return res
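
For reference, a minimal sketch of how the collected results could be compared side by side, assuming list_of_results holds the Series returned by the function above:

# Sketch (hypothetical usage): tabulate the collected metrics for comparison.
results_df = pd.DataFrame(list_of_results)
print(results_df[['ModelName', 'accuracy', 'precision', 'recall', 'f1']])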

Comments (1)

淑女气质 2025-01-30 19:33:42

It's likely that your parameter grid is just not capturing enough depth for your model to learn well.

You have:

parameter_grid={'n_estimators':[50,100,150,250,500,1000,1500,2000,2500,3000],
                'max_depth':[1,2,3,4,5,6]}

With a max depth of 6, you are limiting each tree to at most 64 leaves (2^6). In contrast, the scikit-learn default (max_depth=None) grows every tree as deep as necessary until each of your n samples ends up in its own leaf (<n/2).
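
To see how deep the unconstrained forest actually grows, one quick check (a minimal sketch, assuming model is the default-parameter RandomForestClassifier fitted in the question) is to inspect the fitted trees:

# Sketch: inspect how deep the default (unconstrained) forest actually grows.
# Assumes `model` is the fitted default RandomForestClassifier from the question.
depths = [tree.get_depth() for tree in model.estimators_]
leaves = [tree.get_n_leaves() for tree in model.estimators_]
print("max depth across trees:", max(depths))
print("max leaves across trees:", max(leaves))  # compare with 2**6 = 64 under max_depth=6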

To improve performance I would use more depth options. It's also highly unlikely that you need so many trees in your forest (diminishing returns). Try something like this instead:

parameter_grid={'n_estimators':[64, 128, 256],
                'max_depth':[2, 4, 8, 16, 36, 64]}

If your best model is topping out the max depth, you can try increasing the limit you are testing.
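
A minimal sketch of that check, assuming the revised parameter_grid above and (unlike the snippet in the question) fitting the search on the training split only, so the held-out test set is not used during tuning:

# Sketch: rerun the search with the wider grid and check whether the selected
# max_depth sits on the edge of the grid (if so, widen the grid and search again).
gridSearch = GridSearchCV(RandomForestClassifier(), parameter_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print("best params:", gridSearch.best_params_)
if gridSearch.best_params_['max_depth'] == max(parameter_grid['max_depth']):
    print("max_depth hit the grid boundary - consider testing larger values")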
