Here's what stands out to me:
You're scaling the data, but tree-based methods like random forests do not need this step.
You are doing your own tuning loop instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to sweep over another hyperparameter).
If you use GridSearchCV, you don't need to do your own cross validation.
You're using accuracy for evaluation, which is usually not a great metric for multi-class classification. Weighted F1 is better.
If you're doing cross validation, you need to put the scaler inside the CV loop (e.g. using a pipeline), because otherwise the scaler has already seen the validation data... but you don't need a scaler for this learning algorithm, so the point is moot. There's a short pipeline sketch below anyway, in case you ever swap in an estimator that does need scaling.
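For that hypothetical case only, a minimal sketch of keeping the scaler inside the CV loop with a Pipeline might look like this (LogisticRegression is just an illustrative scale-sensitive estimator, not part of your setup):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification()

# The scaler is fit on each training fold only, so the validation folds stay unseen.
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_weighted')
print(scores.mean())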
I would probably do something like this:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split

X, y = make_classification()

# Split off a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, shuffle=True, random_state=0)

# Set up the cross-validation scheme and the hyperparameter grid.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)

# Run the grid search over the CV folds.
clf = GridSearchCV(model, param_grid,
                   scoring='f1_weighted',
                   cv=cv, verbose=3)
clf.fit(X_train, y_train)
Take a look at clf.cv_results_ for the scores etc., which you can plot if you want. By default GridSearchCV refits a final model on the best hyperparameters, so you can make predictions with clf.
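For example, continuing from the snippet above (pandas is only used here to print cv_results_, which is just a dict of arrays, as a table):

import pandas as pd

# One row per candidate max_depth, with mean/std validation scores across the CV repeats.
results = pd.DataFrame(clf.cv_results_)
print(results[['param_max_depth', 'mean_test_score', 'std_test_score']])

# Best hyperparameters and their mean CV score.
print(clf.best_params_, clf.best_score_)

# The refit best model is what clf uses when you predict.
y_pred = clf.predict(X_test)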
Almost forgot... you asked about improving the model :) Here are some ideas:
The above will help you tune over more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf); see the first sketch after this list. But don't get too carried away with hyperparameter tuning.
You could try transforming some features (columns in X), or adding new ones.
Look for more data, e.g. more rows, higher quality labels, etc.
Address any issues with class imbalance (the class_weight option in the first sketch below is one simple lever).
Try a more sophisticated algorithm, like gradient boosted trees (there are models in sklearn, or take a look at xgboost); see the second sketch below.
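To tune more hyperparameters, you can simply widen param_grid. This sketch reuses np, cv, X_train, and y_train from the snippet above; the specific grid values are only illustrative starting points, not recommendations:

# A wider (illustrative) hyperparameter grid.
param_grid = {
    'max_depth': np.arange(3, 8),
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 0.5],
    'min_samples_leaf': [1, 5],
}
# class_weight='balanced' is one simple lever for class imbalance.
model = RandomForestClassifier(class_weight='balanced', random_state=1)
clf = GridSearchCV(model, param_grid, scoring='f1_weighted', cv=cv, verbose=3)
clf.fit(X_train, y_train)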
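And for gradient boosted trees, sklearn's HistGradientBoostingClassifier slots into the same grid-search pattern (xgboost's XGBClassifier works the same way if you have it installed); again the grid values here are only illustrative:

from sklearn.ensemble import HistGradientBoostingClassifier

# Same setup as above, different estimator and grid.
gb_model = HistGradientBoostingClassifier(random_state=1)
gb_grid = {'max_depth': [None, 3, 5], 'learning_rate': [0.05, 0.1]}
gb_clf = GridSearchCV(gb_model, gb_grid, scoring='f1_weighted', cv=cv, verbose=3)
gb_clf.fit(X_train, y_train)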