Further Improving a Random Forest

Posted 2025-01-13 05:53:42


Comments (1)

柒夜笙歌凉 2025-01-20 05:53:42


Here's what stands out to me:

  • You split the data but do not use the splits.
  • You're scaling the data, but tree-based methods like random forests do not need this step.
  • You are doing your own tuning loop instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to sweep over another hyperparameter).
  • If you use GridSearchCV you don't need to do your own cross-validation.
  • You're using accuracy for evaluation, which is usually not a great metric for multi-class classification. Weighted F1 is better.
  • If you're doing cross-validation, you need to put the scaler inside the CV loop (e.g. using a pipeline, as sketched below), because otherwise the scaler has already seen the validation data... but you don't need a scaler for this learning algorithm, so the point is moot.
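For reference, in case you ever do need scaling (say, for an SVM or logistic regression), wrapping the scaler and the model in a pipeline keeps the scaling inside the CV loop, so the scaler is refit on each training fold. A minimal sketch (logistic regression is just a stand-in estimator here):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on each CV training fold only, never on the
# corresponding validation fold, so there is no leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# pipe can be passed to GridSearchCV or cross_val_score like any estimator.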

I would probably do something like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

X, y = make_classification()

# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)

# Make things for the cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)

# Create and train the cross validation.
clf = GridSearchCV(model, param_grid,
                   scoring='f1_weighted',
                   cv=cv, verbose=3)

clf.fit(X_train, y_train)

Take a look at clf.cv_results_ for the scores etc., which you can plot if you want. By default GridSearchCV refits a final model on the best hyperparameters, so you can make predictions with clf.
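If you want a quick sanity check on that refit model, you can score the untouched test split with the same metric. A small sketch continuing the snippet above:

from sklearn.metrics import f1_score

# clf was refit on all of X_train with the best hyperparameters,
# so it can predict on the held-out split directly.
y_pred = clf.predict(X_test)
print(clf.best_params_)
print(f1_score(y_test, y_pred, average='weighted'))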

Almost forgot... you asked about improving the model :) Here are some ideas:

  • The above will help you tune more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf); see the sketch after this list. But don't get too carried away with hyperparameter tuning.
  • You could try transforming some features (columns in X), or adding new ones.
  • Look for more data, e.g. more rows, higher-quality labels, etc.
  • Address any issues with class imbalance.
  • Try a more sophisticated algorithm, like gradient-boosted trees (there are models in sklearn, or take a look at xgboost).
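To make the first and last two points concrete, here is a minimal sketch extending the setup above: a wider (purely illustrative, not tuned) grid that also tries class_weight='balanced' as one way to handle imbalance, plus a quick cross-validated comparison against sklearn's HistGradientBoostingClassifier. It reuses X_train, y_train, and cv from the earlier snippet:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative values, not tuned recommendations.
param_grid = {
    'max_depth': np.arange(3, 8),
    'n_estimators': [100, 300],
    'max_features': ['sqrt', None],
    'min_samples_leaf': [1, 5],
    'class_weight': [None, 'balanced'],
}

clf = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                   scoring='f1_weighted', cv=cv, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)

# Gradient-boosted trees under the same CV scheme, for comparison.
gbt = HistGradientBoostingClassifier(random_state=1)
scores = cross_val_score(gbt, X_train, y_train, scoring='f1_weighted', cv=cv)
print(scores.mean(), scores.std())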