Best practices for running a random forest model as fast as possible

Published on 2025-01-25 22:31:47


Comments (1)

晨曦÷微暖 2025-02-01 22:31:47
  1. Random subsampling and then tuning on the full data rarely works, as a small subsample may not be representative of the full data.

  2. Regarding the amount of data vs. model quality: try scikit-learn's learning curves (see the sketch after this list): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html

    from sklearn.model_selection import learning_curve

    # estimator, X, y, cv, n_jobs and train_sizes must be defined beforehand;
    # return_times=True additionally returns the fit times for each subset size.
    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )

This way you'll be able to plot the amount of data vs. model performance.
Here are some plotting examples:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py

https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py

  3. Estimating the total training time is difficult, because it isn't linear in the amount of data.
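Below is a minimal, hypothetical sketch tying points 2 and 3 together: it runs learning_curve with a RandomForestClassifier on synthetic make_classification data and plots both the cross-validation score and the fit time against the training-set size. The dataset, forest settings, and plot labels are assumptions chosen only for illustration, not the original poster's setup.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    # Hypothetical synthetic data standing in for the real problem.
    X, y = make_classification(n_samples=5000, n_features=300, random_state=0)

    estimator = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5),
        return_times=True,
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Score vs. amount of training data: shows whether more data still helps.
    ax1.plot(train_sizes, train_scores.mean(axis=1), "o-", label="train score")
    ax1.plot(train_sizes, test_scores.mean(axis=1), "o-", label="cv score")
    ax1.set_xlabel("training examples")
    ax1.set_ylabel("accuracy")
    ax1.legend()

    # Fit time vs. amount of training data: usually not a straight line,
    # which is why extrapolating the total runtime is unreliable.
    ax2.plot(train_sizes, fit_times.mean(axis=1), "o-")
    ax2.set_xlabel("training examples")
    ax2.set_ylabel("fit time (s)")

    plt.tight_layout()
    plt.show()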

Some additional practical suggestions:

  • Set n_jobs=-1 to train the forest in parallel on all CPU cores;
  • Use some feature selection approach to decrease the number of features. 300 features is a lot; it should be possible to drop around half of them without a serious decline in model performance (see the sketch below).
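As an illustration of the last two bullets, here is a small, hedged sketch using scikit-learn's SelectFromModel with threshold="median" to keep roughly the more important half of the features, then training the forest on the reduced matrix with n_jobs=-1. The synthetic data and exact settings are assumptions for demonstration only.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Hypothetical data with many (mostly uninformative) features.
    X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                               random_state=0)

    # A small forest fitted once, only to rank features by importance.
    ranker = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
    selector = SelectFromModel(ranker, threshold="median")  # keeps ~half the features
    X_reduced = selector.fit_transform(X, y)

    # The final model trains on fewer columns and uses all CPU cores.
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X_reduced, y)

    print(X.shape, "->", X_reduced.shape)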