Best practices for running a random forest model as fast as possible

Published on 2025-01-25 22:31:47


Comments (1)

晨曦÷微暖 2025-02-01 22:31:47
  1. Random subsampling and then tuning on the full data rarely works, as a small subsample may not be representative of the full data.

  2. Regarding the amount of data vs. model quality: try scikit-learn's learning curves (see the sketch after this list): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html

    from sklearn.model_selection import learning_curve

    # estimator, X, y, cv, n_jobs and train_sizes must be defined beforehand;
    # return_times=True additionally returns the fit times for each subset size.
    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )

This way you'll be able to plot the amount of data vs. model performance.
Here are some plotting examples:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py

https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py

  3. Estimating the total training time is difficult, because it isn't linear in the amount of data.
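Below is a minimal, hypothetical sketch tying points 2 and 3 together: it runs learning_curve with a RandomForestClassifier on synthetic make_classification data and plots both the cross-validation score and the fit time against the training-set size. The dataset, forest settings, and plot labels are assumptions chosen only for illustration, not the original poster's setup.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    # Hypothetical synthetic data standing in for the real problem.
    X, y = make_classification(n_samples=5000, n_features=300, random_state=0)

    estimator = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5),
        return_times=True,
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Score vs. amount of training data: shows whether more data still helps.
    ax1.plot(train_sizes, train_scores.mean(axis=1), "o-", label="train score")
    ax1.plot(train_sizes, test_scores.mean(axis=1), "o-", label="cv score")
    ax1.set_xlabel("training examples")
    ax1.set_ylabel("accuracy")
    ax1.legend()

    # Fit time vs. amount of training data: usually not a straight line,
    # which is why extrapolating the total runtime is unreliable.
    ax2.plot(train_sizes, fit_times.mean(axis=1), "o-")
    ax2.set_xlabel("training examples")
    ax2.set_ylabel("fit time (s)")

    plt.tight_layout()
    plt.show()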

Some additional practical suggestions:

  • Set n_jobs=-1 to train the forest in parallel on all CPU cores;
  • Use some feature selection approach to decrease the number of features. 300 features is a lot; it should be possible to drop around half of them without a serious decline in model performance (see the sketch below).
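As an illustration of the last two bullets, here is a small, hedged sketch using scikit-learn's SelectFromModel with threshold="median" to keep roughly the more important half of the features, then training the forest on the reduced matrix with n_jobs=-1. The synthetic data and exact settings are assumptions for demonstration only.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Hypothetical data with many (mostly uninformative) features.
    X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                               random_state=0)

    # A small forest fitted once, only to rank features by importance.
    ranker = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
    selector = SelectFromModel(ranker, threshold="median")  # keeps ~half the features
    X_reduced = selector.fit_transform(X, y)

    # The final model trains on fewer columns and uses all CPU cores.
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X_reduced, y)

    print(X.shape, "->", X_reduced.shape)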