Estimating the total training time is difficult, because it does not scale linearly with the amount of data.
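One rough way to see this is to time the fit on growing subsets of the data. The snippet below is only a sketch: the synthetic dataset, the RandomForestClassifier stand-in and the helper name time_fits are assumptions, not part of the original question.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_classification(n_samples=4000, n_features=50, random_state=0)


def time_fits(X, y, sizes):
    """Return (size, seconds) pairs for fitting on the first `size` rows."""
    timings = []
    for size in sizes:
        # Stand-in model (assumption); use the model you actually tune.
        model = RandomForestClassifier(n_estimators=20, random_state=0)
        start = time.perf_counter()
        model.fit(X[:size], y[:size])
        timings.append((size, time.perf_counter() - start))
    return timings


timings = time_fits(X, y, sizes=[500, 1000, 2000, 4000])
for size, seconds in timings:
    print(f"{size:>5} rows: {seconds:.3f} s")
```

If the time per row changes noticeably between sizes, extrapolating from a small subsample to the full data will be unreliable.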
Some additional practical suggestions:
set n_jobs=-1 to run the model in parallel on all cores;
use a feature selection method to reduce the number of features. 300 features is a lot, and it should be possible to drop around half of them without a serious decline in model performance.
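Both suggestions can be combined in a few lines. This is a minimal sketch under stated assumptions: the data is synthetic, RandomForestClassifier stands in for the actual model, and SelectFromModel is just one of many possible feature selection approaches.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in with 300 features (assumption); replace with the real X, y.
X, y = make_classification(
    n_samples=1000, n_features=300, n_informative=30, random_state=0
)

# n_jobs=-1 asks the forest to use all available cores.
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)

# Keep the 150 features with the highest importance. Setting
# threshold=-np.inf makes max_features the only selection criterion.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0),
    threshold=-np.inf,
    max_features=150,
)
X_reduced = selector.fit(X, y).transform(X)
print(X_reduced.shape)

# Compare cross-validated accuracy with and without selection. The selector
# sits inside the pipeline so it is refit on each training fold.
full_score = cross_val_score(model, X, y, cv=3).mean()
reduced_score = cross_val_score(make_pipeline(selector, model), X, y, cv=3).mean()
print(f"all 300 features: {full_score:.3f}")
print(f"top 150 features: {reduced_score:.3f}")
```

If the two scores are close, the smaller feature set is probably good enough and will make every later tuning run cheaper.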
Random subsampling and then tuning on the full data rarely works, because a small subsample may not be representative of the full data.
On the amount of data versus model quality: try the learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
This lets you plot model performance against the amount of training data.
Here are some plotting examples:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
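For reference, a minimal sketch of calling learning_curve directly. The synthetic data and the RandomForestClassifier are assumptions standing in for the real dataset and model; the numbers are printed here rather than plotted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# Fit the model on 5 increasing fractions of the training folds,
# each evaluated with 5-fold cross-validation.
train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Each scores array has one row per training size and one column per fold.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_valid):
    print(f"{n:>5} samples: train={tr:.3f}  validation={va:.3f}")
```

Plotting mean_valid against train_sizes (for example with matplotlib) shows whether the validation score is still rising as data is added or has already flattened out.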