Stability of Random Forest feature importance output
I'm fitting 2 almost identical Random Forest regression models. Both models use the same dataset, which has 60 features and 90 data points. The only difference is that they use different targets (the target column of each model is excluded from the respective feature dataframe, of course). All of the cross-validation settings are the same for both models (number of folds, number of iterations, scoring), and the hyperparameter grids are also identical.
I'm interested in the feature importance output. However, one of the models consistently outputs the same top features while the other doesn't. Does anyone know why this is the case?
1 Answer
You can set a seed via the random_state parameter if you rely on sklearn.ensemble.RandomForestRegressor, in order to stabilize your results.
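For example, a minimal sketch (the placeholder data here is an assumption, shaped like the 90×60 dataset from the question):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data matching the question's shape: 90 rows, 60 features.
rng = np.random.RandomState(0)
X = rng.rand(90, 60)
y = rng.rand(90)

# Fixing random_state makes the bootstrap samples and the feature
# subsampling at each split reproducible, so feature_importances_
# comes out identical across reruns on the same data.
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X, y)
print(model.feature_importances_)
```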
It's quite normal to get varying feature importances since the forest is assembled randomly. Furthermore, the impurity-based importance the forest reports may not be the best measure of actual feature relevance. You could try the Boruta algorithm or permutation feature importance to get a different perspective.
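A minimal sketch of permutation importance using scikit-learn's sklearn.inspection.permutation_importance (it assumes the fitted model and the X, y from the previous snippet):

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature column and measure how much the model's score
# drops; features whose shuffling barely hurts the score are likely
# unimportant, regardless of their impurity-based importance.
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
top = result.importances_mean.argsort()[::-1][:10]
for i in top:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```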
As for your actual question: maybe your regressors are better suited to predicting one target variable than the other.
How do both models perform accuracy-wise on the data? That might be one way to explain why one model is more stable than the other. Do the feature importances remain unstable when you fit a larger number of trees?
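One way to check this is to refit the forest with more trees under several different seeds and see whether the top features agree; a sketch (reusing the X, y placeholders from above):

```python
from sklearn.ensemble import RandomForestRegressor

# Refit with different seeds and more trees. If the set of top
# features still changes from seed to seed, the importance ranking is
# genuinely unstable for this target rather than a seeding artifact.
for seed in range(5):
    rf = RandomForestRegressor(n_estimators=2000, random_state=seed)
    rf.fit(X, y)
    top5 = rf.feature_importances_.argsort()[::-1][:5]
    print(f"seed {seed}: top-5 features = {top5.tolist()}")
```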