PyCaret does not handle multicollinearity well
I have a pandas DataFrame df as input to the PyCaret library.
The df has:
3 categorical variables:
LIB_SOURCE : values 'arome_001', 'gfs_025' and 'arpege_01'
MonthNumber : values from 1 to 12
origine : values 'Sencrop' and 'Visiogreen'
3 continuous variables:
TEMPERATURE_PREDITE, DIFF_HOURS, TEMPERATURE_OBSERVEE
I let PyCaret encode the categorical features as 0/1 and handle multicollinearity:
from pycaret.regression import setup

# dataset_predictions_meteo is the DataFrame described above
regression = setup(
    data=dataset_predictions_meteo,
    target='TEMPERATURE_PREDITE',
    categorical_features=['MonthNumber', 'origine', 'LIB_SOURCE'],
    numeric_features=['DIFF_HOURS', 'TEMPERATURE_OBSERVEE'],
    session_id=123,
    train_size=0.8,
    normalize=True,
    # transform_target=True,
    remove_perfect_collinearity=True,
)
But as the screenshot above shows, PyCaret does not handle multicollinearity well: I expected it to drop one of the three columns 'arome_001', 'gfs_025' and 'arpege_01' on its own (see get_config('X')).
Instead, PyCaret keeps all three columns.
Why doesn't PyCaret remove one of the three columns?
Thanks.
Comments (2)
Multicollinearity means that two or more features are strongly correlated, i.e. their correlation coefficient is close to +1.0 or -1.0. Correlated features change together: when one changes, the other changes too. This can hurt model performance, in particular by making the coefficients of linear models unstable. PyCaret handles multicollinearity internally to produce well-performing models.
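To make "a correlation coefficient close to +1.0 or -1.0" concrete, here is a minimal sketch in plain Python. The temperature and hour values are invented for illustration, and `pearson` is a hypothetical helper written for this example, not part of PyCaret.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up temperatures in Celsius and the same readings in Fahrenheit:
# one column is an exact linear function of the other.
temp_c = [2.0, 5.5, 9.0, 14.5, 20.0]
temp_f = [c * 9 / 5 + 32 for c in temp_c]

# A made-up feature with no built-in relation to the temperature.
diff_hours = [3, 1, 4, 1, 5]

r_perfect = pearson(temp_c, temp_f)    # perfectly collinear pair
r_weak = pearson(temp_c, diff_hours)   # only loosely related pair

print(f"temp_c vs temp_f:     {r_perfect:.3f}")
print(f"temp_c vs diff_hours: {r_weak:.3f}")
```

The first pair is the kind of redundancy a correlation-based filter can detect; the second pair would be left alone.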
When features are multicollinear, PLS (Partial Least Squares Regression) and PCA (Principal Component Analysis) can be used to remove the correlation among them. PLS regression reduces the features to a smaller set of components that are uncorrelated with each other. PCA likewise creates new, uncorrelated features that replace the original ones.
It is not clear to me why you think one of the three columns 'arome_001', 'gfs_025' and 'arpege_01' should be removed. My guess is that PyCaret works as expected.
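One way to see why all three columns may survive is sketched below. It assumes that remove_perfect_collinearity only looks at pairwise correlations equal to 1, which is how the PyCaret 2.x documentation describes the option; this has not been checked against the PyCaret source. The sample data is invented and `pearson` is a hypothetical helper. The three 0/1 columns built from LIB_SOURCE are redundant as a group (each row sums to 1), yet no two of them are perfectly correlated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented, balanced sample of the LIB_SOURCE column from the question.
lib_source = ["arome_001", "gfs_025", "arpege_01"] * 4

# One 0/1 indicator column per category level (one-hot encoding).
levels = sorted(set(lib_source))
dummies = {lvl: [int(v == lvl) for v in lib_source] for lvl in levels}

# Pairwise correlations between the indicator columns.
pairwise = {
    (a, b): pearson(dummies[a], dummies[b])
    for i, a in enumerate(levels)
    for b in levels[i + 1:]
}

# The redundancy sits in the sum of all three columns, not in any pair.
row_sums = [sum(dummies[lvl][i] for lvl in levels) for i in range(len(lib_source))]

# Dropping one level as the reference removes that exact linear dependency.
reduced = {lvl: col for lvl, col in dummies.items() if lvl != levels[0]}

for pair, r in pairwise.items():
    print(pair, round(r, 3))
print("all rows sum to 1:", all(s == 1 for s in row_sums))
print("columns kept after dropping a reference level:", list(reduced))
```

With pandas, the same reference-level drop can be done before calling setup via `pd.get_dummies(df, columns=['LIB_SOURCE'], drop_first=True)`. Whether this matters depends on the model: it mainly concerns unregularized linear models, while tree-based models are generally unaffected.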
I suppose that collinearity is being calculated for floats and integers. These columns are in fact categorical.