Categorical feature (object/float) selection for a regression problem in Python
I have a set of alphanumeric categorical features (c_1, c_2, ..., c_n)
and one numeric target variable (prediction)
as a pandas dataframe. Can you suggest any feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does this:
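The pandas helper in question is `pd.get_dummies`; a minimal sketch with an illustrative toy dataframe (the column names `c_1`, `c_2`, `prediction` follow the question, the values are made up):

```python
import pandas as pd

# Toy dataframe: alphanumeric categorical features plus a numeric target
df = pd.DataFrame({
    "c_1": ["a", "b", "a", "c"],
    "c_2": ["x", "x", "y", "y"],
    "prediction": [1.0, 2.5, 1.2, 3.1],
})

# One-hot encode only the categorical columns; the target is left untouched
encoded = pd.get_dummies(df, columns=["c_1", "c_2"])
print(encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per distinct value (e.g. `c_1_a`, `c_1_b`, `c_1_c`), which all downstream models can consume as numeric input.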
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features. This is the best approach, but it is seldom a viable option.
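Exhaustive search can be sketched with `itertools.combinations`; this assumes scikit-learn is available and uses a toy dataframe and a plain linear regression purely for illustration:

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Illustrative data; in practice use your own encoded dataframe
df = pd.DataFrame({
    "c_1": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "c_2": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "prediction": [1.0, 2.5, 1.2, 3.1, 2.4, 1.1, 3.0, 2.6],
})
X = pd.get_dummies(df.drop(columns="prediction")).astype(float)
y = df["prediction"]

features = list(X.columns)
best_score, best_subset = float("-inf"), None
# Try every non-empty subset of features -- exponential in len(features)!
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        score = cross_val_score(
            LinearRegression(), X[list(subset)], y, cv=2, scoring="r2"
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print(best_subset, best_score)
```

With n features this evaluates 2^n - 1 subsets, which is exactly why the answer calls it "seldom a viable option" beyond a handful of features.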
A possible alternative is to rank all the features by their correlation with the target, then sequentially add them to the model, measure the model's performance at each step, and select the set of features that provides the best performance.
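The correlation-ranked forward procedure above can be sketched as follows (again assuming scikit-learn, a toy dataframe, and a linear model as placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "c_1": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "c_2": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "prediction": [1.0, 2.5, 1.2, 3.1, 2.4, 1.1, 3.0, 2.6],
})
X = pd.get_dummies(df.drop(columns="prediction")).astype(float)
y = df["prediction"]

# Rank features by absolute correlation with the target
ranked = X.corrwith(y).abs().sort_values(ascending=False).index

best_score, best_set, current = float("-inf"), [], []
for feat in ranked:
    # Add the next-most-correlated feature and re-evaluate the model
    current.append(feat)
    score = cross_val_score(
        LinearRegression(), X[current], y, cv=2, scoring="r2"
    ).mean()
    if score > best_score:
        best_score, best_set = score, list(current)
print(best_set, best_score)
```

This evaluates only n candidate feature sets instead of 2^n, at the cost of possibly missing interactions between features that are individually weakly correlated with the target.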
If you have high-dimensional data, you can consider reducing the dimensionality using PCA or another dimensionality reduction technique. PCA projects the data into a lower-dimensional space, reducing the number of features; obviously, you will lose some information due to the approximation.
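A minimal PCA sketch with scikit-learn; the choice of `n_components=2` and the toy data are illustrative only:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "c_1": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "c_2": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "prediction": [1.0, 2.5, 1.2, 3.1, 2.4, 1.1, 3.0, 2.6],
})
X = pd.get_dummies(df.drop(columns="prediction")).astype(float)

# Project the one-hot columns down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```

`explained_variance_ratio_` quantifies exactly how much information the projection retains, which makes the "you will lose some information" trade-off explicit.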
These are only some examples of methods to perform feature selection; there are many others.
Final tips: