Feature selection with ordinal, nominal categorical, and continuous variables as predictors
I am looking to classify loan defaulters (0 for non-defaulter, 1 for defaulter) from a dataset containing 13,000+ rows and 162 predictor variables. The predictors are a mix of ordinal categorical, nominal categorical, and continuous variables, along with dummy variables.
As this is a classification problem, I plan to apply logistic regression, SVM, and decision trees. I am finding it difficult to run feature selection on such a varied pool of predictors. My first attempt is to separate the categorical variables (grouping ordinal and nominal together) from the continuous variables, and select features using the chi-square test and ANOVA respectively.
I hope this explains the problem.
Comments (1)
You could try running a cross tabulation between pairs of features to check whether the variables you have selected are associated with one another.
For instance, feature A and feature B may each seem to contribute independently to the dependent variable. While performing the cross tabulation, you may observe that feature A is strongly associated with feature B, which indicates that you can drop feature A, since its values can largely be deduced from feature B.
Sometimes a bit of cross tabulation can do wonders. Hope it helps.
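As a rough sketch of that check, `pandas.crosstab` can build the table and `scipy.stats.chi2_contingency` can test it. The data below is synthetic, the columns `A`, `B`, and `C` are hypothetical, and summarising the table with Cramér's V is my own addition for illustration, not something the approach requires.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Association between two categorical series, from 0 (none) to 1 (perfect).

    Assumes each series has at least two distinct levels.
    """
    table = pd.crosstab(a, b)
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * min_dim)))


# Synthetic example: A mostly copies B, while C is unrelated to both.
rng = np.random.default_rng(0)
feature_b = rng.integers(0, 3, size=1000)
feature_a = feature_b.copy()
flip = rng.random(1000) < 0.05  # re-draw about 5% of A at random
feature_a[flip] = rng.integers(0, 3, size=flip.sum())
feature_c = rng.integers(0, 3, size=1000)

df = pd.DataFrame({"A": feature_a, "B": feature_b, "C": feature_c})

# The raw cross tabulation: counts pile up on the diagonal when A follows B.
print(pd.crosstab(df["A"], df["B"]))

v_ab = cramers_v(df["A"], df["B"])
v_ac = cramers_v(df["A"], df["C"])
print(f"Cramer's V for A vs B: {v_ab:.2f}")
print(f"Cramer's V for A vs C: {v_ac:.2f}")
```

A value close to 1 for a pair suggests one of the two features is redundant; where to set the cutoff is a judgment call.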