How to handle features with more than 80% missing values
I'm working with a really messy clinical dataset: 300 samples and 400 features, to be used for machine learning. My advisor pointed out some biologically meaningful features in this dataset and asked me to keep them, but many of them have more than 50% of their values missing, and some more than 80%. What should I do? Would imputing with the mode hurt model performance?
Comments (2)
In short: with a proper imputation method and data that are missing at random, model performance should not degrade, even if the proportion of missing data is large. However, choosing the proper method requires EDA and testing.
Are those features numeric or categorical? What about the target?
Even if they are biologically meaningful, that does not necessarily mean they affect the target.
If this is a classification problem, it would be a good idea to look at the distribution of each of those variables within each target class and run a t-test or Mann-Whitney U test to check whether there is a statistically significant difference. If there is not, you have a valid reason to drop the feature. For regression, you can study mutual information, correlations, and scatter plots. If both the feature and the target are categorical, run a chi-squared test, and so on.
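A minimal sketch of that screening step, assuming a binary target and using pandas and SciPy. The helper name `association_p_value` and the synthetic data are hypothetical, made up only for illustration. Each test uses only the rows where the feature is observed.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu


def association_p_value(feature: pd.Series, target: pd.Series) -> float:
    """P-value for feature vs. binary target, computed on observed rows only.

    Hypothetical helper: Mann-Whitney U for numeric features,
    chi-squared on the contingency table for categorical ones.
    """
    observed = feature.notna()
    x, y = feature[observed], target[observed]
    classes = np.sort(y.unique())
    if len(classes) != 2:
        raise ValueError("this sketch assumes a binary target")
    if pd.api.types.is_numeric_dtype(x):
        group_a = x[y == classes[0]]
        group_b = x[y == classes[1]]
        return float(mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue)
    table = pd.crosstab(x, y)
    return float(chi2_contingency(table)[1])  # index 1 is the p-value


# Synthetic stand-in for the clinical data (assumption: 300 samples, ~80% missing).
rng = np.random.default_rng(0)
n = 300
target = pd.Series(rng.integers(0, 2, size=n))

# Numeric feature shifted by the target, then masked to ~80% missing.
informative = pd.Series(rng.normal(size=n) + 2.0 * target.to_numpy())
informative = informative.mask(rng.random(n) < 0.8)

# Categorical feature unrelated to the target, also ~80% missing.
category = pd.Series(rng.choice(["A", "B"], size=n)).mask(rng.random(n) < 0.8)

p_informative = association_p_value(informative, target)
p_category = association_p_value(category, target)
print(f"numeric feature p = {p_informative:.2e}, categorical feature p = {p_category:.2f}")
```

With 400 features, these p-values are probably best treated as a screening aid rather than a final verdict, since some will look significant by chance.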
Imputing numeric values can be tricky, since in biology we often have no idea of the underlying distribution. Still, the ~60 observed values you have in the worst case (20% of 300 samples) should be enough to estimate it. Study the distribution and see whether imputing the mean, median, group median, zero, etc. would make sense. Sadly, there is no single perfect approach here; you will have to test which one makes your model perform best.
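One way to run that comparison is to cross-validate the same model with different imputation strategies. The sketch below assumes scikit-learn and a binary target; the synthetic data, the logistic regression model, and the helper name `compare_imputers` are assumptions for illustration, not something taken from the question. The imputer sits inside the pipeline so it is re-fit on each training fold, and `add_indicator=True` appends a "was missing" flag per feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 5 features, ~80% missing.
rng = np.random.default_rng(0)
n_samples, n_features = 300, 5
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
X[:, 0] += 1.5 * y                        # make one feature informative
X[rng.random(X.shape) < 0.8] = np.nan     # values missing completely at random


def compare_imputers(X, y, strategies=("mean", "median", "most_frequent", "constant")):
    """Mean 5-fold ROC AUC for each SimpleImputer strategy (hypothetical helper)."""
    scores = {}
    for strategy in strategies:
        model = make_pipeline(
            # "constant" fills numeric columns with 0 by default.
            SimpleImputer(strategy=strategy, add_indicator=True),
            StandardScaler(),
            LogisticRegression(max_iter=1000),
        )
        fold_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        scores[strategy] = float(fold_scores.mean())
    return scores


imputer_scores = compare_imputers(X, y)
for strategy, auc in imputer_scores.items():
    print(f"{strategy:>14}: mean ROC AUC = {auc:.3f}")
```

Group-median imputation is not covered by `SimpleImputer`; it would need a custom step, fit on the training fold only.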
Other possible tricks:
You can keep the missing values as NaNs and handle them carefully, so that they don't affect the downstream steps of your pipeline. Use only the available data at each step.