How to handle features with more than 80% missing values

Posted on 2025-02-06 23:48:18

I'm working with a really messy clinical dataset: 300 samples and 400 features, to be used for machine learning. My advisor pointed out some biologically meaningful features in this dataset and asked me to keep them, but many of them are more than 50% missing, some even more than 80%. What should I do? Would filling them with the mode affect performance?


2 Answers

简单爱 2025-02-13 23:48:20


In short: model performance should not degrade even if the proportion of missing data is large, provided the data is missing at random and you impute it properly. However, choosing the proper method requires EDA and testing.

Are those features numeric or categorical? What about the target?

Even if they are biologically meaningful, that does not necessarily mean they affect the target.
If this is a classification problem, it would be a good idea to investigate the distribution of those variables given the target and run a t-test/U-test to check whether there is any statistically significant difference. If there is not, you have a valid reason to drop the feature. For the regression case, you may study mutual information, correlations and scatter plots. If both the feature and the target are categorical, run a chi-squared test, etc.
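
A minimal sketch of that screening step, assuming a pandas DataFrame `df` with a binary column `target` (all names are placeholders), using only the rows where each feature is actually observed:

```python
import pandas as pd
from scipy import stats

def screen_numeric(df: pd.DataFrame, feature: str, target: str = "target"):
    """Compare a numeric feature between the two classes using observed rows only."""
    observed = df.dropna(subset=[feature])
    g0 = observed.loc[observed[target] == 0, feature]
    g1 = observed.loc[observed[target] == 1, feature]
    _, t_p = stats.ttest_ind(g0, g1, equal_var=False)  # Welch t-test
    _, u_p = stats.mannwhitneyu(g0, g1)                # U-test, no normality assumption
    return {"n_observed": len(observed), "t_p": t_p, "u_p": u_p}

def screen_categorical(df: pd.DataFrame, feature: str, target: str = "target"):
    """Chi-squared test of independence between a categorical feature and the target."""
    table = pd.crosstab(df[feature], df[target])  # missing values are dropped automatically
    chi2, p, _, _ = stats.chi2_contingency(table)
    return {"n_observed": int(table.to_numpy().sum()), "chi2_p": p}
```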

Imputing numeric values might be tricky, since in biology we often have no idea of the underlying distribution. Still, the ~60 samples you have in the worst case (20% observed out of 300) should be enough for an estimate. You should study the distribution and see whether imputing the mean/median/group median/zero/etc. would make sense. Sadly, there is no one perfect way here; you will have to test what makes your model perform best.
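
One way to run that test is to put each candidate imputer inside a cross-validated pipeline and compare scores. A minimal sketch, assuming a numeric feature matrix `X`, labels `y`, and a classification task (all placeholders):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

strategies = {
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "zero":   SimpleImputer(strategy="constant", fill_value=0),
}

for name, imputer in strategies.items():
    # Imputation sits inside the pipeline, so it is re-fit on each training fold
    # and never sees the validation fold's statistics.
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```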

Other possible tricks (a combined sketch follows this list):

  • Try predicting missing values first (or use something like KNNImputer).
  • Impute a value of your choice and add a binary feature signifying whether this value is reliable.
  • Impute zeros and try a dimensionality reduction technique which handles sparse vectors (e.g. TruncatedSVD).
  • Try models which can handle missing data in a robust way (such as XGBoost).
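
A combined sketch of those four tricks, with `X` as a NaN-containing feature matrix and `y` as the target (both placeholders); XGBoost is assumed to be installed for the last one:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.impute import KNNImputer, MissingIndicator

# 1) Predict missing values from the k nearest neighbours in feature space.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)

# 2) Append binary "was this observed?" flags so the model can also learn
#    from the missingness pattern itself.
flags = MissingIndicator(features="all").fit_transform(X)
X_with_flags = np.hstack([X_knn, flags])

# 3) Zero-impute and reduce dimensionality with a sparse-friendly method.
X_zero = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0)
X_svd = TruncatedSVD(n_components=20).fit_transform(X_zero)

# 4) Or skip imputation: XGBoost learns a default split direction for NaNs.
# import xgboost as xgb
# model = xgb.XGBClassifier().fit(X, y)  # X may contain NaNs directly
```
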
心奴独伤 2025-02-13 23:48:20

You can keep the missing values as NaNs and handle them carefully, in such a way that they don't affect the downstream steps of your pipeline. Only use the available data to make meaningful steps.
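
For example, pandas statistics skip NaNs by default, and some models accept NaNs directly. A minimal sketch, with `df`, `feature_cols`, and `y` as placeholders:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# Summaries computed from the observed values only (NaNs are skipped by default).
observed_counts = df[feature_cols].notna().sum()
feature_medians = df[feature_cols].median()

# Standardize with statistics of the observed values; missing entries stay NaN.
X = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()

# Histogram-based gradient boosting in scikit-learn handles NaNs natively,
# so nothing ever has to be filled in.
model = HistGradientBoostingClassifier().fit(X, y)
```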
