Missing values in scikit-learn machine learning
Is it possible to have missing values in scikit-learn? How should they be represented? I couldn't find any documentation about that.

Comments (7)
Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them.

Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.

The above answer is outdated; the latest release of scikit-learn has a class Imputer that does simple, per-feature missing value imputation. You can feed it arrays containing NaNs to have those replaced by the mean, median or mode of the corresponding feature.
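In current scikit-learn releases that class has moved: Imputer was replaced by SimpleImputer in sklearn.impute. A minimal sketch of per-feature mean imputation on a toy matrix, assuming a reasonably recent version:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries encoded as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column; strategy can also be
# "median" or "most_frequent".
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # [[1. 2.] [4. 3.] [7. 2.5]]
```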
I wish I could provide a simple example, but I have found that RandomForestRegressor does not handle NaNs gracefully. Performance gets steadily worse when adding features with increasing percentages of NaNs. Features that have "too many" NaNs are completely ignored, even when the NaNs indicate very useful information.
This is because the algorithm will never create a split on the decision "isnan" or "ismissing". The algorithm will ignore a feature at a particular level of the tree if that feature has a single NaN in that subset of samples. But, at lower levels of the tree, when sample sizes are smaller, it becomes more likely that a subset of samples won't have a NaN in a particular feature's values, and a split can occur on that feature.
I have tried various imputation techniques to deal with the problem (replace with mean/median, predict missing values using a different model, etc.), but the results were mixed.
Instead, this is my solution: replace NaNs with a single, obviously out-of-range value (like -1.0). This enables the tree to split on the criterion "unknown value vs. known value". However, there is a strange side effect of using such out-of-range values: known values near the out-of-range value could get lumped together with it when the algorithm tries to find a good place to split. For example, known 0s could get lumped with the -1s used to replace the NaNs. So your model could change depending on whether your out-of-range value is less than the minimum or greater than the maximum (it could get lumped in with the minimum or maximum value, respectively). This may or may not help the generalization of the technique; the outcome will depend on how similar in behavior minimum- or maximum-value samples are to NaN-value samples.
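A minimal sketch of that sentinel-replacement idea on synthetic data, assuming a purely numeric feature matrix where -1.0 is safely below every observed value:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                  # features in [0, 1)
y = X[:, 0] + 0.5 * X[:, 1]           # toy target, built before masking
X[rng.rand(100, 3) < 0.2] = np.nan    # knock out ~20% of the entries

# Replace every NaN with an out-of-range sentinel so the trees can
# effectively split on "unknown value vs. known value".
X_filled = np.where(np.isnan(X), -1.0, X)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_filled, y)
```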
I have come across a very similar issue when running RandomForestRegressor on data. The presence of NA values was throwing out "nan" for predictions. From scrolling through several discussions, the documentation by Breiman recommends two solutions, one for continuous data and one for categorical data.

According to Breiman, the random nature of the algorithm and the number of trees will allow for the correction without too much effect on the accuracy of the prediction. I feel this would be the case if the NA values are sparse; a feature containing many NA values will most likely have an effect.
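The linked recommendations themselves are not reproduced here, but the usual rough fills along those lines (an assumption on my part: median for continuous features, most frequent level for categorical ones) can be expressed with SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Continuous feature: fill missing entries with the column median.
ages = np.array([[23.0], [np.nan], [35.0], [41.0]])
ages_filled = SimpleImputer(strategy="median").fit_transform(ages)

# Categorical feature: fill missing entries with the most frequent level.
colors = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)
colors_filled = SimpleImputer(strategy="most_frequent").fit_transform(colors)
```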
Replacing a missing value with a mean/median/other statistic may not solve the problem, as the fact that the value is missing may itself be significant. For example, in a survey on physical characteristics, a respondent may not put their height if they were embarrassed about being abnormally tall or small. This would imply that missing values indicate the respondent was unusually tall or small - the opposite of the median value.

What is necessary is a model that has a separate rule for missing values; any attempt to guess the missing value will likely reduce the predictive power of the model. For example:
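A model along these lines (offered as a stand-in, not necessarily what this answer originally linked to) is scikit-learn's HistGradientBoostingRegressor in recent versions: it accepts NaNs directly and learns, at each split, which branch samples with a missing value should take. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = 2.0 * X[:, 0] + X[:, 1]           # toy target, built before masking
X[rng.rand(200, 2) < 0.3] = np.nan    # introduce missing entries

# NaNs are passed through as-is; the model keeps its own rule for them
# instead of requiring imputation up front.
model = HistGradientBoostingRegressor(random_state=0)
model.fit(X, y)
print(model.predict([[0.5, np.nan]]))
```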
Orange is another Python machine learning library that has facilities dedicated to imputation. I have not had a chance to use them, but probably will soon, since the simple methods of replacing NaNs with zeros, averages, or medians all have significant problems.
I did encounter this problem. In a practical case, I found a package in R called missForest that handles this problem well, imputing the missing values and greatly enhancing my predictions.

Instead of simply replacing NAs with the median or mean, missForest replaces them with a prediction of what it thinks the missing value should be. It makes the predictions using a random forest trained on the observed values of the data matrix. It can run very slowly on a large data set that contains a large number of missing values, so there is a trade-off to this method.

A similar option in Python is predictive_imputer.
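If you would rather stay inside scikit-learn than reach for predictive_imputer, the (still experimental) IterativeImputer can be configured with a random-forest estimator to approximate the same idea. A minimal sketch on a toy matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [4.0, 8.0, 12.0]])

# Each feature with missing entries is modelled from the other features
# using a random forest, and the model's predictions fill the gaps;
# this is the same basic idea missForest implements in R.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```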
When you run into missing values in input features, the first order of business is not how to impute them. The most important question is WHY you should. Unless you have a clear and definitive idea of what the 'true' reality behind the data is, you may want to curtail the urge to impute. This is not about technique or packages in the first place.

Historically, we resorted to tree methods like decision trees mainly because some of us at least felt that imputing missing values in order to estimate a regression (linear regression, logistic regression, or even a NN) is distortive enough that we should have methods that do not require imputing 'among the columns'. This is the so-called informativeness of missingness, which should be a familiar concept to those familiar with, say, Bayesian methods.

If you are really modeling on big data, besides talking about it, chances are you face a large number of columns. In common feature-extraction practice such as text analytics, you may very well say that missing means count = 0. That is fine, because you know the root cause. The reality, especially when facing structured data sources, is that you don't know or simply don't have time to learn the root cause. When your engine forces you to plug in a value, be it NaN or some other placeholder the engine can tolerate, I would argue your model is only as good as how you impute, which does not make sense.

One intriguing question is: if we leave missingness to be judged by its close context inside the splitting process (first- or second-degree surrogates), does foresting actually make that contextual judgement moot, because the context per se is a random selection? This, however, is a 'better' problem; at least it does not hurt as much. It certainly should make preserving missingness unnecessary.

As a practical matter, if you have a large number of input features, you probably cannot have a 'good' strategy to impute at all. From a pure imputation perspective, the best practice is anything but univariate, which in the context of RF pretty much means using the RF itself to impute before modeling with it.

Therefore, unless somebody tells me (or us) "we are not able to do that", I think we should enable carrying forward missing 'cells', entirely bypassing the subject of how 'best' to impute.