How do I handle missing data with numpy/scipy?
One of the things I deal with most in data cleaning is missing values. R handles this well with its "NA" missing data label. In Python, it appears that I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving to Python for data analysis. Thanks.
Update: It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create masked arrays (thanks to the authors). I would like to see whether some of the newer statistical methods in Python (being developed in this year's GSoC) incorporate this aspect and at least do a complete case analysis.
Comments (4)
If you are willing to consider a library, pandas (http://pandas.pydata.org/) is built on top of numpy and, amongst many other things, provides built-in handling of missing data.
I've been using it for almost a year in the financial industry, where missing and badly aligned data are the norm, and it has really made my life easier.
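As a rough illustration of what that looks like in practice, here is a minimal sketch using a few standard pandas calls (`isna`, `dropna`, `fillna`). The column names and values are made up for the example and are not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing observation.
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, 13.0],
    "volume": [100.0, 150.0, np.nan, 120.0],
})

# Count the missing values in each column.
n_missing = df.isna().sum()

# Complete-case analysis: keep only rows with no missing values.
complete = df.dropna()

# Reductions such as mean() skip NaN by default.
col_means = df.mean()

# Replace each missing value with the mean of its column.
filled = df.fillna(col_means)
```

`dropna` gives the complete-case behaviour asked about in the question, while `fillna` is one simple way to keep every row.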
I also question the problem with masked arrays. Here are a couple of examples:
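The answer's own examples did not survive on this page. The following is only a sketch of what creating and using masked arrays can look like with `numpy.ma`. The sample values and the `-999` sentinel are assumptions chosen for illustration.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical readings where -999 is a sentinel meaning "missing".
raw = np.array([1.0, -999.0, 3.0, 4.0, -999.0])

# Mask every entry equal to the sentinel.
masked = ma.masked_values(raw, -999.0)

# The same result with an explicit boolean mask (True = missing).
explicit = ma.masked_array(raw, mask=[False, True, False, False, True])

# Mask NaN/inf entries in data that uses NaN as the missing marker.
masked_nan = ma.masked_invalid(np.array([1.0, np.nan, 3.0]))

# Reductions ignore the masked entries.
mean_val = masked.mean()      # mean of 1, 3 and 4
n_valid = masked.count()      # number of unmasked entries
nan_total = masked_nan.sum()  # sum of 1 and 3

# Convert back to a plain ndarray, filling the gaps with 0.
filled = masked.filled(0.0)
```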
Masked arrays are the answer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:
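The original snippet is missing from this page. A minimal sketch of the idea, assuming NaN marks the missing entries, might look like this. Only the name `valid_idx` comes from the answer; the arrays `x` and `y` are hypothetical.

```python
import numpy as np

# Hypothetical data: x has a missing value, y is a companion array.
x = np.array([1.0, np.nan, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Boolean array that is True wherever x holds a real value.
valid_idx = ~np.isnan(x)

# Indexing with the boolean array keeps only the valid entries.
x_valid = x[valid_idx]

# The same boolean array selects the matching entries of y.
y_valid = y[valid_idx]
```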
You can now use valid_idx as a quick mask on other data as well.
See sklearn.preprocessing.Imputer.
Example from http://scikit-learn.org/stable/modules/preprocessing.html#imputation
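The example itself did not survive on this page. As a rough stand-in that does not depend on a particular scikit-learn version, here is column-mean imputation written in plain numpy, which is the idea behind an imputer's mean strategy. This is not scikit-learn's implementation, and the data is made up. Note that recent scikit-learn releases replace `Imputer` with `sklearn.impute.SimpleImputer`.

```python
import numpy as np

# Hypothetical feature matrix; NaN marks a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Mean of each column, ignoring the NaN entries.
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its column
# (col_means broadcasts across the rows).
X_imputed = np.where(np.isnan(X), col_means, X)
```

Imputing with the mean keeps every row, unlike a complete case analysis, at the cost of understating the variance of the imputed column.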