How do I handle missing data with numpy/scipy?

Posted on 2024-08-03 15:50:15

One of the things I deal with most in data cleaning is missing values. R handles this well with its "NA" missing-data label. In Python, it appears that I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving to Python for data analysis. Thanks.

Update: It's obviously been a while since I looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create masked arrays (thanks to the authors). I would like to see whether some of the newer statistical methods in Python (being developed in this year's GSoC) incorporate this aspect and at least do a complete-case analysis.

Comments (4)

慵挽 2024-08-10 15:50:15

If you are willing to consider a library, pandas (http://pandas.pydata.org/) is a library built on top of numpy which amongst many other things provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost a year in the financial industry, where missing and badly aligned data is the norm, and it has really made my life easier.
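To make the alignment and missing-data handling described above concrete, here is a minimal sketch. The series names and values are made up for illustration; only standard pandas calls (`Series`, `mean`, `dropna`, `fillna`) are used.

```python
import numpy as np
import pandas as pd

# Two hypothetical series with partially overlapping labels and one NaN
s1 = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels; labels missing on either side yield NaN
total = s1 + s2

# Reductions skip NaN by default
mean_skipna = s1.mean()

# Complete-case analysis: keep only the rows without missing values
complete = total.dropna()

# Simple imputation: replace NaN with the mean of the observed values
filled = s1.fillna(s1.mean())

print(total)
print(complete)
print(filled)
```

Whether to drop or fill missing values depends on the analysis; the sketch only shows that both are one-liners.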

寻梦旅人 2024-08-10 15:50:15

I also question the problem with masked arrays. Here are a couple of examples:

import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value

data[data>6] = np.ma.masked # Mask any value greater than 6

# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))
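Once a masked array exists, the basic statistics ignore the masked entries. The following is a small sketch along those lines; the sentinel value -999.0 and the sample numbers are assumptions chosen for illustration, not something taken from the question.

```python
import numpy as np

# Hypothetical raw data: NaN and the sentinel -999.0 both mean "missing"
raw = np.array([1.0, 2.0, np.nan, 4.0, -999.0])

# Build the mask explicitly, then wrap the data in a masked array
missing = np.isnan(raw) | (raw == -999.0)
data = np.ma.masked_array(raw, mask=missing)

n_valid = data.count()        # number of unmasked values
mean = data.mean()            # mean over unmasked values only
valid_only = data.compressed()  # plain ndarray of the unmasked values
zero_filled = data.filled(0.0)  # plain ndarray with masked slots set to 0.0

print(n_valid, mean, valid_only, zero_filled)
```

`compressed()` is the masked-array equivalent of dropping missing entries, while `filled()` substitutes a chosen value for them.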

脱离于你 2024-08-10 15:50:15

Masked arrays are the answer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:

>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0  # pretend that odd elements are missing

>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])

You can now use valid_idx as a quick mask on other data as well:

>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])
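The same boolean-mask idea works when missing values are stored as NaN, which is a common convention for float data. The sketch below is an illustration with made-up arrays; it builds a "complete cases" mask across two variables and also shows NumPy's NaN-aware reduction `np.nanmean`.

```python
import numpy as np

# Hypothetical paired observations with NaN marking missing entries
x = np.array([1.0, np.nan, 3.0, 4.0])
y = np.array([10.0, 20.0, np.nan, 40.0])

# True only where both variables are observed (complete cases)
valid = ~np.isnan(x) & ~np.isnan(y)

x_complete = x[valid]
y_complete = y[valid]

# NaN-aware mean of a single variable, ignoring its missing entries
x_mean = np.nanmean(x)

print(x_complete, y_complete, x_mean)
```

Unlike a masked array, this approach does not carry the mask along with the data, so the mask has to be reapplied after each operation that might introduce new NaN values.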

听风念你 2024-08-10 15:50:15

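See sklearn.impute.SimpleImputer (the replacement for sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22):

import numpy as np
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column, as learned from the fit data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))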

Example from http://scikit-learn.org/
