
10 Minutes to Pandas


pandas is a Python data analysis library.

For installation, follow the tutorial on the official website; if you have Anaconda installed, pandas already comes bundled and does not need to be installed separately.
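
As a quick sanity check (a minimal sketch; any reasonably recent pandas version works for this tutorial), you can print the installed version:

import pandas as pd
print(pd.__version__)   # confirms that pandas can be imported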

In [1]:

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating Pandas Objects

pandas provides three basic data structures:

  • Series
    • 1D labeled homogeneously-typed array
  • DataFrame
    • General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
  • Panel
    • General 3D labeled, also size-mutable array (deprecated and removed in later pandas releases)

Series

A one-dimensional Series can be initialized from a one-dimensional list:

In [2]:

s = pd.Series([1,3,5,np.nan,6,8])

print(s)
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

By default the index of a Series consists of integers (a custom index can be supplied with an extra parameter), and all of its values share one type.
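
As a small sketch of the optional index parameter (the labels 'a', 'b', 'c' are made up for illustration):

s2 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
print(s2)   # values are now labelled 'a', 'b', 'c' instead of 0, 1, 2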

DataFrame

A DataFrame is a two-dimensional structure. We first construct a range of dates to use as the index of the first dimension:

In [3]:

dates = pd.date_range('20130101', periods=6)

print(dates)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Then create the DataFrame:

In [4]:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df

Out[4]:

                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-06 -2.163453 -0.010279  1.699886  1.291653

By default, if the index and columns parameters are not given, both are filled with integers starting from 0.
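
For example (a minimal sketch), constructing a DataFrame from a bare array leaves both labels as integer ranges:

pd.DataFrame(np.random.randn(3, 2))   # rows labelled 0, 1, 2 and columns labelled 0, 1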

Besides passing a two-dimensional array, we can also construct a DataFrame from a dictionary:

In [5]:

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })

df2

Out[5]:

   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo

Each key of the dictionary becomes a column, and its value can be anything that can be converted to a Series.

Unlike a Series, which requires all of its values to have the same type, a DataFrame only requires each column to have a uniform type:

In [6]:

df2.dtypes

Out[6]:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing Data

Head and tail

The head and tail methods show the first and the last few rows of the data respectively (5 rows by default):

In [7]:

df.head()

Out[7]:

                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

The last 3 rows:

In [8]:

df.tail(3)

Out[8]:

                   A         B         C         D
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-06 -2.163453 -0.010279  1.699886  1.291653

Index, columns, and values

The row index is accessed with the index attribute:

In [9]:

df.index

Out[9]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Column labels are accessed with the columns attribute:

In [10]:

df.columns

Out[10]:

Index([u'A', u'B', u'C', u'D'], dtype='object')

The underlying values are accessed with values:

In [11]:

df.values

Out[11]:

array([[-0.60593585, -0.86165752, -1.00192387,  1.52858443],
       [-0.16540784,  0.38833783,  1.18718697,  1.81981793],
       [ 0.06525454, -1.60807414, -1.2823306 , -0.28606716],
       [ 1.28930486,  0.49711531, -0.22535143,  0.04023897],
       [ 0.03823179,  0.87505664, -0.0925258 ,  0.93443212],
       [-2.16345271, -0.01027865,  1.69988608,  1.29165337]])

Summary statistics

View simple summary statistics:

In [12]:

df.describe()

Out[12]:

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.257001 -0.119917  0.047490  0.888110
std    1.126657  0.938705  1.182629  0.841529
min   -2.163453 -1.608074 -1.282331 -0.286067
25%   -0.495804 -0.648813 -0.807781  0.263787
50%   -0.063588  0.189030 -0.158939  1.113043
75%    0.058499  0.469921  0.867259  1.469352
max    1.289305  0.875057  1.699886  1.819818

Transpose

In [13]:

df.T

Out[13]:

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.605936   -0.165408    0.065255    1.289305    0.038232   -2.163453
B   -0.861658    0.388338   -1.608074    0.497115    0.875057   -0.010279
C   -1.001924    1.187187   -1.282331   -0.225351   -0.092526    1.699886
D    1.528584    1.819818   -0.286067    0.040239    0.934432    1.291653

Sorting

The sort_index(axis=0, ascending=True) method sorts by index labels; axis=0 means sorting along the row index.

In [14]:

df.sort_index(ascending=False)

Out[14]:

                   A         B         C         D
2013-01-06 -2.163453 -0.010279  1.699886  1.291653
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584

In [15]:

df.sort_index(axis=1, ascending=False)

Out[15]:

                   D         C         B         A
2013-01-01  1.528584 -1.001924 -0.861658 -0.605936
2013-01-02  1.819818  1.187187  0.388338 -0.165408
2013-01-03 -0.286067 -1.282331 -1.608074  0.065255
2013-01-04  0.040239 -0.225351  0.497115  1.289305
2013-01-05  0.934432 -0.092526  0.875057  0.038232
2013-01-06  1.291653  1.699886 -0.010279 -2.163453

The sort_values(by, axis=0, ascending=True) method sorts by the values of the given column(s), for example by column B:

In [16]:

df.sort_values(by="B")

Out[16]:

                   A         B         C         D
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-06 -2.163453 -0.010279  1.699886  1.291653
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

Indexing

Although DataFrame supports the usual Python/NumPy indexing syntax, the recommended accessors are .at, .iat, .loc and .iloc (older pandas versions also offered .ix, which has since been deprecated and removed).
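
As a quick sketch of the two main accessors, using the df and dates defined above:

df.loc[dates[0], 'A']    # label-based: row label and column name
df.iloc[0, 0]            # position-based: integer row and column offsets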

Selecting data

Select a single column:

In [17]:

df["A"]

Out[17]:

2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64

The attribute form df.A is equivalent:

In [18]:

df.A

Out[18]:

2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64
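
Note that attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute; bracket indexing always works. A minimal illustration with a hypothetical column name:

df3 = pd.DataFrame({'my col': [1, 2, 3]})   # hypothetical example, not used below
df3['my col']                               # works
# df3.my col  is not valid syntax; names with spaces need bracket indexing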

Use slicing to read multiple rows:

In [19]:

df[0:3]

Out[19]:

                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Index labels can also be used for slicing:

In [20]:

df["20130101":"20130103"]

Out[20]:

                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Selection by label

loc makes it easy to select by label:

In [21]:

df.loc[dates[0]]

Out[21]:

A   -0.605936
B   -0.861658
C   -1.001924
D    1.528584
Name: 2013-01-01 00:00:00, dtype: float64

Multiple columns:

In [22]:

df.loc[:,['A','B']]

Out[22]:

                   A         B
2013-01-01 -0.605936 -0.861658
2013-01-02 -0.165408  0.388338
2013-01-03  0.065255 -1.608074
2013-01-04  1.289305  0.497115
2013-01-05  0.038232  0.875057
2013-01-06 -2.163453 -0.010279

Select several rows and columns:

In [23]:

df.loc['20130102':'20130104',['A','B']]

Out[23]:

                   A         B
2013-01-02 -0.165408  0.388338
2013-01-03  0.065255 -1.608074
2013-01-04  1.289305  0.497115

Reducing the dimension of the result:

In [24]:

df.loc['20130102',['A','B']]

Out[24]:

A   -0.165408
B    0.388338
Name: 2013-01-02 00:00:00, dtype: float64

Getting a scalar value:

In [25]:

df.loc[dates[0],'B']

Out[25]:

-0.86165751902832299

For scalar values, at is faster:

In [26]:

%timeit -n100 df.loc[dates[0],'B']
%timeit -n100 df.at[dates[0],'B']

print(df.at[dates[0],'B'])
100 loops, best of 3: 329 µs per loop
100 loops, best of 3: 31.1 µs per loop
-0.861657519028

Selection by position

iloc selects by integer position:

In [27]:

df.iloc[3]

Out[27]:

A    1.289305
B    0.497115
C   -0.225351
D    0.040239
Name: 2013-01-04 00:00:00, dtype: float64

Contiguous slices:

In [28]:

df.iloc[3:5,0:2]

Out[28]:

                   A         B
2013-01-04  1.289305  0.497115
2013-01-05  0.038232  0.875057

Select non-contiguous positions:

In [29]:

df.iloc[[1,2,4],[0,2]]

Out[29]:

                   A         C
2013-01-02 -0.165408  1.187187
2013-01-03  0.065255 -1.282331
2013-01-05  0.038232 -0.092526

Whole rows:

In [30]:

df.iloc[1:3,:]

Out[30]:

                   A         B         C         D
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Whole columns:

In [31]:

df.iloc[:, 1:3]

Out[31]:

                   B         C
2013-01-01 -0.861658 -1.001924
2013-01-02  0.388338  1.187187
2013-01-03 -1.608074 -1.282331
2013-01-04  0.497115 -0.225351
2013-01-05  0.875057 -0.092526
2013-01-06 -0.010279  1.699886

A scalar value:

In [32]:

df.iloc[1,1]

Out[32]:

0.3883378290420279

Again, iat is faster for scalar access:

In [33]:

%timeit -n100 df.iloc[1,1]
%timeit -n100 df.iat[1,1]

df.iat[1,1]
100 loops, best of 3: 236 µs per loop
100 loops, best of 3: 14.5 µs per loop

Out[33]:

0.3883378290420279

Boolean indexing

All rows where column A is greater than 0:

In [34]:

df[df.A > 0]

Out[34]:

                   A         B         C         D
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

Keep only the values greater than 0:

In [35]:

df[df > 0]

Out[35]:

                   A         B         C         D
2013-01-01       NaN       NaN       NaN  1.528584
2013-01-02       NaN  0.388338  1.187187  1.819818
2013-01-03  0.065255       NaN       NaN       NaN
2013-01-04  1.289305  0.497115       NaN  0.040239
2013-01-05  0.038232  0.875057       NaN  0.934432
2013-01-06       NaN       NaN  1.699886  1.291653

Use the isin method to filter:

In [36]:

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']

df2

Out[36]:

                   A         B         C         D      E
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584    one
2013-01-02 -0.165408  0.388338  1.187187  1.819818    one
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067    two
2013-01-04  1.289305  0.497115 -0.225351  0.040239  three
2013-01-05  0.038232  0.875057 -0.092526  0.934432   four
2013-01-06 -2.163453 -0.010279  1.699886  1.291653  three

In [37]:

df2[df2['E'].isin(['two','four'])]

Out[37]:

                   A         B         C         D     E
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   two
2013-01-05  0.038232  0.875057 -0.092526  0.934432  four

Setting Values

In [38]:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

s1

Out[38]:

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

As with a dictionary, we can assign s1 directly to a new column F. The two objects are aligned on df's existing index: index entries missing from s1 become NaN, and extra entries in s1 are dropped:

In [39]:

df['F'] = s1

df

Out[39]:

                   A         B         C         D   F
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5

Or modify a single value with at or iat:

In [40]:

df.at[dates[0],'A'] = 0

df

Out[40]:

                   A         B         C         D   F
2013-01-01  0.000000 -0.861658 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5

In [41]:

df.iat[0, 1] = 0

df

Out[41]:

                   A         B         C         D   F
2013-01-01  0.000000  0.000000 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5

Set an entire column:

In [42]:

df.loc[:,'D'] = np.array([5] * len(df))

df

Out[42]:

                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.001924  5 NaN
2013-01-02 -0.165408  0.388338  1.187187  5   1
2013-01-03  0.065255 -1.608074 -1.282331  5   2
2013-01-04  1.289305  0.497115 -0.225351  5   3
2013-01-05  0.038232  0.875057 -0.092526  5   4
2013-01-06 -2.163453 -0.010279  1.699886  5   5

Set values that satisfy a condition:

In [43]:

df2 = df.copy()

df2[df2 > 0] = -df2

df2

Out[43]:

                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.001924 -5 NaN
2013-01-02 -0.165408 -0.388338 -1.187187 -5  -1
2013-01-03 -0.065255 -1.608074 -1.282331 -5  -2
2013-01-04 -1.289305 -0.497115 -0.225351 -5  -3
2013-01-05 -0.038232 -0.875057 -0.092526 -5  -4
2013-01-06 -2.163453 -0.010279 -1.699886 -5  -5

Missing Data

In [44]:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1

df1

Out[44]:

                   A         B         C  D   F    E
2013-01-01  0.000000  0.000000 -1.001924  5 NaN    1
2013-01-02 -0.165408  0.388338  1.187187  5   1    1
2013-01-03  0.065255 -1.608074 -1.282331  5   2  NaN
2013-01-04  1.289305  0.497115 -0.225351  5   3  NaN

Drop all rows that contain missing data, returning a new object:

In [45]:

df1.dropna(how='any')

Out[45]:

                   A         B         C  D  F  E
2013-01-02 -0.165408  0.388338  1.187187  5  1  1

Fill in the missing data:

In [46]:

df1.fillna(value=5)

Out[46]:

                   A         B         C  D  F  E
2013-01-01  0.000000  0.000000 -1.001924  5  5  1
2013-01-02 -0.165408  0.388338  1.187187  5  1  1
2013-01-03  0.065255 -1.608074 -1.282331  5  2  5
2013-01-04  1.289305  0.497115 -0.225351  5  3  5

Check where the missing values are:

In [47]:

pd.isnull(df1)

Out[47]:

                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
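
Another common option, shown here only as a sketch and not used further below, is to propagate the last valid value forward instead of filling with a constant:

df1.ffill()   # forward-fill missing values down each column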

Operations

Statistics

The mean of each column:

In [48]:

df.mean()

Out[48]:

A   -0.156012
B    0.023693
C    0.047490
D    5.000000
F    3.000000
dtype: float64

The mean of each row:

In [49]:

df.mean(1)

Out[49]:

2013-01-01    0.999519
2013-01-02    1.482023
2013-01-03    0.834970
2013-01-04    1.912214
2013-01-05    1.964153
2013-01-06    1.905231
Freq: D, dtype: float64

When operating on objects of different shapes, pandas automatically aligns them and applies its broadcasting rules:

In [50]:

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

print(s)
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64

Subtraction, df - s:

In [51]:

df.sub(s, axis='index')

Out[51]:

                   A         B         C   D   F
2013-01-01       NaN       NaN       NaN NaN NaN
2013-01-02       NaN       NaN       NaN NaN NaN
2013-01-03 -0.934745 -2.608074 -2.282331   4   1
2013-01-04 -1.710695 -2.502885 -3.225351   2   0
2013-01-05 -4.961768 -4.124943 -5.092526   0  -1
2013-01-06       NaN       NaN       NaN NaN NaN

apply

Similar to apply in R, it takes a function and, by default, applies it to each column:

In [52]:

df.apply(np.cumsum)

Out[52]:

                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.001924   5 NaN
2013-01-02 -0.165408  0.388338  0.185263  10   1
2013-01-03 -0.100153 -1.219736 -1.097067  15   3
2013-01-04  1.189152 -0.722621 -1.322419  20   6
2013-01-05  1.227383  0.152436 -1.414945  25  10
2013-01-06 -0.936069  0.142157  0.284941  30  15

The difference between the maximum and minimum of each column:

In [53]:

df.apply(lambda x: x.max() - x.min())

Out[53]:

A    3.452758
B    2.483131
C    2.982217
D    0.000000
F    4.000000
dtype: float64

Histogramming

In [54]:

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
0    2
1    5
2    6
3    6
4    6
5    3
6    5
7    0
8    4
9    4
dtype: int64

Histogram information (value counts):

In [55]:

print(s.value_counts())
6    3
5    2
4    2
3    1
2    1
0    1
dtype: int64

Plot the histogram:

In [56]:

h = s.hist()

[Figure: histogram of s — https://www.wenjiangs.com/wp-content/uploads/2022/docimg20/aXqn3so60qyBEXro-Z6woI6.png]

String methods

When a Series or a DataFrame column holds strings, the .str accessor provides vectorized string operations:

In [57]:

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

print(s.str.lower())
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
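
A couple more of the vectorized .str operations, as a sketch (the patterns are chosen purely for illustration):

print(s.str.contains('a'))            # element-wise substring test; NaN stays NaN
print(s.str.replace('dog', 'bird'))   # element-wise replacement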

Merging

Concatenation

In [58]:

df = pd.DataFrame(np.random.randn(10, 4))

df

Out[58]:

          0         1         2         3
0 -2.346373  0.105651 -0.048027  0.010637
1 -0.682198  0.943043  0.147312 -0.657871
2  0.515766 -0.768286  0.361570  1.146278
3 -0.607277 -0.003086 -1.499001  1.165728
4 -1.226279 -0.177246 -1.379631 -0.639261
5  0.807364 -1.855060  0.325968  1.898831
6  0.438539 -0.728131 -0.009924  0.398360
7  1.497457 -1.506314 -1.557624  0.869043
8  0.945985 -0.519435 -0.510359 -1.077751
9  1.597679 -0.285955 -1.060736  0.608629

The pd.concat function concatenates multiple pandas objects:

In [59]:

pieces = [df[:2], df[4:5], df[7:]]

pd.concat(pieces)

Out[59]:

          0         1         2         3
0 -2.346373  0.105651 -0.048027  0.010637
1 -0.682198  0.943043  0.147312 -0.657871
4 -1.226279 -0.177246 -1.379631 -0.639261
7  1.497457 -1.506314 -1.557624  0.869043
8  0.945985 -0.519435 -0.510359 -1.077751
9  1.597679 -0.285955 -1.060736  0.608629

Database-style joins

merge implements database-style join operations:

In [60]:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

print(left)
print(right)
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5

In [61]:

pd.merge(left, right, on='key')

Out[61]:

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
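
Because both frames repeat the key 'foo', the result above is the Cartesian product of the matching rows. With unique keys the join is one-to-one; a minimal sketch with new example data:

left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left2, right2, on='key')   # one row per key: (foo, 1, 4) and (bar, 2, 5)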

append

DataFrame 中添加行:

In [62]:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

df

Out[62]:

          A         B         C         D
0  1.587778 -0.110297  0.602245  1.212597
1 -0.551109  0.337387 -0.220919  0.363332
2  1.207373 -0.128394  0.619937 -0.612694
3 -0.978282 -1.038170  0.048995 -0.788973
4  0.843893 -1.079021  0.092212  0.485422
5 -0.056594  1.831206  1.910864 -1.331739
6 -0.487106 -1.495367  0.853440  0.410854
7  1.830852 -0.014893  0.254025  0.197422

Append the row at position 3 to the end:

In [63]:

s = df.iloc[3]

df.append(s, ignore_index=True)

Out[63]:

          A         B         C         D
0  1.587778 -0.110297  0.602245  1.212597
1 -0.551109  0.337387 -0.220919  0.363332
2  1.207373 -0.128394  0.619937 -0.612694
3 -0.978282 -1.038170  0.048995 -0.788973
4  0.843893 -1.079021  0.092212  0.485422
5 -0.056594  1.831206  1.910864 -1.331739
6 -0.487106 -1.495367  0.853440  0.410854
7  1.830852 -0.014893  0.254025  0.197422
8 -0.978282 -1.038170  0.048995 -0.788973
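
Note that DataFrame.append was deprecated and eventually removed in newer pandas releases; pd.concat does the same job. A sketch of the equivalent call for the example above:

pd.concat([df, s.to_frame().T], ignore_index=True)   # same result as df.append(s, ignore_index=True)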

Grouping

In [64]:

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

Out[64]:

     A      B         C         D
0  foo    one  0.773062  0.206503
1  bar    one  1.414609 -0.346719
2  foo    two  0.964174  0.706623
3  bar  three  0.182239 -1.516509
4  foo    two -0.096255  0.494177
5  bar    two -0.759471 -0.389213
6  foo    one -0.257519 -1.411693
7  foo  three -0.109368  0.241862

Group by the values of A:

In [65]:

df.groupby('A').sum()

Out[65]:

            C         D
A                      
bar  0.837377 -2.252441
foo  1.274094  0.237472

Group by the values of both A and B:

In [66]:

df.groupby(['A', 'B']).sum()

Out[66]:

                  C         D
A   B                        
bar one    1.414609 -0.346719
    three  0.182239 -1.516509
    two   -0.759471 -0.389213
foo one    0.515543 -1.205191
    three -0.109368  0.241862
    two    0.867919  1.200800

Reshaping

Stack

Create a DataFrame with a MultiIndex:

In [67]:

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df

Out[67]:

                     A         B
first second                    
bar   one    -0.109174  0.958551
      two    -0.254743 -0.975924
baz   one    -0.132039 -0.119009
      two     0.587063 -0.819037
foo   one    -0.754123  0.430747
      two    -0.426544  0.389822
qux   one    -0.382501 -0.562910
      two    -0.529287  0.826337

The stack method moves the column labels into a new level of the index:

In [68]:

df2 = df[:4]

stacked = df2.stack()

stacked

Out[68]:

first  second   
bar    one     A   -0.109174
               B    0.958551
       two     A   -0.254743
               B   -0.975924
baz    one     A   -0.132039
               B   -0.119009
       two     A    0.587063
               B   -0.819037
dtype: float64

unstack() moves the last index level back into the columns:

In [69]:

stacked.unstack()

Out[69]:

                     A         B
first second                    
bar   one    -0.109174  0.958551
      two    -0.254743 -0.975924
baz   one    -0.132039 -0.119009
      two     0.587063 -0.819037

Other levels can be specified as well:

In [70]:

stacked.unstack(1)

Out[70]:

second        one       two
first                      
bar   A -0.109174 -0.254743
      B  0.958551 -0.975924
baz   A -0.132039  0.587063
      B -0.119009 -0.819037

Time Series

Time series data are common in financial analysis:

In [71]:

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)

ts

Out[71]:

2012-03-06    1.096788
2012-03-07    0.029678
2012-03-08    0.511461
2012-03-09   -0.332369
2012-03-10    1.720321
Freq: D, dtype: float64

Localize to a standard (UTC) representation:

In [72]:

ts_utc = ts.tz_localize('UTC')

ts_utc

Out[72]:

2012-03-06 00:00:00+00:00    1.096788
2012-03-07 00:00:00+00:00    0.029678
2012-03-08 00:00:00+00:00    0.511461
2012-03-09 00:00:00+00:00   -0.332369
2012-03-10 00:00:00+00:00    1.720321
Freq: D, dtype: float64

Convert to a different time zone:

In [73]:

ts_utc.tz_convert('US/Eastern')

Out[73]:

2012-03-05 19:00:00-05:00    1.096788
2012-03-06 19:00:00-05:00    0.029678
2012-03-07 19:00:00-05:00    0.511461
2012-03-08 19:00:00-05:00   -0.332369
2012-03-09 19:00:00-05:00    1.720321
Freq: D, dtype: float64

Categoricals

In [74]:

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

df

Out[74]:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

The grades can be converted to a categorical type:

In [75]:

df["grade"] = df["raw_grade"].astype("category")

df["grade"]

Out[75]:

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to something more meaningful:

In [76]:

df["grade"].cat.categories = ["very good", "good", "very bad"]

df["grade"]

Out[76]:

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
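
In newer pandas versions, assigning to .cat.categories directly is no longer supported; rename_categories is the equivalent. A sketch of the same renaming:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])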

Add the missing categories (and set their order):

In [77]:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

Out[77]:

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Group by grade:

In [78]:

df.groupby("grade").size()

Out[78]:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

Use the ggplot style:

In [79]:

plt.style.use('ggplot')

Plotting a Series:

In [80]:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

p = ts.cumsum().plot()

[Figure: cumulative sum of ts plotted as a line — https://www.wenjiangs.com/wp-content/uploads/2022/docimg20/UeJdYrKU0QZ45FeH-vC4jkl.png]

Plot a DataFrame, one line per column:

In [81]:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])

df.cumsum().plot()
p = plt.legend(loc="best")

[Figure: cumulative sums of the four columns plotted with a legend — https://www.wenjiangs.com/wp-content/uploads/2022/docimg20/i3E0Ttf6GBShRSEV-GGNhW3.png]

File I/O

csv

Write to a file:

In [82]:

df.to_csv('foo.csv')

Read it back from the file:

In [83]:

pd.read_csv('foo.csv').head()

Out[83]:

   Unnamed: 0         A         B         C         D
0  2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
1  2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2  2000-01-03 -0.488692  1.709209 -0.602208  1.115456
3  2000-01-04 -0.440243  0.826692  0.321648 -0.351698
4  2000-01-05 -0.165684  1.297303  0.817233  0.174767

hdf5

Write to a file (HDF5 support requires the PyTables package):

In [84]:

df.to_hdf("foo.h5", "df")

Read it back:

In [85]:

pd.read_hdf('foo.h5','df').head()

Out[85]:

                   A         B         C         D
2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2000-01-03 -0.488692  1.709209 -0.602208  1.115456
2000-01-04 -0.440243  0.826692  0.321648 -0.351698
2000-01-05 -0.165684  1.297303  0.817233  0.174767

excel

Write to a file (this needs an Excel writer engine such as openpyxl):

In [86]:

df.to_excel('foo.xlsx', sheet_name='Sheet1')

Read it back:

In [87]:

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()

Out[87]:

                   A         B         C         D
2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2000-01-03 -0.488692  1.709209 -0.602208  1.115456
2000-01-04 -0.440243  0.826692  0.321648 -0.351698
2000-01-05 -0.165684  1.297303  0.817233  0.174767

Clean up the temporary files we just created:

In [88]:

import glob
import os

for f in glob.glob("foo*"):
    os.remove(f)
