11.2 Time Series Basics
A basic kind of time series object in pandas is a Series indexed by timestamps. Outside of pandas, those timestamps are often represented as Python strings or datetime objects:
import pandas as pd
import numpy as np
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64
Under the hood, these datetime objects have been put into a DatetimeIndex:
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08', '2011-01-10', '2011-01-12'], dtype='datetime64[ns]', freq=None)
As with any other Series, arithmetic operations between differently indexed time series automatically align on the dates:
ts[::2]
2011-01-02    0.384868
2011-01-07    2.553288
2011-01-10    1.180570
dtype: float64
ts + ts[::2]
2011-01-02    0.769735
2011-01-05         NaN
2011-01-07    5.106575
2011-01-08         NaN
2011-01-10    2.361140
2011-01-12         NaN
dtype: float64
ts[::2] selects every second element of ts, so the dates it skips have no partner in the addition and come out as NaN.
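To make the alignment easier to check by hand, the following sketch repeats the addition with fixed values instead of random ones. The names `ts_demo`, `every_other`, `aligned_sum`, and `filled_sum` are illustrative only and do not appear in the original notes; `Series.add` with `fill_value` is shown as one possible way to avoid the NaN entries.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

# Fixed values 1.0 .. 6.0 (an assumption for illustration) instead of random data.
ts_demo = pd.Series(np.arange(1.0, 7.0), index=dates)

# Positions 0, 2, 4: every second element.
every_other = ts_demo[::2]

# Plain addition aligns on the dates; dates missing from one operand become NaN.
aligned_sum = ts_demo + every_other

# Series.add with fill_value treats the missing operand as 0 instead.
filled_sum = ts_demo.add(every_other, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Whether NaN or a filled value is the right result depends on what a missing observation means in the data at hand.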
pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution:
ts.index.dtype
dtype('<M8[ns]')
Scalar values from a DatetimeIndex are pandas Timestamp objects:
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
A Timestamp can be substituted anywhere you would use a datetime object.
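A small sketch of what that interchangeability looks like in practice. The variable names are illustrative; the calls used (`isinstance`, comparison with `datetime`, `timedelta` arithmetic, and `Timestamp.to_pydatetime`) are standard.

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp is a subclass of datetime, so isinstance checks pass.
is_datetime = isinstance(stamp, datetime)

# It compares equal to a plain datetime for the same moment.
same_moment = stamp == datetime(2011, 1, 2)

# Standard-library timedelta arithmetic works and yields a Timestamp.
next_day = stamp + timedelta(days=1)

# An explicit conversion back to a plain datetime is available when needed.
as_datetime = stamp.to_pydatetime()

print(is_datetime, same_moment, next_day, type(as_datetime))
```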
1 Indexing, Selection, Subsetting
A time series behaves like any other pandas.Series when you index and select data based on labels:
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64
stamp = ts.index[2]
ts[stamp]
2.5532875030792592
As a convenience, you can also pass a string that can be interpreted as a date:
ts['1/10/2011']
1.1805698813038874
ts['20110110']
1.1805698813038874
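The sketch below checks that several common spellings of the same date select the same element. It uses fixed values and the illustrative name `ts_demo`; `.loc` is included as the explicit label-based form, which is an addition to the original notes rather than something they show.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
# Fixed values 0.0 .. 5.0 so each lookup is easy to verify.
ts_demo = pd.Series(np.arange(6.0), index=idx)

by_slash = ts_demo['1/10/2011']       # month/day/year
by_compact = ts_demo['20110110']      # compact YYYYMMDD
by_iso = ts_demo['2011-01-10']        # ISO 8601
by_loc = ts_demo.loc['2011-01-10']    # explicit label-based lookup

print(by_slash, by_compact, by_iso, by_loc)
```

Ambiguous strings such as '1/10/2011' are read month-first here, so the ISO form is the safer choice when day and month could be confused.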
For longer time series, you can pass just a year, or a year and month, to select slices of data:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01   -0.801668
2000-01-02   -0.325797
2000-01-03    0.047318
2000-01-04    0.239576
2000-01-05   -0.467691
2000-01-06    1.394063
2000-01-07    0.416262
2000-01-08   -0.739839
2000-01-09   -1.504631
2000-01-10   -0.798753
2000-01-11    0.758856
2000-01-12    1.163517
2000-01-13    1.233826
2000-01-14    0.675056
2000-01-15   -1.079219
2000-01-16    0.212076
2000-01-17   -0.242134
2000-01-18   -0.318024
2000-01-19    0.040686
2000-01-20   -1.342025
2000-01-21   -0.130905
2000-01-22   -0.122308
2000-01-23   -0.924727
2000-01-24    0.071544
2000-01-25    0.483302
2000-01-26   -0.264231
2000-01-27    0.815791
2000-01-28    0.652885
2000-01-29    0.203818
2000-01-30    0.007890
...
2002-08-28   -2.375283
2002-08-29    0.843647
2002-08-30    0.069483
2002-08-31   -1.151590
2002-09-01   -2.348154
2002-09-02   -0.309723
2002-09-03   -1.017466
2002-09-04   -2.078659
2002-09-05   -1.828568
2002-09-06    0.546299
2002-09-07    0.861304
2002-09-08   -0.823128
2002-09-09   -0.150047
2002-09-10   -1.984674
2002-09-11    0.468010
2002-09-12   -0.066440
2002-09-13   -1.629502
2002-09-14    0.044870
2002-09-15    0.007970
2002-09-16    0.812104
2002-09-17   -1.835575
2002-09-18   -0.218055
2002-09-19   -0.271351
2002-09-20   -1.852212
2002-09-21    0.546462
2002-09-22    0.776960
2002-09-23   -1.140997
2002-09-24   -2.213685
2002-09-25   -0.586588
2002-09-26   -1.472430
Freq: D, dtype: float64
longer_ts['2001']
2001-01-01    0.588405
2001-01-02   -3.027909
2001-01-03   -0.492280
2001-01-04   -0.807809
2001-01-05   -0.124139
2001-01-06   -0.198966
2001-01-07    2.015447
2001-01-08    1.454119
2001-01-09    0.157505
2001-01-10    1.077689
2001-01-11   -0.246538
2001-01-12   -0.865122
2001-01-13   -0.082186
2001-01-14    1.928050
2001-01-15    0.320741
2001-01-16    0.473770
2001-01-17    0.036649
2001-01-18    1.405034
2001-01-19    0.560502
2001-01-20   -0.695138
2001-01-21    3.318884
2001-01-22    1.704966
2001-01-23    0.145167
2001-01-24    0.366667
2001-01-25   -0.565675
2001-01-26    0.940406
2001-01-27   -1.468772
2001-01-28    0.098759
2001-01-29    0.267449
2001-01-30   -0.221643
...
2001-12-02    0.002522
2001-12-03   -0.046712
2001-12-04    1.825249
2001-12-05   -1.000655
2001-12-06   -0.807582
2001-12-07    0.750439
2001-12-08    1.531707
2001-12-09   -0.195239
2001-12-10   -0.087465
2001-12-11   -0.041450
2001-12-12    1.992200
2001-12-13   -0.294916
2001-12-14    1.215363
2001-12-15    0.029039
2001-12-16   -0.165380
2001-12-17    1.192535
2001-12-18    0.737760
2001-12-19    0.044022
2001-12-20    0.582560
2001-12-21   -0.213569
2001-12-22   -0.024512
2001-12-23   -1.140873
2001-12-24   -1.351333
2001-12-25    0.725253
2001-12-26   -0.943740
2001-12-27   -2.134039
2001-12-28   -0.548597
2001-12-29    1.497907
2001-12-30   -0.594708
2001-12-31    0.068177
Freq: D, dtype: float64
Here, the string '2001' is interpreted as a year and selects that whole period. This also works if you specify the month:
longer_ts['2001-05']
2001-05-01   -0.560227
2001-05-02    2.160259
2001-05-03   -0.826092
2001-05-04   -0.183020
2001-05-05   -0.294708
2001-05-06   -1.210785
2001-05-07    0.609260
2001-05-08   -1.155377
2001-05-09   -0.127132
2001-05-10    0.576327
2001-05-11   -0.955206
2001-05-12   -2.002019
2001-05-13   -0.969865
2001-05-14    0.820993
2001-05-15    0.557336
2001-05-16   -0.262222
2001-05-17   -0.086760
2001-05-18    0.151608
2001-05-19    1.097604
2001-05-20    0.212148
2001-05-21   -1.164944
2001-05-22   -0.100020
2001-05-23    0.734738
2001-05-24    1.730438
2001-05-25    1.352858
2001-05-26    0.644984
2001-05-27    0.997554
2001-05-28    1.434452
2001-05-29    0.395946
2001-05-30   -0.142523
2001-05-31    1.205485
Freq: D, dtype: float64
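Because the outputs above are long, a quick way to confirm what a partial date string selected is to check the size and endpoints of the result. This is a minimal sketch with sequential integers in place of random data; `longer_demo`, `year_2001`, and `may_2001` are illustrative names, and `.loc` is used as the explicit label-based form.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01, valued 0 .. 999.
longer_demo = pd.Series(np.arange(1000),
                        index=pd.date_range('2000-01-01', periods=1000))

year_2001 = longer_demo.loc['2001']      # every day in 2001
may_2001 = longer_demo.loc['2001-05']    # every day in May 2001

print(len(year_2001), year_2001.index[0], year_2001.index[-1])
print(len(may_2001), may_2001.index[0], may_2001.index[-1])
```

The year selection has 365 rows and the May selection has 31, which matches the calendar.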
Indexing with a datetime object works as well:
ts[datetime(2011, 1, 7)]
2.5532875030792592
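Passing a single datetime, as above, returns one value. To get a range instead, the datetime goes into slice syntax. The sketch below uses fixed values and illustrative names (`ts_demo`, `from_jan_7`, `up_to_jan_7`); note that label-based slices include both endpoints.

```python
from datetime import datetime

import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts_demo = pd.Series(np.arange(6.0), index=idx)

# A single datetime label returns a scalar.
one_value = ts_demo.loc[datetime(2011, 1, 7)]

# An open-ended slice returns everything from that date onward.
from_jan_7 = ts_demo.loc[datetime(2011, 1, 7):]

# Slicing up to a date keeps the endpoint itself.
up_to_jan_7 = ts_demo.loc[:datetime(2011, 1, 7)]

print(one_value)
print(from_jan_7)
print(up_to_jan_7)
```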
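Because most time series data is ordered chronologically, you can slice with timestamps to select a range of dates. The endpoints do not have to be present in the index:
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64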
ts['1/6/2011':'1/11/2011']
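2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
dtype: float64
Remember that, in the pandas version used here, slicing this way produces a view on the source time series, so modifying the slice also modifies the original data. With Copy-on-Write enabled (the default from pandas 3.0), such modifications no longer propagate to the original.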
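Since the view behaviour depends on the pandas version and settings, one defensive option is to take an explicit copy before modifying a slice. This is a sketch of that habit, not something the original notes prescribe; `ts_demo` and `window` are illustrative names.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts_demo = pd.Series(np.arange(6.0), index=idx)

# .copy() detaches the slice from the source, whatever the pandas version.
window = ts_demo.loc['2011-01-06':'2011-01-11'].copy()

# Changing the copy leaves the original series untouched.
window.iloc[0] = 100.0

print(window)
print(ts_demo)
```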
There is an equivalent instance method, truncate, that slices a Series between two dates:
ts.truncate(after='1/9/2011')
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
dtype: float64
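The call above only sets `after`. As a sketch, `truncate` also accepts `before`, and with both set it gives the same result as a label slice with the same endpoints. The names `ts_demo`, `between`, and `sliced` are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts_demo = pd.Series(np.arange(6.0), index=idx)

# Drop everything before Jan 5 and everything after Jan 9.
between = ts_demo.truncate(before='2011-01-05', after='2011-01-09')

# The same selection written as a label-based slice.
sliced = ts_demo.loc['2011-01-05':'2011-01-09']

print(between)
print(between.equals(sliced))
```

Both forms expect the index to be sorted, which is the usual case for time series.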
All of this holds true for DataFrame as well, where the indexing applies to its rows:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']
| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | -0.477517 | 0.722685 | 0.337141 | -0.345072 |
| 2001-05-09 | -0.401860 | -0.475821 | 0.685129 | -0.809288 |
| 2001-05-16 | 1.900541 | 0.348590 | -0.805042 | -0.410077 |
| 2001-05-23 | -0.220870 | 1.654666 | -0.846395 | -0.207802 |
| 2001-05-30 | 2.094319 | -0.972588 | 1.276059 | -1.056146 |
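The row selection can be combined with a column label in the same `.loc` call. The sketch below uses sequential integers instead of random data so the shapes are easy to verify; `long_demo`, `may_2001`, and `texas_may` are illustrative names.

```python
import numpy as np
import pandas as pd

# 100 consecutive Wednesdays, starting with the first one in 2000.
dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
long_demo = pd.DataFrame(np.arange(400).reshape(100, 4),
                         index=dates,
                         columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# All rows whose date falls in May 2001.
may_2001 = long_demo.loc['2001-05']

# The same rows, restricted to one column (returns a Series).
texas_may = long_demo.loc['2001-05', 'Texas']

print(may_2001)
print(texas_may)
```

May 2001 contains five Wednesdays, so both results have five rows.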
2 Time Series with Duplicate Indices
In some datasets, multiple observations may fall on the same timestamp:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64
We can tell that the index is not unique by checking its is_unique property:
dup_ts.index.is_unique
False
Indexing into this time series now produces either a scalar value or a slice, depending on whether the timestamp is duplicated:
dup_ts['1/3/2000'] # not duplicated
4
dup_ts['1/2/2000'] # duplicated
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64
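Code that has to handle both cases may prefer a result of predictable type. One option, sketched here with illustrative names, is to slice with the same label on both ends, which yields a Series whether or not the label is duplicated; `DatetimeIndex.duplicated` shows which rows share a timestamp.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_demo = pd.Series(np.arange(5), index=dates)

single = dup_demo.loc['2000-01-03']     # unique label: scalar
several = dup_demo.loc['2000-01-02']    # duplicated label: Series

# A slice with identical endpoints is a Series in both cases.
always_series = dup_demo.loc['2000-01-03':'2000-01-03']

# keep=False marks every row whose timestamp occurs more than once.
dup_mask = dup_demo.index.duplicated(keep=False)

print(single)
print(several)
print(always_series)
print(dup_mask)
```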
Suppose you want to aggregate the data that has non-unique timestamps. One way to do this is to use groupby and pass level=0:
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int64
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
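Two related options, sketched under the same setup: computing several aggregates in one pass with `agg`, and dropping repeated timestamps outright when only one observation per timestamp is wanted. The names `dup_demo`, `summary`, and `deduped` are illustrative, and keeping the first observation is an arbitrary choice for the example. In recent pandas versions the mean column comes back as floating point.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_demo = pd.Series(np.arange(5), index=dates)

# Several aggregates per timestamp, returned as a DataFrame.
summary = dup_demo.groupby(level=0).agg(['first', 'last', 'mean', 'count'])

# Alternatively, keep only the first observation for each timestamp.
deduped = dup_demo[~dup_demo.index.duplicated(keep='first')]

print(summary)
print(deduped)
```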