What's new in v0.25.0 (July 18, 2019)
Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Plan for dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray.

Warning

read_pickle() and read_msgpack() are only guaranteed to be backwards compatible back to pandas version 0.20.3 (GH27082).

These are the changes in pandas v0.25.0. See Release notes for a full changelog including other versions of pandas.
Enhancements

Groupby aggregation with relabeling

Pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [2]: animals
Out[2]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
[4 rows x 3 columns]
In [3]: animals.groupby("kind").agg(
...: min_height=pd.NamedAgg(column='height', aggfunc='min'),
...: max_height=pd.NamedAgg(column='height', aggfunc='max'),
...: average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
...: )
...:
Out[3]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. Pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
...: min_height=('height', 'min'),
...: max_height=('height', 'max'),
...: average_weight=('weight', np.mean),
...: )
...:
Out[4]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply.
In [5]: animals.groupby("kind").height.agg(
...: min_height="min",
...: max_height="max",
...: )
...:
Out[5]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.
Groupby aggregation with multiple lambdas

You can now provide multiple lambda functions to a list-like aggregation in pandas.core.groupby.GroupBy.agg (GH26430).
In [6]: animals.groupby('kind').height.agg([
...: lambda x: x.iloc[0], lambda x: x.iloc[-1]
...: ])
...:
Out[6]:
<lambda_0> <lambda_1>
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
In [7]: animals.groupby('kind').agg([
...: lambda x: x.iloc[0] - x.iloc[1],
...: lambda x: x.iloc[0] + x.iloc[1]
...: ])
...:
Out[7]:
height weight
<lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind
cat -0.4 18.6 -2.0 17.8
dog -28.0 40.0 -190.5 205.5
[2 rows x 4 columns]
Previously, these raised a SpecificationError.

Better repr for MultiIndex

Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex (GH13480).

The repr now looks like this:
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]:
MultiIndex([( 'a', 0),
( 'a', 1),
( 'a', 2),
( 'a', 3),
( 'a', 4),
( 'a', 5),
( 'a', 6),
( 'a', 7),
( 'a', 8),
( 'a', 9),
...
('abc', 490),
('abc', 491),
('abc', 492),
('abc', 493),
('abc', 494),
('abc', 495),
('abc', 496),
('abc', 497),
('abc', 498),
('abc', 499)],
length=1000)
In previous versions, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
...: codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it's wider than options.display.width (default: 80 characters).

Shorter truncated repr for Series and DataFrame

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, that still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

- For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
- For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).

This dual option allows you to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behavior of a single threshold, set pd.options.display.min_rows = None.
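The interplay between the two options can be sketched as follows (a minimal illustration; exact line counts depend on your display settings):

```python
import pandas as pd

s = pd.Series(range(1000))

# With display.min_rows=10 (the new default), a long Series is
# truncated to 10 data rows (first and last 5), a "..." marker,
# and the length/dtype footer.
pd.set_option("display.min_rows", 10)
short_repr = repr(s)

# Setting min_rows to None restores the single-threshold behavior:
# the truncated repr shows up to display.max_rows (default 60) rows.
pd.set_option("display.min_rows", None)
long_repr = repr(s)

print(len(short_repr.splitlines()), len(long_repr.splitlines()))
pd.reset_option("display.min_rows")
```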
JSON normalize with max_level param support

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over which level to end normalization (GH23843). With max_level=1 the following snippet normalizes until the first nesting level of the provided dict.
In [9]: from pandas.io.json import json_normalize
In [10]: data = [{
....: 'CreatedBy': {'Name': 'User001'},
....: 'Lookup': {'TextField': 'Some text',
....: 'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
....: 'Image': {'a': 'b'}
....: }]
....:
In [11]: json_normalize(data, max_level=1)
Out[11]:
CreatedBy.Name Lookup.TextField Lookup.UserField Image.a
0 User001 Some text {'Id': 'ID001', 'Name': 'Name001'} b
[1 rows x 4 columns]
Series.explode to split list-like values to rows

Series and DataFrame have gained the DataFrame.explode() methods to transform list-likes to individual rows. See the section on Exploding list-like column in the docs for more information (GH16538, GH10511).

Here is a typical use case. You have comma-separated strings in a column.
In [12]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
....: {'var1': 'd,e,f', 'var2': 2}])
....:
In [13]: df
Out[13]:
var1 var2
0 a,b,c 1
1 d,e,f 2
[2 rows x 2 columns]
Creating a long-form DataFrame is now straightforward using chained operations:
In [14]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[14]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
[6 rows x 2 columns]
Other enhancements

- `DataFrame.plot()` keywords `logy`, `logx` and `loglog` can now accept the value `'sym'` for symlog scaling (GH24867)
- Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using `to_datetime()` (GH16607)
- Indexing of `DataFrame` and `Series` now accepts zerodim `np.ndarray` (GH24919)
- `Timestamp.replace()` now supports the `fold` argument to disambiguate DST transition times (GH25017)
- `DataFrame.at_time()` and `Series.at_time()` now support `datetime.time` objects with timezones (GH24043)
- `DataFrame.pivot_table()` now accepts an `observed` parameter which is passed to underlying calls to `DataFrame.groupby()` to speed up grouping categorical data (GH24923)
- `Series.str` has gained a `Series.str.casefold()` method to remove all case distinctions present in a string (GH25405)
- `DataFrame.set_index()` now works for instances of `abc.Iterator`, provided their output is of the same length as the calling frame (GH22484, GH24984)
- `DatetimeIndex.union()` now supports the `sort` argument. The behavior of the sort parameter matches that of `Index.union()` (GH24994)
- `RangeIndex.union()` now supports the `sort` argument. If `sort=False` an unsorted `Int64Index` is always returned. `sort=None` is the default and returns a monotonically increasing `RangeIndex` if possible or a sorted `Int64Index` if not (GH24471)
- `TimedeltaIndex.intersection()` now also supports the `sort` keyword (GH24471)
- `DataFrame.rename()` now supports the `errors` argument to raise errors when attempting to rename nonexistent keys (GH13473)
- Added Sparse accessor for working with a `DataFrame` whose values are sparse (GH25681)
- `RangeIndex` has gained `start`, `stop`, and `step` attributes (GH25710)
- `datetime.timezone` objects are now supported as arguments to timezone methods and constructors (GH25065)
- `DataFrame.query()` and `DataFrame.eval()` now support quoting column names with backticks to refer to names with spaces (GH6508)
- `merge_asof()` now gives a more clear error message when merge keys are categoricals that are not equal (GH26136)
- `pandas.core.window.Rolling()` supports exponential (or Poisson) window type (GH21303)
- Error message for missing required imports now includes the original import error's text (GH23868)
- `DatetimeIndex` and `TimedeltaIndex` now have a `mean` method (GH24757)
- `DataFrame.describe()` now formats integer percentiles without decimal point (GH26660)
- Added support for reading SPSS .sav files using `read_spss()` (GH26537)
- Added new option `plotting.backend` to be able to select a plotting backend different than the existing `matplotlib` one. Use `pandas.set_option('plotting.backend', '<backend-module>')` where `<backend-module>` is a library implementing the pandas plotting API (GH14130)
- `pandas.offsets.BusinessHour` supports multiple opening hours intervals (GH15481)
- `read_excel()` can now use `openpyxl` to read Excel files via the `engine='openpyxl'` argument. This will become the default in a future release (GH11499)
- `pandas.io.excel.read_excel()` supports reading OpenDocument tables. Specify `engine='odf'` to enable. Consult the IO User Guide for more details (GH9070)
- `Interval`, `IntervalIndex`, and `IntervalArray` have gained an `is_empty` attribute denoting if the given interval(s) are empty (GH27219)
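Two of the items above can be shown in a short sketch (the DataFrame and its column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"column name": [1, 2, 3], "flag": ["A", "ß", "c"]})

# DataFrame.query: backticks allow referring to a column whose
# name contains a space (GH6508).
res = df.query("`column name` > 1")
print(res)

# Series.str.casefold removes all case distinctions, including
# special cases such as the German eszett (GH25405).
folded = df["flag"].str.casefold()
print(folded.tolist())
```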
Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing (GH24076, GH16785).
In [15]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [16]: df
Out[16]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
0
2019-01-01 00:00:00-08:00 0
New behavior:
In [17]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[17]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
MultiIndex constructed from levels and codes

Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed and NaN levels' corresponding codes would be reassigned as -1 (GH19387).
Previous behavior:
In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
...: codes=[[0, -1, 1, 2, 3, 4]])
...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
codes=[[0, -2]])
New behavior:
In [18]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
....: codes=[[0, -1, 1, 2, 3, 4]])
....:
Out[18]:
MultiIndex([(nan,),
(nan,),
(nan,),
(nan,),
(128,),
( 2,)],
)
In [19]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
270
271 if verify_integrity:
--> 272 new_codes = result._verify_integrity()
273 result._codes = new_codes
274
/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
348 raise ValueError(
349 "On level {level}, code value ({code})"
--> 350 " < -1".format(level=i, code=level_codes.min())
351 )
352 if not level.is_unique:
ValueError: On level 0, code value (-2) < -1
GroupBy.apply on DataFrame evaluates first group only once

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417).

Now every group is evaluated only a single time.
In [20]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [21]: df
Out[21]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
In [22]: def func(group):
....: print(group.name)
....: return group
....:
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [23]: df.groupby("a").apply(func)
x
y
Out[23]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Concatenating sparse values

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).
In [24]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
Previous behavior:
In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:
In [25]: type(pd.concat([df, df]))
Out[25]: pandas.core.frame.DataFrame
This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
The .str-accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.
Previous behavior:
In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [2]: s
Out[2]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0 True
1 False
2 False
dtype: bool
New behavior:
In [26]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [27]: s
Out[27]:
0 b'a'
1 b'ba'
2 b'cba'
Length: 3, dtype: object
In [28]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')
/pandas/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1840 )
1841 )
-> 1842 raise TypeError(msg)
1843 return func(self, *args, **kwargs)
1844
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
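Under the stricter checks, bytes data should be decoded before using the string methods; a minimal sketch using the same Series as above:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# .str.decode is one of the few .str methods still permitted on
# bytes data; after decoding, the full .str API is available again.
decoded = s.str.decode("utf-8")
print(decoded.str.startswith("a").tolist())
```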
Categorical dtypes are preserved during groupby

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. Pandas now will preserve these dtypes (GH18502).
In [29]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [30]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [31]: df
Out[31]:
payload col
0 -1 foo
1 -2 bar
2 -1 bar
3 -2 qux
[4 rows x 2 columns]
In [32]: df.dtypes
Out[32]:
payload int64
col category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [33]: df.groupby('payload').first().col.dtype
Out[33]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
Incompatible Index type unions

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).
Previous behavior:
In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects
In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:
In [34]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[34]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [35]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[35]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
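The integer/float coercion can be sketched as follows (a minimal example; the toy indexes are made up):

```python
import pandas as pd

left = pd.Index([1, 2, 3])   # integer dtype
right = pd.Index([2.5])      # floating dtype

# Integer and floating indexes are "compatible": the union is
# computed after coercing the integer values to floating point,
# and with the default sort=None the result is sorted.
result = left.union(right)
print(result)
print(result.dtype)
```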
DataFrame groupby ffill/bfill no longer return group labels

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (GH21521).
In [36]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [37]: df
Out[37]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [38]: df.groupby("a").ffill()
Out[38]:
b
0 1
1 2
[2 rows x 1 columns]
DataFrame describe on an empty categorical / object column will return top and freq

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397).
In [39]: df = pd.DataFrame({"empty_col": pd.Categorical([])})
In [40]: df
Out[40]:
Empty DataFrame
Columns: [empty_col]
Index: []
[0 rows x 1 columns]
Previous behavior:
In [3]: df.describe()
Out[3]:
empty_col
count 0
unique 0
New behavior:
In [41]: df.describe()
Out[41]:
empty_col
count 0
unique 0
top NaN
freq NaN
[4 rows x 1 columns]
__str__ methods now call __repr__ rather than vice versa

Pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method, if a specific __repr__ method is not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to the __repr__, if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH26495).
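The fallback described above is standard Python behavior; a minimal sketch with a plain class (not an actual pandas subclass) illustrating the convention pandas 0.25 now follows:

```python
# If a class defines only __repr__, str() falls back to it
# automatically -- the pattern pandas objects now rely on.
class Box:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Box({self.value!r})"

b = Box(3)
print(str(b))   # no __str__ defined, so this uses __repr__
print(repr(b))
```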
Indexing an IntervalIndex with Interval objects

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).
In [42]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
In [43]: ii
Out[43]:
IntervalIndex([(0, 4], (1, 5], (5, 8]],
closed='right',
dtype='interval[int64]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:
In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True
In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:
In [44]: pd.Interval(1, 2, closed='neither') in ii
Out[44]: False
In [45]: pd.Interval(-10, 10, closed='both') in ii
Out[45]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.
Previous behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])
In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1
In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [46]: s = pd.Series(list('abc'), index=ii)
In [47]: s
Out[47]:
(0, 4] a
(1, 5] b
(5, 8] c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:
In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4] a
(1, 5] b
dtype: object
In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [48]: s[pd.Interval(1, 5)]
Out[48]: 'b'
In [49]: s.loc[pd.Interval(1, 5)]
Out[49]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:
In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.
New behavior:
In [50]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [51]: idxr
Out[51]: array([ True, True, False])
In [52]: s[idxr]
Out[52]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
In [53]: s.loc[idxr]
Out[53]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
Binary ufuncs on Series now align

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).
In [54]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [55]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [56]: s1
Out[56]:
a 1
b 2
c 3
Length: 3, dtype: int64
In [57]: s2
Out[57]:
d 3
c 4
b 5
Length: 3, dtype: int64
Previous behavior
In [5]: np.power(s1, s2)
Out[5]:
a 1
b 16
c 243
dtype: int64
New behavior
In [58]: np.power(s1, s2)
Out[58]:
a 1.0
b 32.0
c 81.0
d NaN
Length: 4, dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.
In [59]: np.power(s1, s2.array)
Out[59]:
a 1
b 16
c 243
Length: 3, dtype: int64
Categorical.argsort now places missing values at the end

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).
In [60]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior
In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]
New behavior
In [61]: cat.argsort()
Out[61]: array([2, 0, 1])
In [62]: cat[cat.argsort()]
Out[62]:
[a, b, NaN]
Categories (2, object): [a < b]
Column order is preserved when passing a list of dicts to DataFrame

Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6 (GH27309).
In [63]: data = [
....: {'name': 'Joe', 'state': 'NY', 'age': 18},
....: {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
....: {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
....: ]
....:
Previous Behavior:
The columns were lexicographically sorted previously,
In [1]: pd.DataFrame(data)
Out[1]:
age finances hobby name state
0 18 NaN NaN Joe NY
1 19 NaN Minecraft Jane KY
2 20 good NaN Jean OK
New Behavior:
The column order now matches the insertion order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.
In [64]: pd.DataFrame(data)
Out[64]:
name state age hobby finances
0 Joe NY 18 NaN NaN
1 Jane KY 19 Minecraft NaN
2 Jean OK 20 NaN good
[3 rows x 5 columns]
Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:
Package | Minimum Version | Required |
---|---|---|
numpy | 1.13.3 | X |
pytz | 2015.4 | X |
python-dateutil | 2.6.1 | X |
bottleneck | 1.2.1 | |
numexpr | 2.6.2 | |
pytest (dev) | 4.0.2 |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version |
---|---|
beautifulsoup4 | 4.6.0 |
fastparquet | 0.2.1 |
gcsfs | 0.2.2 |
lxml | 3.8.0 |
matplotlib | 2.2.2 |
openpyxl | 2.4.8 |
pyarrow | 0.9.0 |
pymysql | 0.7.1 |
pytables | 3.4.2 |
scipy | 0.19.0 |
sqlalchemy | 1.1.4 |
xarray | 0.8.2 |
xlrd | 1.1.0 |
xlsxwriter | 0.9.8 |
xlwt | 1.2.0 |
See Dependencies and Optional dependencies for more.
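A small sketch for checking installed versions against the minimums in the tables above (the naive version parser and the subset of packages are illustrative only; `importlib.metadata` requires Python 3.8+):

```python
from importlib import metadata

# Required-dependency minimums taken from the table above.
minimums = {"numpy": "1.13.3", "pytz": "2015.4", "python-dateutil": "2.6.1"}

def version_tuple(v):
    # Naive parse: sufficient for plain "X.Y.Z" version strings.
    return tuple(int(p) for p in v.split(".") if p.isdigit())

for pkg, minimum in minimums.items():
    try:
        installed = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
        continue
    ok = version_tuple(installed) >= version_tuple(minimum)
    print(f"{pkg}: {installed} (>= {minimum}: {ok})")
```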
Other API changes

- `DatetimeTZDtype` will now standardize pytz timezones to a common timezone instance (GH24713)
- `Timestamp` and `Timedelta` scalars now implement the `to_numpy()` method as aliases to `Timestamp.to_datetime64()` and `Timedelta.to_timedelta64()`, respectively (GH24653)
- `Timestamp.strptime()` will now raise a `NotImplementedError` (GH25016)
- Comparing `Timestamp` with unsupported objects now returns `NotImplemented` instead of raising `TypeError`. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for `datetime` objects (GH24011)
- Bug in `DatetimeIndex.snap()` which didn't preserve the `name` of the input `Index` (GH25575)
- The `arg` argument in `pandas.core.groupby.DataFrameGroupBy.agg()` has been renamed to `func` (GH26089)
- The `arg` argument in `pandas.core.window._Window.aggregate()` has been renamed to `func` (GH26372)
- Most pandas classes had a `__bytes__` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (GH26447)
- The `.str`-accessor has been disabled for 1-level `MultiIndex`; use `MultiIndex.to_flat_index()` if necessary (GH23679)
- Removed support of gtk package for clipboards (GH26563)
- Using an unsupported version of Beautiful Soup 4 will now raise an `ImportError` instead of a `ValueError` (GH27063)
- `Series.to_excel()` and `DataFrame.to_excel()` will now raise a `ValueError` when saving timezone aware data (GH27008, GH7056)
- `ExtensionArray.argsort()` places NA values at the end of the sorted array (GH21801)
- `DataFrame.to_hdf()` and `Series.to_hdf()` will now raise a `NotImplementedError` when saving a `MultiIndex` with extension data types for a `fixed` format (GH7775)
- Passing duplicate `names` in `read_csv()` will now raise a `ValueError` (GH17346)
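The `read_csv()` change in the last item above can be sketched with a tiny in-memory CSV (the data is made up):

```python
import io
import pandas as pd

data = io.StringIO("1,2\n3,4\n")

# Passing duplicate column names now raises a ValueError instead of
# silently mangling or deduplicating them.
error_message = ""
try:
    pd.read_csv(data, names=["a", "a"])
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```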
Deprecations

Sparse subclasses

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.
Previous way
In [65]: df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
In [66]: df.dtypes
Out[66]:
A Sparse[int64, nan]
Length: 1, dtype: object
New way
In [67]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})
In [68]: df.dtypes
Out[68]:
A Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical. See Migrating for more (GH19239).
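The Series/DataFrame-with-sparse-values approach above can be sketched as follows; note that in current pandas releases the array type lives at `pd.arrays.SparseArray` (the top-level `pd.SparseArray` alias was later removed):

```python
import pandas as pd

# A DataFrame holding sparse values instead of the old SparseDataFrame
# subclass; sparse-specific functionality lives on the .sparse accessor.
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})
print(df.dtypes)

# Density: fraction of points that differ from fill_value (here 0),
# so 2 of 4 values are stored.
print(df["A"].sparse.density)
```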
msgpack format

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects (GH27084).
Other deprecations
- The deprecated `.ix[]` indexer now raises a more visible `FutureWarning` instead of `DeprecationWarning` (GH26438).
- Deprecated the `units=M` (months) and `units=Y` (year) parameters for `units` of `pandas.to_timedelta()`, `pandas.Timedelta()` and `pandas.TimedeltaIndex()` (GH16344)
- `pandas.concat()` has deprecated the `join_axes` keyword. Instead, use `DataFrame.reindex()` or `DataFrame.reindex_like()` on the result or on the inputs (GH21951)
- The `SparseArray.values` attribute is deprecated. You can use `np.asarray(...)` or the `SparseArray.to_dense()` method instead (GH26421).
- The functions `pandas.to_datetime()` and `pandas.to_timedelta()` have deprecated the `box` keyword. Instead, use `to_numpy()` or `Timestamp.to_datetime64()` or `Timedelta.to_timedelta64()`. (GH24416)
- The `DataFrame.compound()` and `Series.compound()` methods are deprecated and will be removed in a future version (GH26405).
- The internal attributes `_start`, `_stop` and `_step` of `RangeIndex` have been deprecated. Use the public attributes `start`, `stop` and `step` instead (GH26581).
- The `Series.ftype()`, `Series.ftypes()` and `DataFrame.ftypes()` methods are deprecated and will be removed in a future version. Instead, use `Series.dtype()` and `DataFrame.dtypes()` (GH26705).
- The `Series.get_values()`, `DataFrame.get_values()`, `Index.get_values()`, `SparseArray.get_values()` and `Categorical.get_values()` methods are deprecated. One of `np.asarray(..)` or `to_numpy()` can be used instead (GH19617).
- The 'outer' method on NumPy ufuncs, e.g. `np.subtract.outer`, has been deprecated on `Series` objects. Convert the input to an array with `Series.array` first (GH27186)
- `Timedelta.resolution()` is deprecated and replaced with `Timedelta.resolution_string()`. In a future version, `Timedelta.resolution()` will be changed to behave like the standard library `datetime.timedelta.resolution` (GH21344)
- `read_table()` has been undeprecated. (GH25220)
- `Index.dtype_str` is deprecated. (GH18262)
- `Series.imag` and `Series.real` are deprecated. (GH18262)
- `Series.put()` is deprecated. (GH18262)
- `Index.item()` and `Series.item()` are deprecated. (GH18262)
- The default value `ordered=None` in `CategoricalDtype` has been deprecated in favor of `ordered=False`. When converting between categorical types, `ordered=True` must be explicitly passed in order to be preserved. (GH26336)
- `Index.contains()` is deprecated. Use `key in index` (`__contains__`) instead (GH17753).
- `DataFrame.get_dtype_counts()` is deprecated. (GH18262)
- `Categorical.ravel()` will return a `Categorical` instead of a `np.ndarray` (GH27199)
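For the deprecated `.ix[]` indexer above, the replacement is `.loc` for label-based access and `.iloc` for position-based access; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])

# Previously: df.ix["y", "A"] mixed label- and position-based lookup.
print(df.loc["y", "A"])  # label-based: 2
print(df.iloc[1, 0])     # position-based: 2
```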
Removal of prior version deprecations/changes
- Removed `Panel` (GH25047, GH25191, GH25231)
- Removed the previously deprecated `sheetname` keyword in `read_excel()` (GH16442, GH20938)
- Removed the previously deprecated `TimeGrouper` (GH16942)
- Removed the previously deprecated `parse_cols` keyword in `read_excel()` (GH16488)
- Removed the previously deprecated `pd.options.html.border` (GH16970)
- Removed the previously deprecated `convert_objects` (GH11221)
- Removed the previously deprecated `select` method of `DataFrame` and `Series` (GH17633)
- Removed the previously deprecated behavior of `Series` treated as list-like in `rename_categories()` (GH17982)
- Removed the previously deprecated `DataFrame.reindex_axis` and `Series.reindex_axis` (GH17842)
- Removed the previously deprecated behavior of altering column or index labels with `Series.rename_axis()` or `DataFrame.rename_axis()` (GH17842)
- Removed the previously deprecated `tupleize_cols` keyword argument in `read_html()`, `read_csv()`, and `DataFrame.to_csv()` (GH17877, GH17820)
- Removed the previously deprecated `DataFrame.from_csv` and `Series.from_csv` (GH17812)
- Removed the previously deprecated `raise_on_error` keyword argument in `DataFrame.where()` and `DataFrame.mask()` (GH17744)
- Removed the previously deprecated `ordered` and `categories` keyword arguments in `astype` (GH17742)
- Removed the previously deprecated `cdate_range` (GH17691)
- Removed the previously deprecated `True` option for the `dropna` keyword argument in `SeriesGroupBy.nth()` (GH17493)
- Removed the previously deprecated `convert` keyword argument in `Series.take()` and `DataFrame.take()` (GH17352)
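For the removed `ordered` and `categories` keyword arguments of `astype`, the replacement is to pass a `CategoricalDtype`; a small sketch (the example data is invented):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["low", "high", "low"])

# Previously: s.astype("category", categories=[...], ordered=True)
cat_type = CategoricalDtype(categories=["low", "high"], ordered=True)
result = s.astype(cat_type)
print(result.dtype)  # ordered category dtype with categories ['low', 'high']
```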
Performance improvements
- Significant speedup in `SparseArray` initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
- `DataFrame.to_stata()` is now faster when outputting data with any string or non-native endian columns (GH25045)
- Improved performance of `Series.searchsorted()`. The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
- Improved performance of `pandas.core.groupby.GroupBy.quantile()` (GH20405)
- Improved performance of slicing and other selected operations on a `RangeIndex` (GH26565, GH26617, GH26722)
- `RangeIndex` now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH16685)
- Improved performance of `read_csv()` by faster tokenizing and faster parsing of small float numbers (GH25784)
- Improved performance of `read_csv()` by faster parsing of N/A and boolean values (GH25804)
- Improved performance of `IntervalIndex.is_monotonic`, `IntervalIndex.is_monotonic_increasing` and `IntervalIndex.is_monotonic_decreasing` by removing conversion to `MultiIndex` (GH24813)
- Improved performance of `DataFrame.to_csv()` when writing datetime dtypes (GH25708)
- Improved performance of `read_csv()` by much faster parsing of `MM/YYYY` and `DD/MM/YYYY` datetime formats (GH25922)
- Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for `Series.all()` and `Series.any()` (GH25070)
- Improved performance of `Series.map()` for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
- Improved performance of `IntervalIndex.intersection()` (GH24813)
- Improved performance of `read_csv()` by faster concatenating date columns without extra conversion to string for integer/float zero and float `NaN`; by faster checking the string for the possibility of being a date (GH25754)
- Improved performance of `IntervalIndex.is_unique` by removing conversion to `MultiIndex` (GH24813)
- Restored performance of `DatetimeIndex.__iter__()` by re-enabling specialized code path (GH26702)
- Improved performance when building `MultiIndex` with at least one `CategoricalIndex` level (GH22044)
- Improved performance by removing the need for a garbage collect when checking for `SettingWithCopyWarning` (GH27031)
- For `to_datetime()`, changed default value of the cache parameter to `True` (GH26043)
- Improved performance of `DatetimeIndex` and `PeriodIndex` slicing given non-unique, monotonic data (GH27136).
- Improved performance of `pd.read_json()` for index-oriented data. (GH26773)
- Improved performance of `MultiIndex.shape()` (GH27384).
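The new `cache=True` default of `to_datetime()` mentioned above means each unique date string is parsed only once and the result is reused; behavior is otherwise unchanged, as this small invented example illustrates:

```python
import pandas as pd

# Many repeated date strings: with cache=True (the new default),
# each unique string is parsed a single time.
dates = pd.Series(["2019-07-18", "2019-07-19"] * 1000)
parsed = pd.to_datetime(dates, cache=True)
print(parsed.nunique())  # 2
```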
Bug fixes
Categorical
- Bug in `DataFrame.at()` and `Series.at()` that would raise exception if the index was a `CategoricalIndex` (GH20629)
- Fixed bug in comparison of ordered `Categorical` that contained missing values with a scalar which sometimes incorrectly resulted in `True` (GH26504)
- Bug in `DataFrame.dropna()` when the `DataFrame` has a `CategoricalIndex` containing `Interval` objects incorrectly raised a `TypeError` (GH25087)
Datetime-like
- Bug in `to_datetime()` which would raise an (incorrect) `ValueError` when called with a date far into the future and the `format` argument specified instead of raising `OutOfBoundsDatetime` (GH23830)
- Bug in `to_datetime()` which would raise `InvalidIndexError: Reindexing only valid with uniquely valued Index objects` when called with `cache=True`, with `arg` including at least two different elements from the set `{None, numpy.nan, pandas.NaT}` (GH22305)
- Bug in `DataFrame` and `Series` where timezone aware data with `dtype='datetime64[ns]'` was not cast to naive (GH25843)
- Improved `Timestamp` type checking in various datetime functions to prevent exceptions when using a subclassed `datetime` (GH25851)
- Bug in `Series` and `DataFrame` repr where `np.datetime64('NaT')` and `np.timedelta64('NaT')` with `dtype=object` would be represented as `NaN` (GH25445)
- Bug in `to_datetime()` which does not replace the invalid argument with `NaT` when `errors` is set to `'coerce'` (GH26122)
- Bug in adding `DateOffset` with nonzero month to `DatetimeIndex` would raise `ValueError` (GH26258)
- Bug in `to_datetime()` which raises an unhandled `OverflowError` when called with a mix of invalid dates and `NaN` values with `format='%Y%m%d'` and `errors='coerce'` (GH25512)
- Bug in `isin()` for datetimelike indexes (`DatetimeIndex`, `TimedeltaIndex` and `PeriodIndex`) where the `levels` parameter was ignored (GH26675)
- Bug in `to_datetime()` which raises `TypeError` for `format='%Y%m%d'` when called for invalid integer dates with length >= 6 digits with `errors='ignore'`
- Bug when comparing a `PeriodIndex` against a zero-dimensional numpy array (GH26689)
- Bug in constructing a `Series` or `DataFrame` from a numpy `datetime64` array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an `OutOfBoundsDatetime` error (GH26206).
- Bug in `date_range()` with unnecessary `OverflowError` being raised for very large or very small dates (GH26651)
- Bug where adding `Timestamp` to a `np.timedelta64` object would raise instead of returning a `Timestamp` (GH24775)
- Bug where comparing a zero-dimensional numpy array containing a `np.datetime64` object to a `Timestamp` would incorrectly raise `TypeError` (GH26916)
- Bug in `to_datetime()` which would raise `ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True` when called with `cache=True`, with `arg` including datetime strings with different offsets (GH26097)
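Several of the fixes above concern `errors='coerce'`, which replaces unparseable inputs with `NaT` instead of raising; a minimal illustration:

```python
import pandas as pd

# With errors="coerce", invalid entries become NaT rather than raising.
out = pd.to_datetime(["20190718", "not-a-date"],
                     format="%Y%m%d", errors="coerce")
print(out)
```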
Timedelta
- Bug in `TimedeltaIndex.intersection()` where for non-monotonic indices in some cases an empty `Index` was returned when in fact an intersection existed (GH25913)
- Bug with comparisons between `Timedelta` and `NaT` raising `TypeError` (GH26039)
- Bug when adding or subtracting a `BusinessHour` to a `Timestamp` with the resulting time landing in a following or prior day respectively (GH26381)
- Bug when comparing a `TimedeltaIndex` against a zero-dimensional numpy array (GH26689)
Timezones
- Bug in `DatetimeIndex.to_frame()` where timezone aware data would be converted to timezone naive data (GH25809)
- Bug in `to_datetime()` with `utc=True` and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH24992)
- Bug in `Timestamp.tz_localize()` and `Timestamp.tz_convert()` does not propagate `freq` (GH25241)
- Bug in `Series.at()` where setting `Timestamp` with timezone raises `TypeError` (GH25506)
- Bug in `DataFrame.update()` when updating with timezone aware data would return timezone naive data (GH25807)
- Bug in `to_datetime()` where an uninformative `RuntimeError` was raised when passing a naive `Timestamp` with datetime strings with mixed UTC offsets (GH25978)
- Bug in `to_datetime()` with `unit='ns'` would drop timezone information from the parsed argument (GH26168)
- Bug in `DataFrame.join()` where joining a timezone aware index with a timezone aware column would result in a column of `NaN` (GH26335)
- Bug in `date_range()` where ambiguous or nonexistent start or end times were not handled by the `ambiguous` or `nonexistent` keywords respectively (GH27088)
- Bug in `DatetimeIndex.union()` when combining a timezone aware and timezone unaware `DatetimeIndex` (GH21671)
- Bug when applying a numpy reduction function (e.g. `numpy.minimum()`) to a timezone aware `Series` (GH15552)
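As background for the timezone fixes above, `tz_localize` attaches a timezone to a naive `Timestamp` while `tz_convert` translates an aware one between zones (the times here are arbitrary):

```python
import pandas as pd

ts = pd.Timestamp("2019-07-18 12:00")           # naive timestamp
localized = ts.tz_localize("UTC")               # attach a timezone
converted = localized.tz_convert("US/Eastern")  # translate to another zone
print(converted.hour)  # 8 (EDT is UTC-4 in July)
```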
Numeric
- Bug in `to_numeric()` in which large negative numbers were being improperly handled (GH24910)
- Bug in `to_numeric()` in which numbers were being coerced to float, even though `errors` was not `coerce` (GH24910)
- Bug in `to_numeric()` in which invalid values for `errors` were being allowed (GH26466)
- Bug in `format` in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
- Bug in error messages in `DataFrame.corr()` and `Series.corr()`. Added the possibility of using a callable. (GH25729)
- Bug in `Series.divmod()` and `Series.rdivmod()` which would raise an (incorrect) `ValueError` rather than return a pair of `Series` objects as result (GH25557)
- Raises a helpful exception when a non-numeric index is sent to `interpolate()` with methods which require a numeric index. (GH21662)
- Bug in `eval()` when comparing floats with scalar operators, for example: `x < -0.1` (GH25928)
- Fixed bug where casting an all-boolean array to an integer extension array failed (GH25211)
- Bug in `divmod` with a `Series` object containing zeros incorrectly raising `AttributeError` (GH26987)
- Inconsistency in `Series` floor-division (`//`) and `divmod` filling positive//zero with `NaN` instead of `Inf` (GH27321)
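The `Series.divmod()` fix above restores the documented behavior of returning a pair of `Series` (quotient and remainder); a quick sketch:

```python
import pandas as pd

a = pd.Series([7, 8, 9])
# divmod returns element-wise floor division and modulo as two Series.
q, r = a.divmod(3)
print(q.tolist())  # [2, 2, 3]
print(r.tolist())  # [1, 2, 0]
```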
Conversion
- Bug in `DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (GH25905)
Strings
- Bug in the `__name__` attribute of several methods of `Series.str`, which were set incorrectly (GH23551)
- Improved error message when passing a `Series` of wrong dtype to `Series.str.cat()` (GH22722)
Interval
- Construction of `Interval` is restricted to numeric, `Timestamp` and `Timedelta` endpoints (GH23013)
- Fixed bug in `Series`/`DataFrame` not displaying `NaN` in `IntervalIndex` with missing values (GH25984)
- Bug in `IntervalIndex.get_loc()` where a `KeyError` would be incorrectly raised for a decreasing `IntervalIndex` (GH25860)
- Bug in `Index` constructor where passing mixed closed `Interval` objects would result in a `ValueError` instead of an `object` dtype `Index` (GH27172)
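To illustrate the endpoint restriction above: `Interval` construction with `Timestamp` endpoints remains supported, and membership testing works with the `in` operator (the dates are invented):

```python
import pandas as pd

# An interval with Timestamp endpoints, closed on both sides.
iv = pd.Interval(pd.Timestamp("2019-07-01"),
                 pd.Timestamp("2019-07-18"), closed="both")
print(pd.Timestamp("2019-07-10") in iv)  # True
```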
Indexing
- Improved exception message when calling `DataFrame.iloc()` with a list of non-numeric objects (GH25753).
- Improved exception message when calling `.iloc` or `.loc` with a boolean indexer with different length (GH26658).
- Bug in `KeyError` exception message when indexing a `MultiIndex` with a non-existent key not displaying the original key (GH27250).
- Bug in `.iloc` and `.loc` with a boolean indexer not raising an `IndexError` when too few items are passed (GH26658).
- Bug in `DataFrame.loc()` and `Series.loc()` where `KeyError` was not raised for a `MultiIndex` when the key was less than or equal to the number of levels in the `MultiIndex` (GH14885).
- Bug in which `DataFrame.append()` produced an erroneous warning indicating that a `KeyError` will be thrown in the future when the data to be appended contains new columns (GH22252).
- Bug in which `DataFrame.to_csv()` caused a segfault for a reindexed data frame, when the indices were a single-level `MultiIndex` (GH26303).
- Fixed bug where assigning an `arrays.PandasArray` to a `pandas.core.frame.DataFrame` would raise an error (GH26390)
- Allow keyword arguments for callable local reference used in the `DataFrame.query()` string (GH26426)
- Fixed a `KeyError` when indexing a `MultiIndex` level with a list containing exactly one label, which is missing (GH27148)
- Bug which produced `AttributeError` on partial matching `Timestamp` in a `MultiIndex` (GH26944)
- Bug in `Categorical` and `CategoricalIndex` with `Interval` values when using the `in` operator (`__contains__`) with objects that are not comparable to the values in the `Interval` (GH23705)
- Bug in `DataFrame.loc()` and `DataFrame.iloc()` on a `DataFrame` with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a `Series` (GH27110)
- Bug in `CategoricalIndex` and `Categorical` incorrectly raising `ValueError` instead of `TypeError` when a list is passed using the `in` operator (`__contains__`) (GH21729)
- Bug in setting a new value in a `Series` with a `Timedelta` object incorrectly casting the value to an integer (GH22717)
- Bug in `Series` setting a new key (`__setitem__`) with a timezone-aware datetime incorrectly raising `ValueError` (GH12862)
- Bug in `DataFrame.iloc()` when indexing with a read-only indexer (GH17192)
- Bug in `Series` setting an existing tuple key (`__setitem__`) with timezone-aware datetime values incorrectly raising `TypeError` (GH20441)
Missing
- Fixed misleading exception message in `Series.interpolate()` if argument `order` is required, but omitted (GH10633, GH24014).
- Fixed class type displayed in exception message in `DataFrame.dropna()` if invalid `axis` parameter passed (GH25555)
- A `ValueError` will now be thrown by `DataFrame.fillna()` when `limit` is not a positive integer (GH27042)
MultiIndex
- Bug in which incorrect exception raised by `Timedelta` when testing the membership of `MultiIndex` (GH24570)
I/O
- Bug in `DataFrame.to_html()` where values were truncated using display options instead of outputting the full content (GH17004)
- Fixed bug in missing text when using `to_clipboard()` if copying utf-16 characters in Python 3 on Windows (GH25040)
- Bug in `read_json()` for `orient='table'` when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
- Bug in `read_json()` for `orient='table'` and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
- Bug in `read_json()` for `orient='table'` and string of float column names, as it makes a column name type conversion to `Timestamp`, which is not applicable because column names are already defined in the JSON schema (GH25435)
- Bug in `json_normalize()` for `errors='ignore'` where missing values in the input data were filled in the resulting `DataFrame` with the string `"nan"` instead of `numpy.nan` (GH25468)
- `DataFrame.to_html()` now raises `TypeError` when using an invalid type for the `classes` parameter instead of `AssertionError` (GH25608)
- Bug in `DataFrame.to_string()` and `DataFrame.to_latex()` that would lead to incorrect output when the `header` keyword is used (GH16718)
- Bug in `read_csv()` not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
- Improved performance in `pandas.read_stata()` and `pandas.io.stata.StataReader` when converting columns that have missing values (GH25772)
- Bug in `DataFrame.to_html()` where header numbers would ignore display options when rounding (GH17280)
- Bug in `read_hdf()` where reading a table from an HDF5 file written directly with PyTables fails with a `ValueError` when using a sub-selection via the `start` or `stop` arguments (GH11188)
- Bug in `read_hdf()` not properly closing store after a `KeyError` is raised (GH25766)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
- Improved `pandas.read_stata()` and `pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (GH25960)
- Improved the `col_space` parameter in `DataFrame.to_html()` to accept a string so CSS length values can be set correctly (GH25941)
- Fixed bug in loading objects from S3 that contain `#` characters in the URL (GH25945)
- Adds `use_bqstorage_api` parameter to `read_gbq()` to speed up downloads of large data frames. This feature requires version 0.10.0 of the `pandas-gbq` library as well as the `google-cloud-bigquery-storage` and `fastavro` libraries. (GH26104)
- Fixed memory leak in `DataFrame.to_json()` when dealing with numeric data (GH24889)
- Bug in `read_json()` where date strings with `Z` were not converted to a UTC timezone (GH26168)
- Added `cache_dates=True` parameter to `read_csv()`, which allows to cache unique dates when they are parsed (GH25990)
- `DataFrame.to_excel()` now raises a `ValueError` when the caller's dimensions exceed the limitations of Excel (GH26051)
- Fixed bug in `pandas.read_csv()` where a BOM would result in incorrect parsing using `engine='python'` (GH26545)
- `read_excel()` now raises a `ValueError` when input is of type `pandas.io.excel.ExcelFile` and the `engine` param is passed, since `pandas.io.excel.ExcelFile` has an engine defined (GH26566)
- Bug while selecting from `HDFStore` with `where=''` specified (GH26610).
- Fixed bug in `DataFrame.to_excel()` where custom objects (i.e. `PeriodIndex`) inside merged cells were not being converted into types safe for the Excel writer (GH27006)
- Bug in `read_hdf()` where reading a timezone aware `DatetimeIndex` would raise a `TypeError` (GH11926)
- Bug in `to_msgpack()` and `read_msgpack()` which would raise a `ValueError` rather than a `FileNotFoundError` for an invalid path (GH27160)
- Fixed bug in `DataFrame.to_parquet()` which would raise a `ValueError` when the dataframe had no columns (GH27339)
- Allow parsing of `PeriodDtype` columns when using `read_csv()` (GH26934)
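The new `cache_dates` parameter of `read_csv()` can be exercised with an in-memory buffer; this sketch uses invented CSV data:

```python
import io
import pandas as pd

csv = "date,value\n2019-07-18,1\n2019-07-18,2\n"
# cache_dates=True parses each unique date string only once.
df = pd.read_csv(io.StringIO(csv), parse_dates=["date"], cache_dates=True)
print(df["date"].dtype)  # datetime64[ns]
```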
Plotting
- Fixed bug where `api.extensions.ExtensionArray` could not be used in matplotlib plotting (GH25587)
- Bug in an error message in `DataFrame.plot()`. Improved the error message if non-numerics are passed to `DataFrame.plot()` (GH25481)
- Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH7612, GH15912, GH22334)
- Fixed bug causing plots of `PeriodIndex` timeseries to fail if the frequency is a multiple of the frequency rule code (GH14763)
- Fixed bug when plotting a `DatetimeIndex` with `datetime.timezone.utc` timezone (GH17173)
Groupby/resample/rolling
- Bug in `pandas.core.resample.Resampler.agg()` with a timezone aware index where `OverflowError` would raise when passing a list of functions (GH22660)
- Bug in `pandas.core.groupby.DataFrameGroupBy.nunique()` in which the names of column levels were lost (GH23222)
- Bug in `pandas.core.groupby.GroupBy.agg()` when applying an aggregation function to timezone aware data (GH23683)
- Bug in `pandas.core.groupby.GroupBy.first()` and `pandas.core.groupby.GroupBy.last()` where timezone information would be dropped (GH21603)
- Bug in `pandas.core.groupby.GroupBy.size()` when grouping only NA values (GH23050)
- Bug in `Series.groupby()` where the `observed` kwarg was previously ignored (GH24880)
- Bug in `Series.groupby()` where using `groupby` with a `MultiIndex` Series with a list of labels equal to the length of the series caused incorrect grouping (GH25704)
- Ensured that ordering of outputs in `groupby` aggregation functions is consistent across all versions of Python (GH25692)
- Ensured that result group order is correct when grouping on an ordered `Categorical` and specifying `observed=True` (GH25871, GH25167)
- Bug in `pandas.core.window.Rolling.min()` and `pandas.core.window.Rolling.max()` that caused a memory leak (GH25893)
- Bug in `pandas.core.window.Rolling.count()` and `pandas.core.window.Expanding.count` was previously ignoring the `axis` keyword (GH13503)
- Bug in `pandas.core.groupby.GroupBy.idxmax()` and `pandas.core.groupby.GroupBy.idxmin()` with datetime column would return incorrect dtype (GH25444, GH15306)
- Bug in `pandas.core.groupby.GroupBy.cumsum()`, `pandas.core.groupby.GroupBy.cumprod()`, `pandas.core.groupby.GroupBy.cummin()` and `pandas.core.groupby.GroupBy.cummax()` with categorical column having absent categories, would return incorrect result or segfault (GH16771)
- Bug in `pandas.core.groupby.GroupBy.nth()` where NA values in the grouping would return incorrect results (GH26011)
- Bug in `pandas.core.groupby.SeriesGroupBy.transform()` where transforming an empty group would raise a `ValueError` (GH26208)
- Bug in `pandas.core.frame.DataFrame.groupby()` where passing a `pandas.core.groupby.grouper.Grouper` would return incorrect groups when using the `.groups` accessor (GH26326)
- Bug in `pandas.core.groupby.GroupBy.agg()` where incorrect results are returned for uint64 columns. (GH26310)
- Bug in `pandas.core.window.Rolling.median()` and `pandas.core.window.Rolling.quantile()` where MemoryError is raised with empty window (GH26005)
- Bug in `pandas.core.window.Rolling.median()` and `pandas.core.window.Rolling.quantile()` where incorrect results are returned with `closed='left'` and `closed='neither'` (GH26005)
- Improved `pandas.core.window.Rolling`, `pandas.core.window.Window` and `pandas.core.window.EWM` functions to exclude nuisance columns from results instead of raising errors and raise a `DataError` only if all columns are nuisance (GH12537)
- Bug in `pandas.core.window.Rolling.max()` and `pandas.core.window.Rolling.min()` where incorrect results are returned with an empty variable window (GH26005)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of `pandas.core.window.Window.aggregate()` (GH26597)
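The `observed` keyword fix for `Series.groupby()` above can be seen with a categorical grouper that has an unused category (the data is invented):

```python
import pandas as pd

cat = pd.Categorical(["a", "a"], categories=["a", "b"])
s = pd.Series([1, 2])

# observed=True keeps only categories actually present in the data,
# so the unused category "b" does not appear in the result.
res = s.groupby(cat, observed=True).sum()
print(list(res.index))  # ['a']
```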
Reshaping

- Bug in `pandas.merge()` adds a string of `None` if `None` is assigned in `suffixes`, instead of leaving the column name as-is (GH24782)
- Bug in `merge()` when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212, GH25009)
- `to_records()` now accepts dtypes to its `column_dtypes` parameter (GH24895)
- Bug in `concat()` where the order of an `OrderedDict` (and of a `dict` in Python 3.6+) is not respected when passed in as the `objs` argument (GH21510)
- Bug in `pivot_table()` where columns with `NaN` values are dropped even if the `dropna` argument is `False`, when the `aggfunc` argument contains a `list` (GH22159)
- Bug in `concat()` where the resulting `freq` of two `DatetimeIndex` with the same `freq` would be dropped (GH3232)
- Bug in `merge()` where merging with equivalent Categorical dtypes was raising an error (GH22501)
- Bug in `DataFrame` instantiating with a dict of iterators or generators (e.g. `pd.DataFrame({'A': reversed(range(3))})`) raised an error (GH26349)
- Bug in `DataFrame` instantiating with a `range` (e.g. `pd.DataFrame(range(3))`) raised an error (GH26342)
- Bug in the `DataFrame` constructor when passing non-empty tuples would cause a segmentation fault (GH25691)
- Bug in `Series.apply()` failed when the series is a timezone-aware `DatetimeIndex` (GH25959)
- Bug in `pandas.cut()` where large bins could incorrectly raise an error due to an integer overflow (GH26045)
- Bug in `DataFrame.sort_index()` where an error is thrown when a multi-indexed `DataFrame` is sorted on all levels with the initial level sorted last (GH26053)
- Bug in `Series.nlargest()` treats `True` as smaller than `False` (GH26154)
- Bug in `DataFrame.pivot_table()` with an `IntervalIndex` as pivot index would raise a `TypeError` (GH25814)
- Bug in which `DataFrame.from_dict()` ignored the order of an `OrderedDict` when `orient='index'` (GH8425)
- Bug in `DataFrame.transpose()` where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise a `ValueError` (GH26825)
- Bug in `pivot_table()` when pivoting a timezone-aware column as the `values` would remove timezone information (GH14948)
- Bug in `merge_asof()` when specifying multiple `by` columns where one is of `datetime64[ns, tz]` dtype (GH26649)
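The two constructor fixes above (GH26349, GH26342) mean iterators, generators, and `range` objects can now be passed straight to `DataFrame`. A minimal sketch, assuming a recent pandas and the default `RangeIndex`:

```python
import pandas as pd

# A dict value that is an iterator/generator is now materialized
# into a column instead of raising (GH26349).
df = pd.DataFrame({'A': reversed(range(3))})
print(df['A'].tolist())   # [2, 1, 0]

# A bare range can likewise be used as the data (GH26342);
# it becomes a single column named 0.
df2 = pd.DataFrame(range(3))
print(df2[0].tolist())    # [0, 1, 2]
```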
Sparse

- Significant speedup in `SparseArray` initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH24985)
- Bug in the `SparseFrame` constructor where passing `None` as the data would cause `default_fill_value` to be ignored (GH16807)
- Bug in `SparseDataFrame` when adding a column in which the length of the values does not match the length of the index, an `AssertionError` is raised instead of a `ValueError` (GH25484)
- Introduce a better error message in `Series.sparse.from_coo()` so it raises a `TypeError` for inputs that are not coo matrices (GH26554)
- Bug in `numpy.modf()` on a `SparseArray`. Now a tuple of `SparseArray` is returned (GH26946)
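The `numpy.modf()` fix (GH26946) means multi-output ufuncs now round-trip through `SparseArray` instead of failing. A small sketch, assuming a recent pandas where the array is exposed as `pd.arrays.SparseArray`:

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.5, 0.0, 2.25])

# np.modf is a two-output ufunc; both outputs now come back
# as SparseArray rather than raising (GH26946).
frac, whole = np.modf(arr)
print(np.asarray(frac))   # fractional parts: 0.5, 0.0, 0.25
print(np.asarray(whole))  # integral parts:   1.0, 0.0, 2.0
```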
Build changes

- Fix install error with PyPy on macOS (GH26536)
ExtensionArray

- Bug in `factorize()` when passing an `ExtensionArray` with a custom `na_sentinel` (GH25696)
- `Series.count()` miscounts NA values in ExtensionArrays (GH26835)
- Added `Series.__array_ufunc__` to better handle NumPy ufuncs applied to Series backed by extension arrays (GH23293)
- Keyword argument `deep` has been removed from `ExtensionArray.copy()` (GH27083)
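The `Series.count()` fix (GH26835) and the new `Series.__array_ufunc__` (GH23293) are easiest to see with the nullable `Int64` extension dtype. A brief sketch, assuming a recent pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# count() now correctly skips the NA slot in the
# extension array (GH26835).
print(s.count())      # 2

# NumPy ufuncs dispatch through Series.__array_ufunc__ and
# preserve the extension dtype (GH23293).
out = np.add(s, 1)
print(out.dtype)      # Int64
```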
Other

- Removed unused C functions from the vendored UltraJSON implementation (GH26198)
- Allow `Index` and `RangeIndex` to be passed to the NumPy `min` and `max` functions (GH26125)
- Use the actual class name in the repr of empty objects of a `Series` subclass (GH27001)
- Bug in `DataFrame` where passing an object array of timezone-aware datetime objects would incorrectly raise a `ValueError` (GH13287)
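The `min`/`max` change (GH26125) means Index objects can be handed directly to the NumPy reductions, which dispatch to `Index.min()`/`Index.max()`. A quick sketch:

```python
import numpy as np
import pandas as pd

idx = pd.Index([3, 1, 2])

# np.min/np.max now work on Index and RangeIndex (GH26125).
print(np.min(idx))               # 1
print(np.max(pd.RangeIndex(5)))  # 4
```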
Contributors

(Translator's note: the official contributor list was not published.)