What's new in v0.25.0 (July 18, 2019)

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray instead.

Warning

read_pickle() and read_msgpack() are only guaranteed to be backwards compatible back to pandas version 0.20.3 (GH27082).

These are the changes in pandas v0.25.0. See the release notes for a full changelog including other versions of pandas.

Enhancements

Groupby aggregation with relabeling

pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', np.mean),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (see Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply.

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

This type of aggregation is the recommended alternative to the deprecated behavior of passing a dict to a Series groupby aggregation (see Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

Groupby aggregation with multiple lambdas

You can now provide multiple lambda functions to a list-like aggregation with pandas.core.groupby.GroupBy.agg (GH26430).

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

Previously, these raised a SpecificationError.
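
If the auto-generated <lambda_0>/<lambda_1> column names are undesirable, one workaround (an illustrative sketch, not from the release notes) is to combine the named aggregation shown above with lambdas, so the output columns carry chosen names:

import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

# Named aggregation also accepts lambdas, so the output columns get
# the names chosen here rather than <lambda_0>/<lambda_1>.
result = animals.groupby('kind').height.agg(
    first=lambda x: x.iloc[0],
    last=lambda x: x.iloc[-1],
)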

Better repr for MultiIndex

Printing of MultiIndex instances now shows tuple data for each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex (GH13480):

The repr now looks like this:

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

In previous versions, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
   ...:            codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])

In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it is wider than options.display.width (default: 80 characters).
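
For illustration, a minimal sketch (not from the release notes) of tweaking these display options:

import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'abc'], range(500)])

# Show at most 10 tuples before the repr truncates with '...'
pd.set_option('display.max_seq_items', 10)
print(mi)

# Restore the default of 100 items
pd.reset_option('display.max_seq_items')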

Shorter truncated repr for Series and DataFrame

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

  • For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
  • For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).

This dual option allows one to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
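
A minimal sketch of how the two thresholds interact (the option names are those described above):

import pandas as pd

df = pd.DataFrame({'x': range(100)})

# 100 rows > max_rows (60), so the truncated repr shows min_rows (10) rows
print(df)

# Show 20 rows in the truncated repr instead
pd.set_option('display.min_rows', 20)
print(df)

# Restore the previous single-threshold behaviour
pd.set_option('display.min_rows', None)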

JSON normalize with max_level param support

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over which level to end normalization (GH23843):

In [9]: from pandas.io.json import json_normalize

In [10]: data = [{
   ....:     'CreatedBy': {'Name': 'User001'},
   ....:     'Lookup': {'TextField': 'Some text',
   ....:                'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
   ....:     'Image': {'a': 'b'}
   ....: }]
   ....: 

In [11]: json_normalize(data, max_level=1)
Out[11]: 
  CreatedBy.Name Lookup.TextField                    Lookup.UserField Image.a
0        User001        Some text  {'Id': 'ID001', 'Name': 'Name001'}       b

[1 rows x 4 columns]
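
As a hedged sketch of the other extreme (not shown in the release notes), max_level=0 should leave every nested dict intact as a cell value, while omitting max_level flattens all levels:

from pandas.io.json import json_normalize

data = [{'CreatedBy': {'Name': 'User001'},
         'Lookup': {'TextField': 'Some text',
                    'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
         'Image': {'a': 'b'}}]

# max_level=0: columns CreatedBy / Lookup / Image hold the raw dicts
top_only = json_normalize(data, max_level=0)

# default (max_level=None): fully flattened, e.g. Lookup.UserField.Id
flat = json_normalize(data)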

Series.explode to split list-like values to rows

Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes to individual rows. See the section on Exploding list-like column in the docs for more information (GH16538, GH10511).

Here is a typical use case. You have comma-separated strings in a column:

In [12]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ....:                    {'var1': 'd,e,f', 'var2': 2}])
   ....: 

In [13]: df
Out[13]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations:

In [14]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[14]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]
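
Series.explode() can also be used directly; a minimal sketch, assuming pandas 0.25:

import pandas as pd

s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])

# Each list element becomes its own row and the index label repeats;
# scalars pass through unchanged and an empty list becomes NaN.
print(s.explode())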

Other enhancements

Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing (GH24076, GH16785).

In [15]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [16]: df
Out[16]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [17]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[17]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
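
If a query is meant in the index's own timezone, one option (an illustrative sketch, not from the release notes) is to convert the query timestamps explicitly before slicing:

import pandas as pd

df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

# Convert the +04:00 query bounds into the index's timezone up front
start = pd.Timestamp('2019-01-01 12:00:00+04:00').tz_convert('US/Pacific')
end = pd.Timestamp('2019-01-01 13:00:00+04:00').tz_convert('US/Pacific')
print(df[start:end])  # same row as the string-based slice above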

MultiIndex constructed from levels and codes

Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed and NaN levels' corresponding codes would be reassigned as -1 (GH19387).

Previous behavior:

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New behavior:

In [18]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[18]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [19]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    270 
    271         if verify_integrity:
--> 272             new_codes = result._verify_integrity()
    273             result._codes = new_codes
    274 

/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    348                 raise ValueError(
    349                     "On level {level}, code value ({code})"
--> 350                     " < -1".format(level=i, code=level_codes.min())
    351                 )
    352             if not level.is_unique:

ValueError: On level 0, code value (-2) < -1

Groupby.apply on DataFrame evaluates first group only once

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it was safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and may have led to surprises (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417).

Now every group is evaluated only a single time.

In [20]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [21]: df
Out[21]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [22]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....:

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [23]: df.groupby("a").apply(func)
x
y
Out[23]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
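
The practical consequence for functions with side effects can be sketched like this (an illustrative sketch using the df and func from above):

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

seen = []

def func(group):
    seen.append(group.name)  # side effect: record each evaluated group
    return group

df.groupby("a").apply(func)

# Each group is now evaluated exactly once, so no duplicate "x"
assert seen == ["x", "y"]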

Concatenating sparse values

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).

In [24]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [25]: type(pd.concat([df, df]))
Out[25]: pandas.core.frame.DataFrame

This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.

The .str accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.

Previous behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool

New behavior:

In [26]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [27]: s
Out[27]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [28]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')

/pandas/pandas/core/strings.py in wrapper(self, *args, **kwargs)
   1840                     )
   1841                 )
-> 1842                 raise TypeError(msg)
   1843             return func(self, *args, **kwargs)
   1844 

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
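
Since Series.str.decode() is still permitted on bytes data, one way forward (a sketch, assuming ASCII-encoded bytes) is to decode first and then use the string methods:

import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# decode() is exempt from the stricter check; afterwards the values are
# str, so .str.startswith works again.
print(s.str.decode('ascii').str.startswith('a'))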

Categorical dtypes are preserved during groupby

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas now will preserve these dtypes (GH18502).

In [29]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [30]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [31]: df
Out[31]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [32]: df.dtypes
Out[32]: 
payload       int64
col        category
Length: 2, dtype: object

Previous behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New behavior:

In [33]: df.groupby('payload').first().col.dtype
Out[33]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)

Incompatible Index type unions

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).

Previous behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [34]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[34]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')

In [35]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[35]: Index([1, 2, 3], dtype='object')

Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
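
A small sketch of that coercion (illustrative, not from the release notes):

import pandas as pd

ints = pd.Index([1, 2, 3])
floats = pd.Index([0.5, 1.5])

# Integer and floating dtypes count as "compatible", so the union is
# float64 and very large integers could lose precision.
print(ints.union(floats))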

DataFrame groupby ffill/bfill no longer return group labels

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (GH21521).

In [36]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [37]: df
Out[37]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [38]: df.groupby("a").ffill()
Out[38]: 
   b
0  1
1  2

[2 rows x 1 columns]
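
If the group labels are still wanted alongside the filled values, one workaround (an illustrative sketch) is to join them back:

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# ffill now returns only the filled columns; re-attach the key manually
filled = df.groupby("a").ffill()
result = df[["a"]].join(filled)
print(result)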

DataFrame describe on an empty categorical / object column will return top and freq

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397).

In [39]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [40]: df
Out[40]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0

New behavior:

In [41]: df.describe()
Out[41]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

__str__ methods now call __repr__ rather than vice versa

pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method, if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__, if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH26495).
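
A minimal subclassing sketch of the recommended pattern (the subclass name is hypothetical):

import pandas as pd

class MyFrame(pd.DataFrame):
    # Define the representation in __repr__; str() now falls through to
    # it, so a separate __str__ is no longer needed.
    def __repr__(self):
        return "MyFrame\n" + super().__repr__()

print(MyFrame({"a": [1, 2]}))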

Indexing an IntervalIndex with Interval objects

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).

In [42]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [43]: ii
Out[43]: 
IntervalIndex([(0, 4], (1, 5], (5, 8]],
              closed='right',
              dtype='interval[int64]')

The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [44]: pd.Interval(1, 2, closed='neither') in ii
Out[44]: False

In [45]: pd.Interval(-10, 10, closed='both') in ii
Out[45]: False

The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [46]: s = pd.Series(list('abc'), index=ii)

In [47]: s
Out[47]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [48]: s[pd.Interval(1, 5)]
Out[48]: 'b'

In [49]: s.loc[pd.Interval(1, 5)]
Out[49]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior:

In [50]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [51]: idxr
Out[51]: array([ True,  True, False])

In [52]: s[idxr]
Out[52]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [53]: s.loc[idxr]
Out[53]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).

In [54]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [55]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [56]: s1
Out[56]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [57]: s2
Out[57]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior:

In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64

New behavior:

In [58]: np.power(s1, s2)
Out[58]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.

In [59]: np.power(s1, s2.array)
Out[59]: 
a      1
b     16
c    243
Length: 3, dtype: int64

Categorical.argsort now places missing values at the end

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).

In [60]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior:

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]

New behavior:

In [61]: cat.argsort()
Out[61]: array([2, 0, 1])

In [62]: cat[cat.argsort()]
Out[62]: 
[a, b, NaN]
Categories (2, object): [a < b]

Column order is preserved when passing a list of dicts to DataFrame

Starting with Python 3.7 the key-order of dicts is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (GH27309).

In [63]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....:

Previous behavior:

The columns were lexicographically sorted previously,

In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

New behavior:

The column order now matches the insertion order of the keys in the dicts, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.

In [64]: pd.DataFrame(data)
Out[64]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]
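
If the old lexicographically sorted column order is needed, sorting the columns explicitly restores it (a small sketch):

import pandas as pd

data = [
    {'name': 'Joe', 'state': 'NY', 'age': 18},
    {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
    {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
]

# sort_index(axis=1) orders the columns alphabetically, matching the
# pre-0.25 behavior.
print(pd.DataFrame(data).sort_index(axis=1))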

Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:

Package          Minimum Version  Required
numpy            1.13.3           X
pytz             2015.4           X
python-dateutil  2.6.1            X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version
beautifulsoup4  4.6.0
fastparquet     0.2.1
gcsfs           0.2.2
lxml            3.8.0
matplotlib      2.2.2
openpyxl        2.4.8
pyarrow         0.9.0
pymysql         0.7.1
pytables        3.4.2
scipy           0.19.0
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0

See Dependencies and Optional dependencies for more.

Other API changes

Deprecations

Sparse subclasses

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided by a Series or DataFrame with sparse values.

Previous way

In [65]: df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})

In [66]: df.dtypes
Out[66]: 
A    Sparse[int64, nan]
Length: 1, dtype: object

New way

In [67]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})

In [68]: df.dtypes
Out[68]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical. See Migrating for more (GH19239).
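
An existing dense frame can likewise be migrated by casting to a SparseDtype; a minimal sketch:

import pandas as pd

dense = pd.DataFrame({"A": [0, 0, 1, 2]})

# astype with a SparseDtype replaces the deprecated SparseDataFrame
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print(sparse.dtypes)               # A    Sparse[int64, 0]
print(sparse["A"].sparse.density)  # 0.5: half the values are non-fill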

msgpack format

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects (GH27084).
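
For example, one possible pyarrow round trip via the Arrow IPC stream format (a sketch, assuming a reasonably recent pyarrow; the exact writer API has varied across versions):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})

# Write the frame to an in-memory Arrow IPC stream...
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()

# ...and read it back into a DataFrame.
reader = pa.ipc.open_stream(sink.getvalue())
roundtrip = reader.read_all().to_pandas()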

Other deprecations

Removal of prior version deprecations/changes

Performance improvements

Bug fixes

Categorical

Datetime

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Plotting

Groupby/resample/rolling

Reshaping

Sparse

Build changes

ExtensionArray

Other

Contributors

(Translator's note: the official list of contributors was not published.)
