What's new in v0.25.0 (July 18, 2019)

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray instead.

Warning

read_pickle() and read_msgpack() are only guaranteed to be backwards compatible back to pandas version 0.20.3 (GH27082).

These are the changes in pandas v0.25.0. See the release notes for a full changelog including other versions of pandas.

Enhancements

Groupby aggregation with relabeling

pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', np.mean),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (see Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply.

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

This type of aggregation is the recommended alternative to the deprecated behavior of passing a dict to a Series groupby aggregation (see Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

Groupby aggregation with multiple lambdas

You can now provide multiple lambda functions to a list-like aggregation with pandas.core.groupby.GroupBy.agg (GH26430).

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

Previously, these raised a SpecificationError.
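
If the auto-generated <lambda_0>/<lambda_1> column names are undesirable, one workaround (an illustrative sketch, not from the release notes) is to combine the named aggregation shown above with lambdas, so the output columns carry chosen names:

import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

# Named aggregation also accepts lambdas, so the output columns get
# the names chosen here rather than <lambda_0>/<lambda_1>.
result = animals.groupby('kind').height.agg(
    first=lambda x: x.iloc[0],
    last=lambda x: x.iloc[-1],
)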

Better repr for MultiIndex

Printing of MultiIndex instances now shows tuple data for each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex (GH13480):

The repr now looks like this:

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

In previous versions, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
   ...:            codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])

In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it is wider than options.display.width (default: 80 characters).
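
For illustration, a minimal sketch (not from the release notes) of tweaking these display options:

import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'abc'], range(500)])

# Show at most 10 tuples before the repr truncates with '...'
pd.set_option('display.max_seq_items', 10)
print(mi)

# Restore the default of 100 items
pd.reset_option('display.max_seq_items')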

Shorter truncated repr for Series and DataFrame

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

  • For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
  • For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).

This dual option allows one to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
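
A minimal sketch of how the two thresholds interact (the option names are those described above):

import pandas as pd

df = pd.DataFrame({'x': range(100)})

# 100 rows > max_rows (60), so the truncated repr shows min_rows (10) rows
print(df)

# Show 20 rows in the truncated repr instead
pd.set_option('display.min_rows', 20)
print(df)

# Restore the previous single-threshold behaviour
pd.set_option('display.min_rows', None)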

JSON normalize with max_level param support

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over which level to end normalization (GH23843):

In [9]: from pandas.io.json import json_normalize

In [10]: data = [{
   ....:     'CreatedBy': {'Name': 'User001'},
   ....:     'Lookup': {'TextField': 'Some text',
   ....:                'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
   ....:     'Image': {'a': 'b'}
   ....: }]
   ....: 

In [11]: json_normalize(data, max_level=1)
Out[11]: 
  CreatedBy.Name Lookup.TextField                    Lookup.UserField Image.a
0        User001        Some text  {'Id': 'ID001', 'Name': 'Name001'}       b

[1 rows x 4 columns]
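
As a hedged sketch of the other extreme (not shown in the release notes), max_level=0 should leave every nested dict intact as a cell value, while omitting max_level flattens all levels:

from pandas.io.json import json_normalize

data = [{'CreatedBy': {'Name': 'User001'},
         'Lookup': {'TextField': 'Some text',
                    'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
         'Image': {'a': 'b'}}]

# max_level=0: columns CreatedBy / Lookup / Image hold the raw dicts
top_only = json_normalize(data, max_level=0)

# default (max_level=None): fully flattened, e.g. Lookup.UserField.Id
flat = json_normalize(data)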

Series.explode to split list-like values to rows

Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes to individual rows. See the section on Exploding list-like column in the docs for more information (GH16538, GH10511).

Here is a typical use case. You have comma-separated strings in a column:

In [12]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ....:                    {'var1': 'd,e,f', 'var2': 2}])
   ....: 

In [13]: df
Out[13]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations:

In [14]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[14]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]
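
Series.explode() can also be used directly; a minimal sketch, assuming pandas 0.25:

import pandas as pd

s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])

# Each list element becomes its own row and the index label repeats;
# scalars pass through unchanged and an empty list becomes NaN.
print(s.explode())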

Other enhancements

Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing (GH24076, GH16785).

In [15]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [16]: df
Out[16]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [17]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[17]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
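
If a query is meant in the index's own timezone, one option (an illustrative sketch, not from the release notes) is to convert the query timestamps explicitly before slicing:

import pandas as pd

df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

# Convert the +04:00 query bounds into the index's timezone up front
start = pd.Timestamp('2019-01-01 12:00:00+04:00').tz_convert('US/Pacific')
end = pd.Timestamp('2019-01-01 13:00:00+04:00').tz_convert('US/Pacific')
print(df[start:end])  # same row as the string-based slice above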

MultiIndex constructed from levels and codes

Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed and NaN levels' corresponding codes would be reassigned as -1 (GH19387).

Previous behavior:

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New behavior:

In [18]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[18]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [19]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    270 
    271         if verify_integrity:
--> 272             new_codes = result._verify_integrity()
    273             result._codes = new_codes
    274 

/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    348                 raise ValueError(
    349                     "On level {level}, code value ({code})"
--> 350                     " < -1".format(level=i, code=level_codes.min())
    351                 )
    352             if not level.is_unique:

ValueError: On level 0, code value (-2) < -1

Groupby.apply on DataFrame evaluates first group only once

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it was safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and may have led to surprises (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417).

Now every group is evaluated only a single time.

In [20]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [21]: df
Out[21]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [22]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....:

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [23]: df.groupby("a").apply(func)
x
y
Out[23]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
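
The practical consequence for functions with side effects can be sketched like this (an illustrative sketch using the df and func from above):

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

seen = []

def func(group):
    seen.append(group.name)  # side effect: record each evaluated group
    return group

df.groupby("a").apply(func)

# Each group is now evaluated exactly once, so no duplicate "x"
assert seen == ["x", "y"]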

Concatenating sparse values

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).

In [24]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [25]: type(pd.concat([df, df]))
Out[25]: pandas.core.frame.DataFrame

This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.

The .str accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.

Previous behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool

New behavior:

In [26]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [27]: s
Out[27]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [28]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')

/pandas/pandas/core/strings.py in wrapper(self, *args, **kwargs)
   1840                     )
   1841                 )
-> 1842                 raise TypeError(msg)
   1843             return func(self, *args, **kwargs)
   1844 

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
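
Since Series.str.decode() is still permitted on bytes data, one way forward (a sketch, assuming ASCII-encoded bytes) is to decode first and then use the string methods:

import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# decode() is exempt from the stricter check; afterwards the values are
# str, so .str.startswith works again.
print(s.str.decode('ascii').str.startswith('a'))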

Categorical dtypes are preserved during groupby

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas now will preserve these dtypes (GH18502).

In [29]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [30]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [31]: df
Out[31]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [32]: df.dtypes
Out[32]: 
payload       int64
col        category
Length: 2, dtype: object

Previous behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New behavior:

In [33]: df.groupby('payload').first().col.dtype
Out[33]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)

Incompatible Index type unions

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).

Previous behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [34]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[34]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')

In [35]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[35]: Index([1, 2, 3], dtype='object')

Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
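
A small sketch of that coercion (illustrative, not from the release notes):

import pandas as pd

ints = pd.Index([1, 2, 3])
floats = pd.Index([0.5, 1.5])

# Integer and floating dtypes count as "compatible", so the union is
# float64 and very large integers could lose precision.
print(ints.union(floats))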

DataFrame groupby ffill/bfill no longer return group labels

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (GH21521).

In [36]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [37]: df
Out[37]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [38]: df.groupby("a").ffill()
Out[38]: 
   b
0  1
1  2

[2 rows x 1 columns]
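
If the group labels are still wanted alongside the filled values, one workaround (an illustrative sketch) is to join them back:

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# ffill now returns only the filled columns; re-attach the key manually
filled = df.groupby("a").ffill()
result = df[["a"]].join(filled)
print(result)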

DataFrame describe on an empty categorical / object column will return top and freq

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397).

In [39]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [40]: df
Out[40]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0

New behavior:

In [41]: df.describe()
Out[41]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

__str__ methods now call __repr__ rather than vice versa

pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method, if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__, if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH26495).
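
A minimal subclassing sketch of the recommended pattern (the subclass name is hypothetical):

import pandas as pd

class MyFrame(pd.DataFrame):
    # Define the representation in __repr__; str() now falls through to
    # it, so a separate __str__ is no longer needed.
    def __repr__(self):
        return "MyFrame\n" + super().__repr__()

print(MyFrame({"a": [1, 2]}))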

Indexing an IntervalIndex with Interval objects

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).

In [42]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [43]: ii
Out[43]: 
IntervalIndex([(0, 4], (1, 5], (5, 8]],
              closed='right',
              dtype='interval[int64]')

The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [44]: pd.Interval(1, 2, closed='neither') in ii
Out[44]: False

In [45]: pd.Interval(-10, 10, closed='both') in ii
Out[45]: False

The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [46]: s = pd.Series(list('abc'), index=ii)

In [47]: s
Out[47]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [48]: s[pd.Interval(1, 5)]
Out[48]: 'b'

In [49]: s.loc[pd.Interval(1, 5)]
Out[49]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior:

In [50]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [51]: idxr
Out[51]: array([ True,  True, False])

In [52]: s[idxr]
Out[52]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [53]: s.loc[idxr]
Out[53]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).

In [54]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [55]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [56]: s1
Out[56]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [57]: s2
Out[57]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior:

In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64

New behavior:

In [58]: np.power(s1, s2)
Out[58]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.

In [59]: np.power(s1, s2.array)
Out[59]: 
a      1
b     16
c    243
Length: 3, dtype: int64

Categorical.argsort now places missing values at the end

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).

In [60]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior:

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]

New behavior:

In [61]: cat.argsort()
Out[61]: array([2, 0, 1])

In [62]: cat[cat.argsort()]
Out[62]: 
[a, b, NaN]
Categories (2, object): [a < b]

Column order is preserved when passing a list of dicts to DataFrame

Starting with Python 3.7 the key-order of dicts is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (GH27309).

In [63]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....:

Previous behavior:

The columns were lexicographically sorted previously,

In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

New behavior:

The column order now matches the insertion order of the keys in the dicts, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.

In [64]: pd.DataFrame(data)
Out[64]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]
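
If the old lexicographically sorted column order is needed, sorting the columns explicitly restores it (a small sketch):

import pandas as pd

data = [
    {'name': 'Joe', 'state': 'NY', 'age': 18},
    {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
    {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
]

# sort_index(axis=1) orders the columns alphabetically, matching the
# pre-0.25 behavior.
print(pd.DataFrame(data).sort_index(axis=1))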

Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:

Package          Minimum Version  Required
numpy            1.13.3           X
pytz             2015.4           X
python-dateutil  2.6.1            X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version
beautifulsoup4  4.6.0
fastparquet     0.2.1
gcsfs           0.2.2
lxml            3.8.0
matplotlib      2.2.2
openpyxl        2.4.8
pyarrow         0.9.0
pymysql         0.7.1
pytables        3.4.2
scipy           0.19.0
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0

See Dependencies and Optional dependencies for more.

Other API changes

Deprecations

Sparse subclasses

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided by a Series or DataFrame with sparse values.

Previous way

In [65]: df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})

In [66]: df.dtypes
Out[66]: 
A    Sparse[int64, nan]
Length: 1, dtype: object

New way

In [67]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})

In [68]: df.dtypes
Out[68]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical. See Migrating for more (GH19239).
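
An existing dense frame can likewise be migrated by casting to a SparseDtype; a minimal sketch:

import pandas as pd

dense = pd.DataFrame({"A": [0, 0, 1, 2]})

# astype with a SparseDtype replaces the deprecated SparseDataFrame
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print(sparse.dtypes)               # A    Sparse[int64, 0]
print(sparse["A"].sparse.density)  # 0.5: half the values are non-fill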

msgpack format

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects (GH27084).
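
For example, one possible pyarrow round trip via the Arrow IPC stream format (a sketch, assuming a reasonably recent pyarrow; the exact writer API has varied across versions):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})

# Write the frame to an in-memory Arrow IPC stream...
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()

# ...and read it back into a DataFrame.
reader = pa.ipc.open_stream(sink.getvalue())
roundtrip = reader.read_all().to_pandas()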

Other deprecations

Removal of prior version deprecations/changes

Performance improvements

Bug fixes

Categorical

Datetime

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Plotting

Groupby/resample/rolling

Reshaping

Sparse

Build changes

ExtensionArray

Other

Contributors

(Translator's note: the official list of contributors was not published.)
