Aggregation in Pandas

Posted on 2025-02-06 09:03:10

  1. How can I perform aggregation with Pandas?
  2. No DataFrame after aggregation! What happened?
  3. How can I aggregate mainly string columns (to lists, tuples, strings with separator)?
  4. How can I aggregate counts?
  5. How can I create a new column filled by aggregated values?

I've seen these recurring questions asking about various facets of the pandas aggregate functionality.
Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts.
The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next instalment in a series of helpful user-guides:

Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!


Comments (2)

夏见 2025-02-13 09:03:10

Question 1

How can I perform aggregation with Pandas?

Expanded aggregation documentation.

Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has fewer rows than, or the same number of rows as, the original.
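
To make the "fewer or the same number of rows" point concrete, here is a minimal sketch. The small frame and the variable names (agg, same) are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Usual case: one output row per group, so fewer rows than the input
agg = df.groupby('A')['C'].sum()
print(len(df), len(agg))

# Edge case: every row is its own group, so the row count is unchanged
same = df.groupby(df.index)['C'].sum()
print(len(df), len(same))
```

The second case is rarely useful in practice; it only shows why the row count can stay the same.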

Some common aggregating functions are tabulated below:

Function      Description
mean()        Compute mean of groups
sum()         Compute sum of group values
size()        Compute group sizes
count()       Compute count of group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D  E
0  foo    one  2  3  0
1  foo    two  4  1  0
2  bar  three  2  1  1
3  foo    two  1  0  3
4  bar    two  3  1  4
5  foo    one  2  1  0

Aggregation by filtered columns and Cython implemented functions:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

An aggregate function is applied to all columns that are not specified in the groupby function (here, every column except A and B):

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

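You can also specify only some columns used for aggregation in a list after the groupby function:

# select several columns with a list (double brackets); a bare 'C','D' tuple is not accepted by recent pandas
df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()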
print (df3)
     A      B  C  D
0  bar  three  2  1
1  bar    two  3  1
2  foo    one  4  4
3  foo    two  5  1

Same results by using function DataFrameGroupBy.agg:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

To apply multiple functions to one column, use a list of tuples, each holding the new column name and the aggregation function:

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
     A      B  average  total
0  bar  three      2.0      2
1  bar    two      3.0      3
2  foo    one      2.0      4
3  foo    two      2.5      5

To apply multiple functions to all columns, it is also possible to pass a list of tuples:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
                C             D             E
          average total average total average total
A   B
bar three     2.0     2     1.0     1     1.0     1
    two       3.0     3     1.0     1     4.0     4
foo one       2.0     4     2.0     4     0.0     0
    two       2.5     5     0.5     1     1.5     3

The result then has a MultiIndex in the columns:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To flatten the MultiIndex into ordinary columns, use map with join:

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

Another solution is to pass a list of aggregate functions, then flatten the MultiIndex and change the column names with str.replace:

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                  .str.replace('sum','total')
                  .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

To specify the aggregation function for each column separately, pass a dictionary:

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
     A      B  C_total  D_average
0  bar  three        2        1.0
1  bar    two        3        1.0
2  foo    one        4        2.0
3  foo    two        5        0.5

You can pass a custom function too:

# sum of the first and the last value in each group
def func(x):
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
     A      B  C_total  D_sum_first_and_last
0  bar  three        2                     2
1  bar    two        3                     2
2  foo    one        4                     4
3  foo    two        5                     1
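
As a further sketch of passing a custom function, the example below computes the spread (max minus min) of D per group. The helper name value_range and the output column name D_range are hypothetical, and the keyword form of agg used here assumes pandas 0.25 or later.

```python
import pandas as pd

# Same A, B and D values as in the example frame above, written out literally
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'D': [3, 1, 1, 0, 1, 1]})

def value_range(x):
    # x is the Series of D values belonging to one group
    return x.max() - x.min()

# Keyword form: new_column_name=(source_column, function)
out = (df.groupby(['A', 'B'])
         .agg(D_range=('D', value_range))
         .reset_index())
print(out)
```

Naming the result in the agg call avoids the separate rename step used above.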

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A    B
bar  three    2
     two      3
foo  one      4
     two      5
Name: C, dtype: int32

First check the Index and type of a Pandas object:

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
           names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

There are two ways to turn a Series with a MultiIndex into a DataFrame with ordinary columns:

  • add parameter as_index=False

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

  • use Series.reset_index

df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

If you group by one column:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar    5
foo    9
Name: C, dtype: int32

... you get a Series with an Index:

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

The solution is the same as for the MultiIndex Series:

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
     A  C
0  bar  5
1  foo  9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
     A  C
0  bar  5
1  foo  9
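
A quick way to convince yourself that both routes give the same DataFrame is to compare them directly. This is only a sketch with a small made-up frame; the variable names are illustrative.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

as_series = df.groupby('A')['C'].sum()                   # Series indexed by A
via_flag = df.groupby('A', as_index=False)['C'].sum()    # DataFrame with columns A, C
via_reset = df.groupby('A')['C'].sum().reset_index()     # DataFrame with columns A, C

print(type(as_series).__name__)
print(type(via_flag).__name__)
print(via_flag.equals(via_reset))
```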

Question 3

How can I aggregate mainly string columns (to lists, tuples, strings with separator)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                   'D' : [1,2,3,2,3,1,2]})
print (df)
   A      B      C  D
0  a    one  three  1
1  c    two    one  2
2  b  three    two  3
3  b    two    two  2
4  a    two  three  3
5  c    one    two  1
6  b  three    one  2

Instead of an aggregation function, it is possible to pass list, tuple, or set to convert the column:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]
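
Only list is shown above, so here is a minimal sketch of the tuple and set variants on the same A and B values. Note that set drops duplicates within each group and does not guarantee element order.

```python
import pandas as pd

# Same A and B values as in the example frame above
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

# tuple keeps duplicates and order, like list
tuples = df.groupby('A')['B'].agg(tuple)

# set keeps only the distinct values per group
sets = df.groupby('A')['B'].agg(set)

print(tuples)
print(sets)
```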

An alternative is to use GroupBy.apply:

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

For converting to strings with a separator, use .join only if it is a string column:

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
   A                B
0  a          one,two
1  b  three,two,three
2  c          two,one

If it is a numeric column, use a lambda function with astype for converting to strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

Another solution is converting to strings before groupby:

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

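To convert all columns, don't pass a list of column(s) after groupby.
In older pandas versions column D is missing from the result because of the automatic exclusion of 'nuisance' columns: ','.join fails on the numeric column, so that column is silently dropped. Recent pandas versions (2.0 and later) no longer drop such columns and raise an error instead.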

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
   A                B            C
0  a          one,two  three,three
1  b  three,two,three  two,two,one
2  c          two,one      one,two

So it's necessary to convert all columns into strings, and then get all columns:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
   A                B            C      D
0  a          one,two  three,three    1,3
1  b  three,two,three  two,two,one  3,2,2
2  c          two,one      one,two    2,1

Question 4

How can I aggregate counts?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                   'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
   A      B      C    D
0  a    one  three  NaN
1  c    two    NaN  2.0
2  b  three    NaN  3.0
3  b    two    two  2.0
4  a    two  three  3.0
5  c    one    two  NaN
6  b  three    one  2.0

Use GroupBy.size for the size of each group:

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
   A  COUNT
0  a      2
1  b      3
2  c      2

Function GroupBy.count excludes missing values:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
   A  COUNT
0  a      2
1  b      2
2  c      1

The same function can be applied to multiple columns to count non-missing values:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
   A  B_COUNT  C_COUNT  D_COUNT
0  a        2        2        1
1  b        3        2        3
2  c        2        1        1
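
To see size and count side by side, the sketch below computes both for column C in one call. The output column names n_rows, n_non_missing and n_missing are made up for this illustration, and the keyword form of agg assumes pandas 0.25 or later.

```python
import numpy as np
import pandas as pd

# Same A and C values as in the example frame above
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

out = (df.groupby('A')['C']
         .agg(n_rows='size',            # counts every row, including NaN
              n_non_missing='count')    # counts only non-missing values
         .reset_index())

# The difference is the number of missing values per group
out['n_missing'] = out['n_rows'] - out['n_non_missing']
print(out)
```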

A related function is Series.value_counts. It returns a Series containing counts of unique values in descending order, so that the first element is the most frequently occurring element. It excludes NaN values by default.

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
   A  COUNT
0  b      3
1  a      2
2  c      2
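
Because missing values are excluded by default, a column that contains NaN needs dropna=False if they should be counted too. A minimal sketch on the C values from the frame above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same C values as in the example frame above
s = pd.Series(['three', np.nan, np.nan, 'two', 'three', 'two', 'one'], name='C')

default = s.value_counts()               # NaN is left out
with_nan = s.value_counts(dropna=False)  # NaN gets its own entry

print(default)
print(with_nan)
```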

If you want the same output as with groupby + size, add Series.sort_index:

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
   A  COUNT
0  a      2
1  b      3
2  c      2

Question 5

How can I create a new column filled by aggregated values?

Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

See the Pandas documentation for more information.

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                    'B' : ['one', 'two', 'three','two', 'two', 'one'],
                    'C' : np.random.randint(5, size=6),
                    'D' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D
0  foo    one  2  3
1  foo    two  4  1
2  bar  three  2  1
3  foo    two  1  0
4  bar    two  3  1
5  foo    one  2  1


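# one new column from a single aggregated column
df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')

# several new columns at once: select the source columns with a list (double brackets)
df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')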

print (df)

     A      B  C  D  C1  C2  C3  D3  C4  D4
0  foo    one  2  3   9   4   9   5   4   4
1  foo    two  4  1   9   5   9   5   5   1
2  bar  three  2  1   5   2   5   2   2   1
3  foo    two  1  0   9   5   9   5   5   1
4  bar    two  3  1   5   3   5   2   3   1
5  foo    one  2  1   9   4   9   5   4   4
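
Because the transformed result lines up row by row with the original, it can be combined with the source column directly. The sketch below derives each row's share of its group total and its deviation from the group mean; the column names C_share and C_minus_mean are made up for illustration.

```python
import pandas as pd

# Same A and C values as in the example frame above, written out literally
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Group total broadcast back to every row
group_total = df.groupby('A')['C'].transform('sum')

# Share of each row within its group
df['C_share'] = df['C'] / group_total

# Deviation of each row from its group mean
df['C_minus_mean'] = df['C'] - df.groupby('A')['C'].transform('mean')
print(df)
```
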

国产ˉ祖宗 2025-02-13 09:03:10

If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:

Let us first create a Pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'key2' : ['c','c','d','d','e'],
                   'value1' : [1,2,2,3,3],
                   'value2' : [9,8,7,6,5]})

df.head(5)

Here is what the table we created looks like:

key1 key2 value1 value2
a c 1 9
a c 2 8
a d 2 7
b d 3 6
a e 3 5

1. Aggregating With Row Reduction Similar to SQL Group By

1.1 If Pandas version >=0.25

Check your Pandas version by running print(pd.__version__). If your Pandas version is 0.25 or above then the following code will work:

# each keyword is the new column name; the tuple is (source column, aggregation)
df_agg = df.groupby(['key1','key2']).agg(mean_of_value1=('value1', 'mean'),
                                         sum_of_value2=('value2', 'sum'),
                                         count_of_value1=('value1','size')
                                         ).reset_index()


df_agg.head(5)

The resulting data table will look like this:

key1 key2 mean_of_value1 sum_of_value2 count_of_value1
a c 1.5 17 2
a d 2.0 7 1
a e 3.0 5 1
b d 3.0 6 1

The SQL equivalent of this is:

SELECT
      key1
     ,key2
     ,AVG(value1) AS mean_of_value1
     ,SUM(value2) AS sum_of_value2
     ,COUNT(*) AS count_of_value1
FROM
    df
GROUP BY
     key1
    ,key2
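
The (column, function) tuples used above can also be spelled out with pd.NamedAgg, which some readers find easier to scan. This is a sketch of the same aggregation, again assuming pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit form of the (column, aggfunc) tuple
df_agg = (df.groupby(['key1', 'key2'])
            .agg(mean_of_value1=pd.NamedAgg(column='value1', aggfunc='mean'),
                 sum_of_value2=pd.NamedAgg(column='value2', aggfunc='sum'),
                 count_of_value1=pd.NamedAgg(column='value1', aggfunc='size'))
            .reset_index())
print(df_agg)
```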

1.2 If Pandas version <0.25

If your Pandas version is older than 0.25 then running the above code will give you the following error:

TypeError: aggregate() missing 1 required positional argument: 'arg'

To do the aggregation for both value1 and value2 on such versions, run this code instead:

df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})

# flatten the MultiIndex columns; strip('_') removes the trailing underscore left on the key columns
df_agg.columns = ['_'.join(col).strip('_') for col in df_agg.columns.values]

df_agg.head(5)

The resulting table will look like this:

key1 key2 value1_mean value1_count value2_sum
a c 1.5 2 17
a d 2.0 1 7
a e 3.0 1 5
b d 3.0 1 6

Renaming the columns needs to be done separately using the below code:

df_agg.rename(columns={"value1_mean" : "mean_of_value1",
                       "value1_count" : "count_of_value1",
                       "value2_sum" : "sum_of_value2"
                       }, inplace=True)

2. Create a Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)

If you want to do a SUMIF, COUNTIF, etc., as you would in Excel, where there is no reduction in rows, then use transform instead:

df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')

df.head(5)

The resulting data frame will look like this with the same number of rows as the original:

key1 key2 value1 value2 Total_of_value1_by_key1
a c 1 9 8
a c 2 8 8
a d 2 7 8
b d 3 6 3
a e 3 5 8
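
The example above covers the SUMIF case. A COUNTIF- or AVERAGEIF-style column follows the same pattern with a different function name, as in this sketch (the new column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# COUNTIF-like: number of non-missing value1 entries sharing the row's key1
df['Count_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('count')

# AVERAGEIF-like: mean of value2 over rows sharing the row's key2
df['Mean_of_value2_by_key2'] = df.groupby('key2')['value2'].transform('mean')

print(df)
```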

3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)

Finally, there might be cases where you want to create a rank column which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

Here is how you do that.

df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
             .groupby(['key1']) \
             .cumcount() + 1

df.head(5)

Note: we make the code multi-line by adding \ at the end of each line.

Here is what the resulting data frame looks like:

key1 key2 value1 value2 RN
a c 1 9 4
a c 2 8 3
a d 2 7 2
b d 3 6 1
a e 3 5 1
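
As a variation on the code above, wrapping the chain in parentheses avoids the trailing backslashes, and sorting by the partition key and RN makes the numbering easier to read. This is a sketch of the same technique, not a different method; the name ordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Sort by the ORDER BY columns, number the rows within each key1 partition,
# and let index alignment place the numbers back on the original rows
df['RN'] = (df.sort_values(['value1', 'value2'], ascending=[False, True])
              .groupby('key1')
              .cumcount() + 1)

# View the rows in partition order to read the numbering top to bottom
ordered = df.sort_values(['key1', 'RN'])
print(ordered)
```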

In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.

Other aggregating operators:

mean() Compute mean of groups

sum() Compute sum of group values

size() Compute group sizes

count() Compute count of group

std() Standard deviation of groups

var() Compute variance of groups

sem() Standard error of the mean of groups

describe() Generates descriptive statistics

first() Compute first of group values

last() Compute last of group values

nth() Take nth value, or a subset if n is a list

min() Compute min of group values

max() Compute max of group values
