使用Pandas Groupby获取每个组(例如计数,均值等)的统计信息?

发布于 2025-01-31 05:11:43 字数 365 浏览 3 评论 0 原文

我有一个dataframe df ,我将其从中使用几列来 groupby

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

以上述方式,我几乎得到了我需要的表(dataframe)。缺少的是一个附加的列,该列包含每个组中的行数。换句话说,我的意思是,但我也想知道有多少人被用来获得这些手段。例如,在第一组中有8个值,在第二个值中,依此类推。

简而言之:如何获得 group 数据框架的统计信息?

I have a dataframe df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way, I almost get the table (dataframe) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.

In short: How do I get group-wise statistics for a dataframe?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(13

居里长安 2025-02-07 05:11:44

快速答案:

每组获得行计数的最简单方法是调用 .size(),它返回 seriper

df.groupby(['col1','col2']).size()

通常,您将此结果作为 dataframe (而不是 series ),因此您可以做:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')

如果您想了解如何计算每个组的行计数和其他统计信息,请继续阅读以下内容。


详细示例:

考虑以下示例数据框:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

首先使用 .size()获取行计数:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

然后让我们使用 .size()。reset_index(name ='counts' )要获得行计数:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

包括更多统计信息的结果

在要计算分组数据的统计信息时,

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

,通常看起来像这样:由于嵌套列,上面的结果有点烦人标签,也是因为行计数为每个列。

为了获得对输出的更多控制,我通常将统计信息分为单个聚合,然后使用 JOIN 组合。看起来像这样:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63


脚注

用于生成测试数据的代码如下所示:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 

免责声明:

如果您要汇总的某些列具有零值,那么您确实想将组行视为每个列的独立聚合。否则,您可能会误导实际使用多少记录来计算均值之类的内容,因为熊猫会在平均计算中删除 nan 条目而不告诉您。

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()

Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63


Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.

黑白记忆 2025-02-07 05:11:44

groupby 对象上, agg 函数可以将列表列入一次应用多种聚合方法。这应该给您带来所需的结果:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
杀お生予夺 2025-02-07 05:11:44

瑞士军刀:

返回 count sean std 和其他有用的统计信息。

df.groupby(['A', 'B'])['C'].describe()

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

要获取特定的统计信息,只需选择它们,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

注意:如果您只需要计算1或2个统计数据,则可能是
更快地使用 groupby.agg ,只需计算这些列
您正在执行浪费计算。

描述适用于多列(Change ['C'] to ['C','d'] < /code> - 或完全删除它 - 看看会发生什么,结果是多索引的柱状数据框架)。

您还获得了字符串数据的不同统计信息。这是一个示例,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

有关更多信息,请参见文档


pandas&gt; = 1.1: dataframe.value_counts

如果您只想捕获每个组的大小,则可以从PANDAS 1.1获得,这将切除 groupby ,并且更快。

df.value_counts(subset=['col1', 'col2'])

最小示例

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df.value_counts(['A', 'B']) 

A    B    
foo  two      2
     one      2
     three    1
bar  two      1
     three    1
     one      1
dtype: int64

其他统计分析工具

如果您找不到上面要寻找的内容,则用户指南< /a>具有支持的统计分析,相关和回归工具的全面列表。

Swiss Army Knife: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['A', 'B'])['C'].describe()

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

Note: if you only need to compute 1 or 2 stats then it might be
faster to use groupby.agg and just compute those columns otherwise
you are performing wasteful computation.

describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).

You also get different statistics for string data. Here's an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


pandas >= 1.1: DataFrame.value_counts

This is available from pandas 1.1 if you just want to capture the size of every group, this cuts out the GroupBy and is faster.

df.value_counts(subset=['col1', 'col2'])

Minimal Example

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df.value_counts(['A', 'B']) 

A    B    
foo  two      2
     one      2
     three    1
bar  two      1
     three    1
     one      1
dtype: int64

Other Statistical Analysis Tools

If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statical analysis, correlation, and regression tools.

月依秋水 2025-02-07 05:11:44

获得多个统计数据,折叠索引并保留列名称:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

产生:

“

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

Produces:

**enter image description here**

夏九 2025-02-07 05:11:44

我们可以使用GroupBy和Count轻松完成。但是,我们应该记住使用reset_index()。

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
甜中书 2025-02-07 05:11:44

请尝试此代码,

new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df

我认为代码将添加一个名为“计数”的列,每个组的计数

Please try this code

new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df

I think that code will add a column called 'count it' which count of each group

深居我梦 2025-02-07 05:11:44

创建一个组对象并调用类似于以下示例的方法:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Create a group object and call methods like below example:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 
耳钉梦 2025-02-07 05:11:44

如果您熟悉tidyverse r软件包,这是一种在python中进行的方法:

from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep

df = tibble(
  col1=rep(['A', 'B'], 5), 
  col2=rep(['C', 'D'], each=5), 
  col3=rnorm(10), 
  col4=rnorm(10)
)
df >> group_by(f.col1, f.col2) >> summarise(
  count=n(),
  col3_mean=mean(f.col3), 
  col4_mean=mean(f.col4)
)
  col1 col2  n  mean_col3  mean_col4
0    A    C  3  -0.516402   0.468454
1    A    D  2  -0.248848   0.979655
2    B    C  2   0.545518  -0.966536
3    B    D  3  -0.349836  -0.915293
[Groups: ['col1'] (n=2)]

我是 datar的作者软件包。如果您对使用它有任何疑问,请随时提交问题。

If you are familiar with tidyverse R packages, here is a way to do it in python:

from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep

df = tibble(
  col1=rep(['A', 'B'], 5), 
  col2=rep(['C', 'D'], each=5), 
  col3=rnorm(10), 
  col4=rnorm(10)
)
df >> group_by(f.col1, f.col2) >> summarise(
  count=n(),
  col3_mean=mean(f.col3), 
  col4_mean=mean(f.col4)
)
  col1 col2  n  mean_col3  mean_col4
0    A    C  3  -0.516402   0.468454
1    A    D  2  -0.248848   0.979655
2    B    C  2   0.545518  -0.966536
3    B    D  3  -0.349836  -0.915293
[Groups: ['col1'] (n=2)]

I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

薔薇婲 2025-02-07 05:11:44

pivot_table 带有特定 aggfunc s

grogne> s gotegrame 也可以使用 pivot_table 。它产生的表与Excel Pivot表不太不同。基本思想是将列以 values = 和Grouper列的汇总列为 index = ,而任何聚合器的功能为 aggfunc = ((可以接受 groupby.agg 的所有优化功能都可以)。

pivot_table groupby.agg 的一个优点是,对于多列,它产生了一个 size 列列,而 groupby.agg 为每列创建 size 列(除了所有列都是冗余)。

agg_df = df.pivot_table(
    values=['col3', 'col4', 'col5'], 
    index=['col1', 'col2'], 
    aggfunc=['size', 'mean', 'median']
).reset_index()
# flatten the MultiIndex column (should be omitted if MultiIndex is preferred)
agg_df.columns = [i if not j else f"{j}_{i}" for i,j in agg_df.columns]

使用命名汇总的自定义列名称

为自定义列名称,而不是多个重命名调用,从开始时使用命名汇总。

来自 docs

为了支持对输出列名称的控制列的聚合,Pandas接受了groupby.agg()中的特殊语法,称为“命名聚合”,其中

  • 关键字是输出列名
  • 值是元组,其第一个元素是要选择的列,第二个元素是用于该列的汇总。 Pandas提供了pandas.namedagg名为Tuple的字段['列','aggfunc'],以使其更清楚地参数。像往常一样,汇总可以是可叫的或字符串的别名。

例如,要生成 col3 col4 col5 的汇总数据框架的平均值和计数,可以使用以下代码。请注意,它作为 groupby.agg 的一部分执行重命名列的步骤。

aggfuncs = {f'{c}_{f}': (c, f) for c in ['col3', 'col4', 'col5'] for f in ['mean', 'count']}
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**aggfuncs)

的另一种用例,名为Contregation 是每列需要不同的聚合函数。例如,如果自定义列仅需要 col3 的平均值,则需要使用 col4 min col5 的中位数。名称,可以使用以下代码完成。

agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(col3_mean=('col3', 'mean'), col4_median=('col4', 'median'), col5_min=('col5', 'min'))
# or equivalently,
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**{'_'.join(p): p for p in [('col3', 'mean'), ('col4', 'median'), ('col5', 'min')]})

pivot_table with specific aggfuncs

For a dataframe of aggregate statistics, pivot_table can be used as well. It produces a table not too dissimilar from Excel pivot table. The basic idea is to pass in the columns to be aggregated as values= and grouper columns as index= and whatever aggregator functions as aggfunc= (all of the optimized functions that are admissible for groupby.agg are OK).

One advantage of pivot_table over groupby.agg is that for multiple columns it produces a single size column whereas groupby.agg which creates a size column for each column (all except one are redundant).

agg_df = df.pivot_table(
    values=['col3', 'col4', 'col5'], 
    index=['col1', 'col2'], 
    aggfunc=['size', 'mean', 'median']
).reset_index()
# flatten the MultiIndex column (should be omitted if MultiIndex is preferred)
agg_df.columns = [i if not j else f"{j}_{i}" for i,j in agg_df.columns]

res1

Use named aggregation for custom column names

For custom column names, instead of multiple rename calls, use named aggregation from the beginning.

From the docs:

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

As an example, to produce aggregate dataframe where each of col3, col4 and col5 has its mean and count computed, the following code could be used. Note that it does the renaming columns step as part of groupby.agg.

aggfuncs = {f'{c}_{f}': (c, f) for c in ['col3', 'col4', 'col5'] for f in ['mean', 'count']}
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**aggfuncs)

res3

Another use case of named aggregation is if each column needs a different aggregator function. For example, if only the mean of col3, median of col4 and min of col5 are needed with custom column names, it can be done using the following code.

agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(col3_mean=('col3', 'mean'), col4_median=('col4', 'median'), col5_min=('col5', 'min'))
# or equivalently,
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**{'_'.join(p): p for p in [('col3', 'mean'), ('col4', 'median'), ('col5', 'min')]})

res2

看海 2025-02-07 05:11:44
df_group = (df.groupby(['city','gender'])[['age',"grade"]]
     .agg([('average','mean'),('freq','count')])
     .reset_index())

城市和性别是群体列。

年龄和等级是依赖列(我对它们计算的平均值)。

平均列的名称为平均
计数列的名称将为freq

”在此处输入图像描述”

df_group = (df.groupby(['city','gender'])[['age',"grade"]]
     .agg([('average','mean'),('freq','count')])
     .reset_index())

city and gender are the groups columns.

age and grade are the the depend columns (which I calculate mean on them).

the name of the mean column would be average
the name of the count column would be freq

enter image description here

拥有 2025-02-07 05:11:44

使分组列回到

animals.groupby("kind", as_index=False).agg(
    min_height=("height", "min"),
    max_height=("height", "max"),
    average_weight=("weight", "mean"),
)

这将列名称和函数名称作为字符串。 sql等效是

SELECT
  MIN(height) AS min_height,
  MAX(height) AS max_height, 
  AVG(height) AS average_weight
FROM animals
GROUP BY kind

SQL-style aggregation functions are supported by "named aggregation" with as_index=False to get the grouping column back:

animals.groupby("kind", as_index=False).agg(
    min_height=("height", "min"),
    max_height=("height", "max"),
    average_weight=("weight", "mean"),
)

This uses column names and function names as strings. The SQL equivalent is

SELECT
  MIN(height) AS min_height,
  MAX(height) AS max_height, 
  AVG(height) AS average_weight
FROM animals
GROUP BY kind

Common aggregation functions docs (count, max, min, first, etc.)

过期情话 2025-02-07 05:11:44

另一种选择:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

    A   B       C           D
0   foo one   0.808197   2.057923
1   bar one   0.330835  -0.815545
2   foo two  -1.664960  -2.372025
3   bar three 0.034224   0.825633
4   foo two   1.131271  -0.984838
5   bar two   2.961694  -1.122788
6   foo one   -0.054695  0.503555
7   foo three 0.018052  -0.746912

pd.crosstab(df.A, df.B).stack().reset_index(name='count')

输出:

    A   B     count
0   bar one     1
1   bar three   1
2   bar two     1
3   foo one     2
4   foo three   1
5   foo two     2

Another alternative:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

    A   B       C           D
0   foo one   0.808197   2.057923
1   bar one   0.330835  -0.815545
2   foo two  -1.664960  -2.372025
3   bar three 0.034224   0.825633
4   foo two   1.131271  -0.984838
5   bar two   2.961694  -1.122788
6   foo one   -0.054695  0.503555
7   foo three 0.018052  -0.746912

pd.crosstab(df.A, df.B).stack().reset_index(name='count')

Output:

    A   B     count
0   bar one     1
1   bar three   1
2   bar two     1
3   foo one     2
4   foo three   1
5   foo two     2
泪是无色的血 2025-02-07 05:11:44

我认为最简单的方法使您可以:

a)轻松选择每列不同的聚合功能,

b)需要 no 以后重命名或连接列,

使用

df.groupby(['col1', 'col2']).agg(
  counts=('col1', 'count'),
  col3_mean=('col3', 'mean'),
  col4_median=('col4', 'median'),
  col4_min=('col4', 'min'),
)

考虑到与 Pedro的答案

  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

此方法得出相同的结果:

  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63

I think the simplest approach that allows you to:

a) easily choose different aggregation functions for each column,

b) requires no later renaming or joining of columns,

is using Named aggregation:

df.groupby(['col1', 'col2']).agg(
  counts=('col1', 'count'),
  col3_mean=('col3', 'mean'),
  col4_median=('col4', 'median'),
  col4_min=('col4', 'min'),
)

Example:

Considering the same example dataframe as in Pedro's answer:

  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

This approach yields the same results:

  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文