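Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?
I have a dataframe df and I use several columns from it to groupby:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
In the above way I almost get the table (dataframe) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to obtain them. For example, there are 8 values in the first group, and so on for the others.
In short: how do I get group-wise statistics for a dataframe?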
Quick Answer:
The simplest way to get row counts per group is to call .size() on the grouped object, for example df.groupby(['col1', 'col2']).size(), which returns a Series.
Usually you want this result as a DataFrame (instead of a Series), which you can get by chaining .reset_index(name='counts') onto that call.
If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
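A minimal sketch of both calls, assuming a small made-up frame in which col1 and col2 are the grouping keys (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical data; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": ["x", "x", "y", "x", "x"],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Row count per group, as a Series indexed by (col1, col2).
sizes = df.groupby(["col1", "col2"]).size()

# The same counts as a flat DataFrame with a column named "counts".
counts = df.groupby(["col1", "col2"]).size().reset_index(name="counts")
print(counts)
```

The Series form is convenient for lookups by group key, while the DataFrame form is easier to merge or export.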
Detailed example:
Consider the following example dataframe:
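For illustration, a small stand-in frame with made-up values can be built as follows. The column names col1 to col4 follow the question; the numbers, and the single missing value in col3, are assumptions chosen to make the later points easy to see:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "B"],
    "col2": ["x", "x", "y", "x", "y", "y"],
    "col3": [0.5, 1.5, np.nan, 2.0, 3.0, 5.0],  # one NaN on purpose
    "col4": [10, 20, 30, 40, 50, 60],
})
print(df)
```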
First, let's use .size() to get the row counts as a Series.
Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame.
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
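A sketch of that pattern, assuming a dictionary that maps each column to a list of functions (the choice of mean, median, min and count, and the data, are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Invented values; col3 contains one NaN.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "B"],
    "col2": ["x", "x", "y", "x", "y", "y"],
    "col3": [0.5, 1.5, np.nan, 2.0, 3.0, 5.0],
    "col4": [10, 20, 30, 40, 50, 60],
})

# Each column gets its own list of aggregation functions.
stats = df.groupby(["col1", "col2"]).agg({
    "col3": ["mean", "count"],
    "col4": ["median", "min", "count"],
})

# The columns are a two-level MultiIndex such as ("col3", "mean").
print(stats.columns.tolist())
```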
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output, I usually split the statistics into individual aggregations that I then combine using join.
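One way this could look, as a sketch under assumed data and assumed output names (counts, col3_mean and col4_median are hypothetical labels, not taken from the answer):

```python
import numpy as np
import pandas as pd

# Invented values; col3 contains one NaN.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "B"],
    "col2": ["x", "x", "y", "x", "y", "y"],
    "col3": [0.5, 1.5, np.nan, 2.0, 3.0, 5.0],
    "col4": [10, 20, 30, 40, 50, 60],
})

gb = df.groupby(["col1", "col2"])

# One aggregation per output column, each with a flat, explicit name,
# joined on the shared (col1, col2) index.
counts = gb.size().to_frame(name="counts")
result = (
    counts
    .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
    .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
    .reset_index()
)
print(result)
```

Because every piece shares the same group index, join lines the rows up without any extra keys, and the result has a single level of column labels.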
Footnotes
The code used to generate the test data is shown below:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop NaN entries in the mean calculation without telling you about it.
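The difference is easy to see on a tiny made-up example (the data are an assumption): size counts every row in the group, count counts only non-null values, and mean is computed over the non-null values.

```python
import numpy as np
import pandas as pd

# Invented values; group "A" has one missing entry in col3.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col3": [1.0, np.nan, 3.0, 4.0, 6.0],
})

gb = df.groupby("col1")["col3"]
summary = pd.DataFrame({
    "size": gb.size(),    # rows in the group, NaN included
    "count": gb.count(),  # non-null values only
    "mean": gb.mean(),    # NaN entries are skipped
})
print(summary)
```

For group "A" the mean is based on two values even though the group has three rows, which is exactly the mismatch the disclaimer warns about.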
On a groupby object, the agg function can take a list to apply several aggregation methods at once, for example .agg(['mean', 'count']). This should give you the result you need.
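A short sketch applied to the columns from the question, with invented values:

```python
import pandas as pd

# Invented values; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "B"],
    "col2": ["x", "x", "y", "x", "y", "y"],
    "col3": [0.5, 1.5, 2.5, 2.0, 3.0, 5.0],
    "col4": [10, 20, 30, 40, 50, 60],
})

# Every non-key column gets both a mean and a count.
result = (
    df[["col1", "col2", "col3", "col4"]]
    .groupby(["col1", "col2"])
    .agg(["mean", "count"])
)
print(result)
```

The result has two-level column labels, for example ("col3", "mean") and ("col3", "count").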
Swiss Army Knife: GroupBy.describe
Returns count, mean, std, and other useful statistics per group.
To get specific statistics, just select them from the result.
describe works for multiple columns (change ['C'] to ['C', 'D'], or remove it altogether, and see what happens; the result is a dataframe with MultiIndexed columns).
You also get different statistics for string data.
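A sketch of these calls on a made-up frame. The grouping column A and the string column S are assumptions introduced for illustration; C and D stand for the numeric columns mentioned above.

```python
import pandas as pd

# Hypothetical data: "A" is the group key, "C" and "D" are numeric, "S" holds strings.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "b"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D": [10, 20, 30, 40, 50],
    "S": ["u", "u", "v", "w", "w"],
})

# All summary statistics for one numeric column.
full = df.groupby("A")["C"].describe()

# Only the statistics of interest.
picked = df.groupby("A")["C"].describe()[["count", "mean"]]

# Several columns at once: the result has two-level column labels.
multi = df.groupby("A")[["C", "D"]].describe()

# String data gets count, unique, top and freq instead.
text_stats = df.groupby("A")["S"].describe()
print(text_stats)
```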
For more information, see the documentation.
pandas >= 1.1: DataFrame.value_counts
DataFrame.value_counts is available from pandas 1.1 if you just want to capture the size of every group; it cuts out the GroupBy and is faster.
Minimal Example
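A minimal sketch, assuming col1 and col2 from the question as the grouping columns and invented values:

```python
import pandas as pd

# Invented values; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": ["x", "x", "y", "x", "y"],
    "col3": [1, 2, 3, 4, 5],
})

# Size of every (col1, col2) combination, largest first, as a Series.
counts = df.value_counts(subset=["col1", "col2"])

# The same information as a flat DataFrame.
table = counts.reset_index(name="counts")
print(table)
```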
Other Statistical Analysis Tools
If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statistical analysis, correlation, and regression tools.
To get multiple stats, collapse the index, and retain column names:
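One possible way to do this, sketched under assumptions: aggregate with a list of functions, join each two-level column label into a single string, then reset the index. The underscore separator and the data are illustrative choices.

```python
import pandas as pd

# Invented values; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "B"],
    "col2": ["x", "x", "y", "x", "y", "y"],
    "col3": [0.5, 1.5, 2.5, 2.0, 3.0, 5.0],
    "col4": [10, 20, 30, 40, 50, 60],
})

stats = df.groupby(["col1", "col2"]).agg(["mean", "count"])

# ("col3", "mean") becomes "col3_mean", and so on.
stats.columns = ["_".join(col) for col in stats.columns]

# Move the group keys back into ordinary columns.
stats = stats.reset_index()
print(stats)
```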
We can easily do it by using groupby and count. But, we should remember to use reset_index().
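A small sketch of why reset_index matters here, using invented data: without it the group keys live in the index, and with it they become ordinary columns again. Note that count only counts non-null values of the selected column.

```python
import pandas as pd

# Invented values; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": ["x", "x", "y", "x", "x"],
    "col3": [1, 2, 3, 4, 5],
})

# Group keys end up in a MultiIndex.
grouped = df.groupby(["col1", "col2"])["col3"].count()

# reset_index turns the keys back into columns and names the count column.
flat = grouped.reset_index(name="count")
print(flat)
```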
Please try this code.
I think that code will add a column called 'count it' that holds the row count of each group.
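One way to add such a column is groupby with transform, which broadcasts the group-level count back onto every row. This is a sketch of a swapped-in technique rather than the answer's own code; the column name count_it is an assumed spelling and the data are invented.

```python
import pandas as pd

# Invented values; only the column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": ["x", "x", "y", "x", "x"],
    "col3": [1, 2, 3, 4, 5],
})

# transform returns one value per original row, so it can be assigned directly.
# "count" counts non-null col3 values within each (col1, col2) group.
df["count_it"] = df.groupby(["col1", "col2"])["col3"].transform("count")
print(df)
```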
Create a group object and call methods like below example:
If you are familiar with the tidyverse R packages, here is a way to do it in Python:
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.
pivot_table with specific aggfuncs
For a dataframe of aggregate statistics, pivot_table can be used as well. It produces a table not too dissimilar from an Excel pivot table. The basic idea is to pass in the columns to be aggregated as values=, the grouper columns as index=, and whatever aggregator functions as aggfunc= (all of the optimized functions that are admissible for groupby.agg are OK).
One advantage of pivot_table over groupby.agg is that for multiple columns it produces a single size column, whereas groupby.agg creates a size column for each column (all except one are redundant).
Use named aggregation for custom column names
For custom column names, instead of multiple rename calls, use named aggregation from the beginning. As the docs describe it, the keywords passed to agg are the output column names, and each value is a tuple of the column to select and the aggregation to apply to it.
As an example, to produce an aggregate dataframe where each of col3, col4 and col5 has its mean and count computed, named aggregation can be used; it does the column renaming step as part of groupby.agg.
Another use case of named aggregation is when each column needs a different aggregator function. For example, the mean of col3, the median of col4 and the min of col5 can each be computed under a custom column name in a single agg call.
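A minimal sketch of both named-aggregation cases. The data and the output names (col3_mean, mean_col3 and so on) are assumptions chosen for illustration:

```python
import pandas as pd

# Invented values; column names follow the text above.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1.0, 3.0, 5.0, 7.0],
    "col4": [10, 20, 30, 50],
    "col5": [4, 2, 9, 8],
})

gb = df.groupby(["col1", "col2"])

# Case 1: mean and count for each of col3, col4 and col5.
# Each keyword is an output name; each value is (column, function).
both = gb.agg(**{
    f"{col}_{func}": (col, func)
    for col in ["col3", "col4", "col5"]
    for func in ["mean", "count"]
})

# Case 2: a different function per column, each under a custom name.
mixed = gb.agg(
    mean_col3=("col3", "mean"),
    median_col4=("col4", "median"),
    min_col5=("col5", "min"),
)
print(mixed)
```

In both cases the result has a single level of column labels, so no renaming or flattening step is needed afterwards.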
city and gender are the grouping columns.
age and grade are the dependent columns (the ones I calculate the mean of).
The name of the mean column would be average.
The name of the count column would be freq.
SQL-style aggregation functions are supported by "named aggregation", with as_index=False to get the grouping column back. This uses column names and function names as strings. The SQL equivalent is a SELECT with aggregate functions and a GROUP BY clause.
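A sketch with a hypothetical table (the column names dept and salary and the output names n, top and low are invented for illustration), with the rough SQL counterpart shown as a comment:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "dept": ["a", "a", "b", "b", "b"],
    "salary": [10, 20, 30, 40, 50],
})

# Roughly equivalent SQL:
#   SELECT dept, COUNT(salary) AS n, MAX(salary) AS top, MIN(salary) AS low
#   FROM df GROUP BY dept
out = df.groupby("dept", as_index=False).agg(
    n=("salary", "count"),
    top=("salary", "max"),
    low=("salary", "min"),
)
print(out)
```

With as_index=False the grouping column dept stays an ordinary column, so the result already has the shape of the SQL output.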
See the pandas docs on common aggregation functions (count, max, min, first, etc.).
Another alternative:
Output:
I think the simplest approach that allows you to:
a) easily choose different aggregation functions for each column,
b) requires no later renaming or joining of columns,
is using Named aggregation:
Example:
Considering the same example dataframe as in Pedro's answer:
This approach yields the same results: