- How can I perform aggregation with Pandas?
- No DataFrame after aggregation! What happened?
- How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?
- How can I aggregate counts?
- How can I create a new column filled by aggregated values?
I've seen these recurring questions asking about various facets of the pandas aggregate functionality.
Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts.
The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next instalment in a series of helpful user-guides:
Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!
Question 1
How can I perform aggregation with Pandas?
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has fewer rows than, or the same number of rows as, the original.
Some common aggregating functions are tabulated below:
Aggregation by filtered columns and Cython-implemented functions:
An aggregate function is applied to all columns that are not specified in the `groupby` function, which here groups by the `A` and `B` columns.
You can also specify only some of the columns to aggregate, in a list after the `groupby` function.
The same results come from using the function `DataFrameGroupBy.agg`.
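The three variants just described can be sketched roughly as follows. The DataFrame here is made up for illustration (the post's original data is not reproduced), and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, not the DataFrame from the original post.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Aggregate every column that is not a grouping key (here C and D).
all_cols = df.groupby(["A", "B"]).sum()

# Aggregate only the columns listed after groupby.
some_cols = df.groupby(["A", "B"])[["C"]].sum()

# The same result through agg, passing the function by name.
via_agg = df.groupby(["A", "B"])[["C"]].agg("sum")

print(all_cols)
print(some_cols.equals(via_agg))
```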
For multiple functions applied to one column, use a list of `tuple`s giving the names of the new columns and the aggregate functions.
If you want to pass multiple functions, you can pass a `list` of `tuple`s. You then get a `MultiIndex` in the columns.
To convert back to ordinary columns, flatten the `MultiIndex` using `map` with `join`.
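A minimal sketch of the tuple form and the flattening step, again on made-up data. The new column names (`total`, `average`) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# One column, several functions: (new column name, function) tuples.
named = df.groupby("A")["C"].agg([("total", "sum"), ("average", "mean")])

# Several columns: the result has a two-level MultiIndex in the columns.
multi = df.groupby("A")[["C", "D"]].agg([("total", "sum"), ("average", "mean")])

# Flatten the MultiIndex by joining the two levels with an underscore.
flat = multi.copy()
flat.columns = flat.columns.map("_".join)

print(named)
print(flat.columns.tolist())
```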
Another solution is to pass a list of aggregate functions, then flatten the `MultiIndex`, and use `str.replace` to get other column names.
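As a sketch of that alternative: pass plain function names, flatten, and then rename with `str.replace`. The replacement of `mean` by `average` is only an example of a rename, not something taken from the original post.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# A list of function names gives MultiIndex columns such as ("C", "sum").
stats = df.groupby("A")[["C", "D"]].agg(["sum", "mean"])

# Flatten, then adjust the generated names with a string replacement.
stats.columns = stats.columns.map("_".join).str.replace("mean", "average")

print(stats.columns.tolist())
```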
If you want to specify the aggregate function for each column separately, pass a dictionary.
You can pass a custom function too.
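Both options might look like this. The data and the helper `value_range` are hypothetical and exist only for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
    "D": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A dictionary maps each column to its own aggregate function.
per_col = df.groupby("A").agg({"C": "sum", "D": "mean"})


def value_range(s):
    """Hypothetical custom aggregation: spread between max and min."""
    return s.max() - s.min()


# A custom function receives each group's values as a Series.
custom = df.groupby("A")["C"].agg(value_range)

print(per_col)
print(custom)
```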
Question 2
No DataFrame after aggregation! What happened?
Aggregation by two or more columns:
First check the `Index` and `type` of the Pandas object.
There are two solutions for turning a `MultiIndex` `Series` into columns: pass `as_index=False` to `groupby`, or use `Series.reset_index`.
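A short sketch of the check and of both fixes, using made-up data. Grouping by a single column behaves the same way, except that the index is a plain `Index`.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
    "C": [1, 2, 3, 4, 5],
})

# Selecting a single column and aggregating returns a Series, not a DataFrame.
s = df.groupby(["A", "B"])["C"].sum()
print(type(s))
print(s.index)

# Solution 1: keep the grouping keys as columns.
flat1 = df.groupby(["A", "B"], as_index=False)["C"].sum()

# Solution 2: move the index levels back into columns afterwards.
flat2 = s.reset_index()

print(flat1)
```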
If you group by one column, you get a `Series` with an `Index`, and the solution is the same as for the `MultiIndex` `Series`.
Question 3
How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?
Instead of an aggregation function, it is possible to pass `list`, `tuple`, or `set` to convert the column.
An alternative is to use `GroupBy.apply`.
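For instance, on a made-up string column (the data below is an assumption, not the post's original):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
})

# Pass the built-in container types in place of an aggregation function.
lists = df.groupby("A")["B"].agg(list)
tuples = df.groupby("A")["B"].agg(tuple)
sets = df.groupby("A")["B"].agg(set)

# The alternative with apply.
via_apply = df.groupby("A")["B"].apply(list)

print(lists)
```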
To convert to strings with a separator, use `.join`, but only if it is a string column.
If it is a numeric column, use a lambda function with `astype` to convert to `string`s.
Another solution is to convert to strings before `groupby`.
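A sketch of the three cases, with a comma as the separator. The separator and the data are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
    "C": [1, 2, 3, 4, 5],
})

# String column: join directly.
joined = df.groupby("A")["B"].agg(",".join)

# Numeric column: cast each group to strings inside a lambda.
nums = df.groupby("A")["C"].agg(lambda x: ",".join(x.astype(str)))

# Or cast once, before grouping.
nums_before = df.assign(C=df["C"].astype(str)).groupby("A")["C"].agg(",".join)

print(joined)
print(nums)
```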
To convert all columns, don't pass a list of column(s) after `groupby`.
There isn't any column `D` in the result because of the automatic exclusion of 'nuisance' columns, which means all numeric columns are excluded.
So it's necessary to convert all columns to strings first, and then all columns are returned.
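Converting everything up front might look like the sketch below (made-up data). As far as I'm aware, pandas 2.0 and later no longer drop such columns silently and raise an error instead, so the conversion is needed either way; treat that version detail as an assumption to check against your installation.

```python
import pandas as pd

# Hypothetical sample data with string and numeric columns.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# Cast every column to string first, then join each column per group.
all_joined = df.astype(str).groupby("A").agg(",".join)

print(all_joined)
```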
Question 4
How can I aggregate counts?
The function `GroupBy.size` returns the size of each group.
The function `GroupBy.count` excludes missing values.
`GroupBy.count` should be used on multiple columns to count the non-missing values in each.
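The difference shows up as soon as a column has missing values. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample data; column C has missing values.
nan = float("nan")
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "b"],
    "C": [1.0, nan, 3.0, nan, nan],
    "D": [1, 2, 3, 4, 5],
})

# size counts rows per group, missing values included.
sizes = df.groupby("A").size()

# count skips missing values in the selected column.
counts = df.groupby("A")["C"].count()

# count over several columns gives the non-missing count for each column.
multi = df.groupby("A").count()

print(sizes)
print(multi)
```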
A related function is `Series.value_counts`. It returns an object containing counts of unique values in descending order, so that the first element is the most frequently occurring element. It excludes `NaN` values by default.
If you want the same output as from `groupby` + `size`, add `Series.sort_index`.
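A quick comparison on a made-up column that contains one missing value:

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"A": ["a", "b", "a", "c", "a", "b", None]})

# Counts in descending order; the missing value is left out by default.
vc = df["A"].value_counts()

# Sorting by index gives the same order as groupby + size.
sorted_vc = vc.sort_index()
by_size = df.groupby("A").size()

print(vc)
print(sorted_vc.tolist() == by_size.tolist())
```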
Question 5
How can I create a new column filled by aggregated values?
The method `GroupBy.transform` returns an object that is indexed the same (same size) as the one being grouped.
See the Pandas documentation for more information.
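Because the result keeps the original index, it can be assigned straight back as a new column. A minimal sketch, where the data and the column name `C_sum` are assumptions:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
})

# Every row receives the sum of C for its own group.
df["C_sum"] = df.groupby("A")["C"].transform("sum")

print(df)
```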
If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:
Let us first create a Pandas DataFrame.
Here is what the table we created looks like:
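The answer's original DataFrame is not reproduced here. A stand-in with the column names used later in this answer (`key1`, `value1`, `value2`) could be built as below; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in data; only the column names come from the answer.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "b"],
    "value1": [10, 20, 20, 5, 7],
    "value2": [1, 3, 2, 9, 8],
})

print(df)
```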
1. Aggregating With Row Reduction Similar to SQL `GROUP BY`
1.1 If Pandas version `>=0.25`
Check your Pandas version by running `print(pd.__version__)`. If your Pandas version is 0.25 or above, then the following code will work:
The resulting data table will look like this:
The SQL equivalent of this is:
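For section 1.1, a sketch using named aggregation, which pandas introduced in version 0.25. The choice of aggregations, the output column names, and the data are assumptions; the SQL it is meant to mirror is written as a comment, with `df` standing in for the table name.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "b"],
    "value1": [10, 20, 20, 5, 7],
    "value2": [1, 3, 2, 9, 8],
})

# SELECT key1,
#        SUM(value1) AS sum_value1,
#        AVG(value2) AS mean_value2
# FROM df
# GROUP BY key1
result = (
    df.groupby("key1")
      .agg(sum_value1=("value1", "sum"), mean_value2=("value2", "mean"))
      .reset_index()
)

print(result)
```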
1.2 If Pandas version `<0.25`
If your Pandas version is older than 0.25 then running the above code will give you the following error:
Now, to do the aggregation for both `value1` and `value2`, you will run this code:
The resulting table will look like this:
Renaming the columns needs to be done separately using the code below:
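One way the two steps for older versions could look, as a sketch: a dictionary-based `agg` followed by a separate rename. The aggregations and new names are assumptions, not the answer's original code.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "b"],
    "value1": [10, 20, 20, 5, 7],
    "value2": [1, 3, 2, 9, 8],
})

# Step 1: aggregate with a dictionary; output columns keep the input names.
result = df.groupby("key1").agg({"value1": "sum", "value2": "mean"}).reset_index()

# Step 2: rename the aggregated columns separately.
result = result.rename(columns={"value1": "sum_value1", "value2": "mean_value2"})

print(result)
```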
2. Create a Column Without Reduction in Rows (`EXCEL - SUMIF, COUNTIF`)
If you want to do a SUMIF, COUNTIF, etc., as you would in Excel, where there is no reduction in rows, then you need to do this instead.
The resulting data frame will look like this with the same number of rows as the original:
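A sketch of the SUMIF/COUNTIF style using `transform`, on stand-in data. The new column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "b"],
    "value1": [10, 20, 20, 5, 7],
})

grouped = df.groupby("key1")["value1"]

# Like SUMIF: each row gets the total of value1 for its key1.
df["sum_value1_by_key1"] = grouped.transform("sum")

# Like COUNTIF: each row gets the number of rows sharing its key1.
df["count_by_key1"] = grouped.transform("count")

print(df)
```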
3. Creating a RANK Column `ROW_NUMBER() OVER (PARTITION BY ORDER BY)`
Finally, there might be cases where you want to create a rank column, which is the SQL equivalent of `ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC)`.
Here is how you do that.
Note: we make the code multi-line by adding `\` at the end of each line.
Here is what the resulting data frame looks like:
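One possible way to get that row number is to sort first and then number the rows within each group with `cumcount`. This is a sketch on stand-in data, with the column name `rn` as an assumption; it wraps the chain in parentheses rather than using the trailing backslashes mentioned in the note.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "b"],
    "value1": [10, 20, 20, 5, 7],
    "value2": [1, 3, 2, 9, 8],
})

# ORDER BY value1 DESC, value2 ASC, then number rows within each key1.
# cumcount starts at 0, so add 1; assignment aligns on the original index.
df["rn"] = (
    df.sort_values(["value1", "value2"], ascending=[False, True])
      .groupby("key1")
      .cumcount()
    + 1
)

print(df)
```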
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
Other aggregating operators:
- `mean()`: Compute mean of groups
- `sum()`: Compute sum of group values
- `size()`: Compute group sizes
- `count()`: Compute count of group
- `std()`: Standard deviation of groups
- `var()`: Compute variance of groups
- `sem()`: Standard error of the mean of groups
- `describe()`: Generates descriptive statistics
- `first()`: Compute first of group values
- `last()`: Compute last of group values
- `nth()`: Take nth value, or a subset if n is a list
- `min()`: Compute min of group values
- `max()`: Compute max of group values