How to group dataframe rows into a list in pandas groupby
Given a dataframe, I want to group by the first column and get the second column as lists in rows, so that a dataframe like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
becomes
A [1,2]
B [5,5,4]
C [6]
How do I do this?
Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1). This solution can also deal with multi-indices:
However, this is not heavily tested, so use it with caution.
If performance is important, go down to the numpy level:
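The answer's own code did not survive on this page. As a rough sketch of what a numpy-level version that accepts several key columns might look like (the function name group_to_lists and its parameters are hypothetical, and it assumes a non-empty frame without NaN keys):

```python
import numpy as np
import pandas as pd

def group_to_lists(df, key_cols, value_col):
    """Collect value_col into one list per unique combination of key_cols.

    key_cols must be a list of column names (hypothetical helper, not the
    original answer's code).
    """
    # Sort so that rows of the same group become adjacent
    ordered = df.sort_values(key_cols, kind="mergesort")
    keys = ordered[key_cols].to_numpy()
    values = ordered[value_col].to_numpy()
    # A group starts at row 0 and wherever any key differs from the previous row
    changed = (keys[1:] != keys[:-1]).any(axis=1)
    starts = np.flatnonzero(np.r_[True, changed])
    # Cut the value array at every group start
    chunks = np.split(values, starts[1:])
    out = ordered[key_cols].iloc[starts].reset_index(drop=True)
    out[value_col] = pd.Series([c.tolist() for c in chunks], index=out.index)
    return out

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})
result = group_to_lists(df, ["a"], "b")

# Two key columns work the same way
df2 = pd.DataFrame({"k1": [1, 1, 2, 1], "k2": ["x", "x", "y", "z"], "v": [10, 20, 30, 40]})
result2 = group_to_lists(df2, ["k1", "k2"], "v")
```

Detecting group boundaries by comparing neighbouring rows avoids calling np.unique on multi-column keys, at the cost of one full sort.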
Tests:
Results:
for the random seed 0 one would get:
Let us use df.groupby with list and the Series constructor:
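A minimal sketch of the groupby-plus-Series-constructor idea, assuming the sample frame from the question (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# Iterating over a groupby yields (key, sub-frame) pairs;
# the dict keys become the index of the resulting Series
lists = pd.Series({key: group["b"].tolist() for key, group in df.groupby("a")})
```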
Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (containing a single column), pd.Series.to_list would work and can be considered more efficient, unless considering other frameworks, e.g.
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 seconds, and a lambda function, which takes about 20.6 seconds.
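The pd.Series.to_list variant mentioned above could look like the following sketch on the sample frame from the question. It only illustrates the call and makes no claim about the quoted timings:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# Pass the unbound method; pandas calls it once per group
lists = df.groupby("a")["b"].agg(pd.Series.to_list)
```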
Just to add to the previous answers: in my case, I wanted the list along with other functions like min and max. The way to do that is:
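One way to get the list together with min and max is named aggregation on the selected column. This is a sketch rather than the answer's original code, and the output column names b_list, b_min and b_max are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# Each keyword becomes an output column; the value is the aggregation to apply
summary = df.groupby("a")["b"].agg(b_list=list, b_min="min", b_max="max")
```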
Here I have grouped elements with "|" as a separator
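A possible sketch of joining each group's elements with "|", assuming the sample frame from the question (the values are cast to strings first because str.join only accepts strings):

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# Convert each group's values to str, then join them with the separator
joined = df.groupby("a")["b"].agg(lambda s: "|".join(s.astype(str)))
```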
This answer is based on @EdChum's comment on his answer. The comment is this -
Let's first create a dataframe with 500k categories in the first column and a total of 20 million rows, as mentioned in the question.
The above code takes 2 minutes for 20 million rows and 500k categories in the first column.
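For experimenting, a scaled-down frame of the same shape can be built as follows. This is only a sketch: make_frame, its parameters and the sizes are hypothetical, and it is not the code that was timed above.

```python
import numpy as np
import pandas as pd

def make_frame(n_rows, n_categories, seed=0):
    """Build a frame with n_categories distinct keys in 'a' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "a": rng.integers(0, n_categories, size=n_rows),
        "b": rng.integers(0, 100, size=n_rows),
    })

# Much smaller than 20 million rows / 500k categories, so it runs quickly
small = make_frame(n_rows=20_000, n_categories=500)
lists = small.groupby("a")["b"].apply(list)
```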
You can do this using groupby to group on the column of interest and then apply list to every group:
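On the sample frame from the question, that could look like this sketch (the reset_index step is optional and only turns the result back into a two-column frame):

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# Series indexed by 'a', each value a Python list of that group's 'b' entries
lists = df.groupby("a")["b"].apply(list)

# Optional: back to a regular frame with columns 'a' and 'b'
as_frame = lists.reset_index(name="b")
```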
A handy way to achieve this would be:
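The original snippet is missing here. One plausible form, sketched on the question's sample frame, passes a dict that maps the column to a custom aggregation:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# The dict maps a column name to the function applied to each group of that column
result = df.groupby("a").agg({"b": lambda x: list(x)})
```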
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important go down to numpy level:
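A sketch of the numpy-level idea for a single key column, assuming keys without NaN (lists_by_key is a hypothetical name, not necessarily the answer's original function): sort once, find where each key first appears, and split the value array there.

```python
import numpy as np
import pandas as pd

def lists_by_key(df, key, value):
    """Group `value` into lists per unique `key` using numpy splitting."""
    # A stable sort keeps the original order of values inside each group
    ordered = df.sort_values(key, kind="mergesort")
    keys = ordered[key].to_numpy()
    values = ordered[value].to_numpy()
    # Index of the first occurrence of each unique key in the sorted array
    unique_keys, first_idx = np.unique(keys, return_index=True)
    chunks = np.split(values, first_idx[1:])
    return pd.DataFrame({key: unique_keys, value: [c.tolist() for c in chunks]})

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})
result = lists_by_key(df, "a", "b")
```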
Tests:
To solve this for several columns of a dataframe:
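A sketch for several columns, assuming a frame with an extra value column c added to the question's data (the c values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": list("AABBBC"),
    "b": [1, 2, 5, 5, 4, 6],
    "c": [3, 3, 3, 4, 4, 4],  # illustrative extra column
})

# The lambda is applied to every non-key column of every group
result = df.groupby("a").agg(lambda x: list(x))
```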
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
To aggregate multiple columns as lists, use any of the following:
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,
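The recipes themselves are missing from this page. The following sketch shows what such recipes could look like for both cases, using a frame with an illustrative extra column c:

```python
import pandas as pd

df = pd.DataFrame({
    "a": list("AABBBC"),
    "b": [1, 2, 5, 5, 4, 6],
    "c": [3, 3, 3, 4, 4, 4],  # illustrative extra column
})

# Multiple columns as lists: two equivalent spellings
multi_builtin = df.groupby("a").agg(list)
multi_method = df.groupby("a").agg(pd.Series.tolist)

# Single column: selecting ['b'] yields a SeriesGroupBy, then aggregate it
single_builtin = df.groupby("a")["b"].agg(list)
single_method = df.groupby("a")["b"].agg(pd.Series.tolist)
```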
It is time to use agg instead of apply.
If you want multiple columns stacked into lists, the result is a pd.DataFrame.
If you want a single column as a list, the result is a pd.Series.
Note that producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so use it only in the multi-column case.
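The difference in result type can be seen in a small sketch (the frame and its column c are illustrative; the timing claim above is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "a": list("AABBBC"),
    "b": [1, 2, 5, 5, 4, 6],
    "c": [3, 3, 3, 4, 4, 4],  # illustrative extra column
})

# A list of column names keeps a DataFrameGroupBy, so the result is a DataFrame
as_frame = df.groupby("a")[["b", "c"]].agg(list)

# A single column name gives a SeriesGroupBy, so the result is a Series
as_series = df.groupby("a")["b"].agg(list)
```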
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
which gives an index-wise description of the groups.
To get the elements of single groups, you can do, for instance:
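Both steps can be sketched on the question's sample frame, using the groups attribute for the index-wise description and get_group for a single group:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

grouped = df.groupby("a")

# Maps each group key to the row labels belonging to that group
index_by_group = grouped.groups

# Sub-frame holding only the rows of group 'B'
group_b = grouped.get_group("B")
b_values = group_b["b"].tolist()
```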
Just a supplement. pandas.pivot_table is much more universal and seems more convenient:
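A sketch of the pivot_table route on the question's sample frame, assuming the built-in list is passed as the aggregation function:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# index picks the grouping column, values the column to collect,
# aggfunc the function applied to each group
table = pd.pivot_table(df, index="a", values="b", aggfunc=list)
```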
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.
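With the tuple syntax, each keyword names an output column and the tuple gives the source column and the function. A sketch on the question's sample frame, where the output name b_list is made up:

```python
import pandas as pd

df = pd.DataFrame({"a": list("AABBBC"), "b": [1, 2, 5, 5, 4, 6]})

# output_name=(source_column, aggregation_function)
result = df.groupby("a").agg(b_list=("b", list))
```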
If looking for a unique list while grouping multiple columns this could probably help:
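The original snippet is not shown. One reading is that several value columns should each be reduced to a list of unique values per group. The sketch below deduplicates with dict.fromkeys, which keeps first-seen order; that is a choice made for this example, not necessarily what the answer used, and the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "a": list("AABBBC"),
    "b": [1, 1, 5, 5, 4, 6],
    "c": [3, 3, 3, 4, 4, 4],
})

# dict.fromkeys drops duplicates while preserving the order of first appearance
unique_lists = df.groupby("a").agg(lambda s: list(dict.fromkeys(s))).reset_index()
```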