Get the rows which have the max value in groups using groupby
How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Summarizing, there are many ways, but which one is faster?
And the winner is...
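The benchmark code from this answer did not survive the page extraction, so which method "won" cannot be recovered here. As a hedged sketch, the contenders such a comparison typically covers, run on the Example 1 frame, might look like:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# Contender 1: boolean mask built with transform('max')
out1 = df[df['count'] == df.groupby(['Sp', 'Mt'])['count'].transform('max')]

# Contender 2: sort descending, keep the first row per group
out2 = (df.sort_values('count', ascending=False)
          .drop_duplicates(['Sp', 'Mt'])
          .sort_index())

# Contender 3: idxmax (note: returns only one row per group on ties)
out3 = df.loc[df.groupby(['Sp', 'Mt'])['count'].idxmax()]
```

All three agree on Example 1, where no group has tied maxima; they differ on ties (Example 2), which only the transform-based mask keeps in full.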
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index. In this case, we want n=1 for the max and keep='all' to include duplicate maxes. Note: we slice the last (-1) element of our index, since our index in this case consists of tuples (e.g. ('MM1', 'S1', 0)).
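The code block for this answer was stripped out; a sketch of what it describes, assuming the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# n=1 for the max, keep='all' to include duplicate maxes
nl = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')

# the index entries are tuples like ('MM1', 'S1', 0); the last (-1)
# element is the original row label
out = df.loc[[t[-1] for t in nl.index]]
```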
I've been using this functional style for many group operations: .reset_index(drop=True) gets you back to the original index by dropping the group-index.
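The exact code of this answer can't be recovered from the scrape; one plausible shape of the functional style it describes, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# filter each group down to its max-count rows (ties included), then
# drop the group-index to get a fresh 0..n-1 index
out = (df.groupby(['Sp', 'Mt'])
         .apply(lambda g: g[g['count'] == g['count'].max()])
         .reset_index(drop=True))
```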
Realizing that "applying" nlargest to the groupby object works just as fine. Additional advantage: it can also fetch the top n values if required.
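The answer's code block is missing; a sketch of applying nlargest per group, using the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# nlargest applied per group via apply
out = (df.groupby(['Sp', 'Mt'])
         .apply(lambda g: g.nlargest(1, 'count', keep='all'))
         .reset_index(drop=True))

# additional advantage: top n values if required (e.g. n=2)
top2 = (df.groupby(['Sp', 'Mt'])
          .apply(lambda g: g.nlargest(2, 'count'))
          .reset_index(drop=True))
```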
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
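The code was lost in extraction; a sketch of grabbing the last element per group after an ascending sort (on the Example 1 frame, an assumption on my part):

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# the sort order is preserved inside the groupby, so after an
# ascending sort the last row of each group holds the max count
out = (df.sort_values('count')
         .groupby(['Sp', 'Mt'])
         .last()
         .reset_index())
```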
Many of these are great answers, but to help show scalability, testing on 2.8 million rows with varying amounts of duplicates reveals some startling differences. The fastest for my data was sort-then-drop-duplicates (dropping all but the last was marginally faster than sorting descending and dropping all but the first).
As you can see, sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets this may not matter much, but as you can see, it can significantly impact large datasets.
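The benchmarked code itself is missing from the scrape; a sketch of the winning sort-then-drop-all-but-last approach, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# sort ascending, then drop all but the last duplicate per group;
# sort_index restores the original row order afterwards
out = (df.sort_values('count')
         .drop_duplicates(['Sp', 'Mt'], keep='last')
         .sort_index())
```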
Another approach using rank
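This answer's code did not survive; a sketch of a rank-based filter on the Example 1 frame (method='min' is my choice here, so tied maxima are kept):

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# rank counts within each group, descending: rank 1 is the group max;
# method='min' gives all tied maxima rank 1 (the default 'average' would not)
ranks = df.groupby(['Sp', 'Mt'])['count'].rank(method='min', ascending=False)
out = df[ranks == 1]
```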
If the df index isn't unique, you may need this step df.reset_index(inplace=True) first.
Firstly, we can get the max count for each group like this:
To get the indices of the original DF you can do:
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
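The answer's code blocks were lost in extraction. A sketch of the two steps it describes, on the Example 1 frame (the "Hail Mary" variant is left out, since its exact code can't be recovered):

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# first, the max count for each group
gmax = df.groupby(['Sp', 'Mt'])['count'].max()

# then, the indices of the original DF: broadcast the group max back
# onto every row with transform and compare (ties all come back True)
idx = df.groupby(['Sp', 'Mt'])['count'].transform('max') == df['count']
out = df[idx]
```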
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
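The one-liner itself is missing from the scrape; a sketch matching the description, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# sort by count descending, then keep the first (largest) row per group
out = df.sort_values('count', ascending=False).drop_duplicates(['Sp', 'Mt'])
```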
An easy solution would be to apply the idxmax() function to get the indices of rows with max values. This would filter out all the rows with the max value in the group.
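The code block was stripped; a sketch of the idxmax approach on the Example 1 frame (note idxmax returns only the first row of a tie):

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# idxmax gives the row label of the max count in each group
out = df.loc[df.groupby(['Sp', 'Mt'])['count'].idxmax()]
```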
You may not need to do groupby(), but use both sort_values + drop_duplicates. Almost the same logic also works using tail.
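Both variants described above lost their code in extraction; sketches of each, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# sort_values + drop_duplicates, no groupby needed
out1 = df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')

# almost the same logic using tail: the last row of each sorted group
out2 = df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
```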
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
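The faster alternative this answer gave cannot be recovered from the scrape. One possibility, my own hedged guess at a merge-based approach that avoids the slow per-row apply/transform work, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# compute the per-group maxima once, then inner-merge them back:
# only rows matching their group's max (ties included) survive
gmax = df.groupby(['Sp', 'Mt'], as_index=False)['count'].max()
out = df.merge(gmax, on=['Sp', 'Mt', 'count'])
```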
Use the groupby and idxmax methods:
transfer col date to datetime;
get the index of the max of column date, after groupby ad_id;
get the wanted data.
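The steps above lost their code blocks. A sketch with a hypothetical frame: the ad_id and date names come from the answer's text, but the values below are invented for illustration.

```python
import pandas as pd

# hypothetical data matching the ad_id / date example in this answer
ads = pd.DataFrame({
    'ad_id': [1, 1, 2, 2],
    'date': ['2018-01-01', '2018-03-01', '2018-02-01', '2018-01-15'],
})

# 1) transfer col date to datetime
ads['date'] = pd.to_datetime(ads['date'])

# 2) get the index of the max of column date, after groupby ad_id
idx = ads.groupby('ad_id')['date'].idxmax()

# 3) get the wanted data
out = ads.loc[idx]
```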
For me, the easiest solution would be to keep the value when count is equal to the maximum. Therefore, the following one-line command is enough:
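The one-line command itself is missing; a sketch matching the description, on the Example 1 frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# keep a row when its count equals its group's maximum
out = df[df['count'] == df.groupby(['Sp', 'Mt'])['count'].transform('max')]
```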