Aggregating an ordered dataframe based on conditionally accumulated previous values within groups (pandas)
I've got an ordered dataframe which I'm trying to aggregate by some grouping columns and based on accumulated previous values of other columns.
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID2', 'ID2'],
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1', 'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})
df
    ID   Group  Value1  Value2
0  ID1  Group1       0       1
1  ID1  Group2       1       2
2  ID1  Group2       1       3
3  ID1  Group2       1       4
4  ID1  Group1       1       5
5  ID2  Group2       1       4
6  ID2  Group2       0       3
7  ID2  Group2       0       2
8  ID2  Group1       0       2
I'd like to aggregate in three different ways using Value1 and Value2, grouped by ID and Group.
df is already ordered (based on date, ID and Group).
Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)
Output2: sum Value2 over the previous rows, by ID and Group (including the row itself)
Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)
Here's my desired output:
    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5        2
3  ID1  Group2       1       4        2        9        5
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7        4
7  ID2  Group2       0       2        1        9        4
8  ID2  Group1       0       2        0        2      NaN
To make sure it's clear what I'm trying to do, let's look at the output at index 3 (the fourth row):
3  ID1  Group2       1       4        2        9        5
Output1 = 2, because there are two rows above it in ID1/Group2 that have Value1 = 1.
Output2 = 9, because the sum of Value2 over all rows above it in ID1/Group2, including the row itself, is 2 + 3 + 4 = 9.
Output3 = 5, because there are two previous rows in ID1/Group2 that have Value1 = 1, so the result is the sum of their Value2 (2 + 3 = 5).
I should add that I'm working with a large dataset, so I'm looking for an efficient, high-performance solution.
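For checking any faster solution against the definitions, a slow row-by-row reading of the three outputs can be sketched as follows. This is only an illustration: the helper name reference_outputs is hypothetical and not part of the original post, and the loop is far too slow for a large dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1'] * 5 + ['ID2'] * 4,
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

def reference_outputs(frame):
    """Hypothetical helper: literal, row-by-row reading of the definitions."""
    rows = []
    for pos in range(len(frame)):
        cur = frame.iloc[pos]
        # Earlier rows that share the current row's ID and Group
        prev = frame.iloc[:pos]
        prev = prev[(prev['ID'] == cur['ID']) & (prev['Group'] == cur['Group'])]
        # Of those, the rows where Value1 is 1
        hits = prev[prev['Value1'] == 1]
        rows.append({
            'Output1': len(hits),                            # excludes current row
            'Output2': prev['Value2'].sum() + cur['Value2'], # includes current row
            'Output3': hits['Value2'].sum() if len(hits) else np.nan,
        })
    return pd.DataFrame(rows, index=frame.index)

expected = reference_outputs(df)
print(expected)
```

Comparing a vectorized result against expected on a small sample is a cheap way to confirm that the exclude/include rules were implemented as intended.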
Comments (2)
Solution for Output1 and Output2
groupby + cumsum
Solution for Output3
Mask the values in column Value2 where the corresponding value in column Value1 is 0. Then group the masked column and use cumsum to calculate the cumulative sum. To exclude the current row, subtract the masked column from the cumulative sum.
Update as per the new requirements in the comments:
Result
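The approach described in this answer (grouped cumsum, plus a masked cumsum with the current row subtracted) might look roughly like the sketch below. It is not the answerer's original code: the variable names keys, masked and exclusive are illustrative, Value1 is assumed to contain only 0 and 1, and the last step, which turns the result into NaN when no earlier row in the group has Value1 = 1, is an assumption made to match the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1'] * 5 + ['ID2'] * 4,
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

# Grouping keys as Series so they can be reused on any aligned Series
keys = [df['ID'], df['Group']]

# Output1: running count of 1s per group, minus the current row's own value
# (assumes Value1 only holds 0 or 1)
df['Output1'] = df['Value1'].groupby(keys).cumsum() - df['Value1']

# Output2: running sum of Value2 per group, current row included
df['Output2'] = df['Value2'].groupby(keys).cumsum()

# Output3: replace Value2 with 0 where Value1 is not 1, take the grouped
# running sum, then subtract the current row to make it exclusive
masked = df['Value2'].where(df['Value1'].eq(1), 0)
exclusive = masked.groupby(keys).cumsum() - masked

# Assumption: report NaN when no earlier row in the group had Value1 == 1
df['Output3'] = exclusive.where(df['Output1'] > 0)

print(df)
```

Every step here is a vectorized grouped operation with no Python-level loop over rows, which is the property the question asks for on a large dataset.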
You can add a masked column for the third output and compute a grouped, shifted cumsum:
Or, as a one-liner:
Output:
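A grouped, shifted cumsum for the third output could be sketched as below. This is an illustration of the idea rather than the answerer's original code; the names keys, masked and running are made up here, and the forward-fill step is an addition needed so that rows with Value1 = 0 carry the last running total forward instead of staying NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1'] * 5 + ['ID2'] * 4,
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

keys = [df['ID'], df['Group']]

# Masked column: keep Value2 only where Value1 is 1, NaN elsewhere
masked = df['Value2'].where(df['Value1'].eq(1))

# Grouped running sum; NaN rows are skipped by cumsum but remain NaN
running = masked.groupby(keys).cumsum()

# Carry the last running total forward within each group
running = running.groupby(keys).ffill()

# Shift by one row within each group so the current row is excluded
df['Output3'] = running.groupby(keys).shift()

print(df)
```

Because the mask uses NaN rather than 0, a row whose group has no earlier Value1 = 1 row ends up as NaN, which matches the desired output in the question.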