Aggregating an ordered DataFrame based on accumulated previous values, conditioned within groups (pandas)

Published 2025-01-16 20:08:34


I've got an ordered dataframe which I'm trying to aggregate by some grouping columns and based on accumulated previous values of other columns.

df = pd.DataFrame({'ID':['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2']
            , 'Group':['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1']
            , 'Value1':[0,1,1,1,1,1,0,0,0]
            , 'Value2':[1,2,3,4,5,4,3,2,2]})


df
        ID  Group   Value1  Value2
    0   ID1 Group1    0       1
    1   ID1 Group2    1       2
    2   ID1 Group2    1       3
    3   ID1 Group2    1       4
    4   ID1 Group1    1       5
    5   ID2 Group2    1       4
    6   ID2 Group2    0       3
    7   ID2 Group2    0       2
    8   ID2 Group1    0       2

I'd like to aggregate three different ways using Value1 and Value2, grouped by ID and Group.
df is already ordered (based on date, ID and Group).

Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)

Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)

Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)

Here's my desired output:

    ID  Group   Value1  Value2  Output1 Output2 Output3
0   ID1 Group1    0       1        0      1       NaN
1   ID1 Group2    1       2        0      2       NaN
2   ID1 Group2    1       3        1      5        2
3   ID1 Group2    1       4        2      9        5
4   ID1 Group1    1       5        0      6       NaN 
5   ID2 Group2    1       4        0      4       NaN
6   ID2 Group2    0       3        1      7        4
7   ID2 Group2    0       2        1      9        4
8   ID2 Group1    0       2        0      2       NaN

To make sure it's clear what I'm trying to do, let's look at the output at index 3 (the fourth row):

3   ID1 Group2    1       4        2      9        5

Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.

Output2 = 9 because the sum of Value2 over all rows above it in ID1/Group2, including the row itself, is 2 + 3 + 4 = 9.

Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 values is 2 + 3 = 5.

I'd like to add that I'm working with a large dataset, so I'm looking for an efficient, high-performance solution.


Answer by 笑梦风尘 (2025-01-23 20:08:34)


Solution

  • For Output1 and Output2: we can use groupby + cumsum.
  • For Output3: the calculation is a little trickier. First mask the values in column Value2 where the corresponding value in column Value1 is 0, then group the masked column and compute its cumulative sum with cumsum. Finally, to exclude the current row, subtract the masked column from the cumulative sum.
g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']
df['Output2'] = g['Value2'].cumsum()

s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
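
A quick self-contained check of the snippet above (the sample df is re-created so it runs on its own). Note that this first version fills Output3 with 0, rather than NaN, on rows with no qualifying previous row; the update below handles that case.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1',
                             'Group2','Group2','Group2','Group1'],
                   'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
                   'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2]})

g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']   # previous 1s, excluding current row
df['Output2'] = g['Value2'].cumsum()                  # running sum, including current row

s = df['Value2'].mul(df['Value1'])                    # Value2 kept only where Value1 == 1
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
```

Output1 and Output2 match the desired output exactly; Output3 differs only in showing 0 instead of NaN where no previous row had Value1 = 1.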

Update as per the new requirements in the comments:

def transform(g):
    g['Output1'] = g['Value1'].cumsum() - g['Value1']
    g['Output2'] = g['Value2'].cumsum()

    cond = g['Value1'].eq(1)
    g['Output3'] = g['Value2'].mask(~cond).cumsum().shift().ffill()
    return g


df = df.groupby(['ID', 'Group'], group_keys=False).apply(transform)  # assign back: apply returns a new frame

Result

print(df)

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      4.0
8  ID2  Group1       0       2        0        2      NaN
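
Since the question asks for performance on a large dataset, here is a sketch (my addition, not part of the original answer) of a fully vectorized variant that reproduces the NaN behaviour of the update without a Python-level groupby.apply; every step is a built-in grouped operation:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1',
                             'Group2','Group2','Group2','Group1'],
                   'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
                   'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2]})

keys = [df['ID'], df['Group']]
g = df.groupby(['ID', 'Group'])

df['Output1'] = g['Value1'].cumsum() - df['Value1']
df['Output2'] = g['Value2'].cumsum()

# Mask Value2 where Value1 != 1, then grouped cumsum / shift / ffill:
# cumsum skips the NaNs, shift excludes the current row, ffill carries
# the last qualifying total forward within each group.
masked = df['Value2'].where(df['Value1'].eq(1))
out3 = masked.groupby(keys).cumsum()
out3 = out3.groupby(keys).shift()
df['Output3'] = out3.groupby(keys).ffill()
```

On the sample data this matches the desired output, including the NaNs, while avoiding per-group Python calls.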
Answer by 羁客 (2025-01-23 20:08:34)


You can add a masked column for the third output and compute a grouped, shifted cumsum:

import numpy as np

# dictionary of shift values
d_shift = {'Value1': 1, 'Value3': 1}
# dictionary of fill values
d_fill  = {'Value1': 0}

df[['Output1', 'Output2', 'Output3']] = (df
 .assign(Value3=df['Value2'].where(df['Value1'].eq(1)))
 .groupby(['ID', 'Group'])
 .transform(lambda x: x.shift(d_shift.get(x.name, 0),
                              fill_value=d_fill.get(x.name, np.nan)).cumsum())
)

Or, in linear form:

g = (df.assign(Value3=df['Value2']
       .mask(df['Value1'].ne(1))).groupby(['ID', 'Group'])
     )
df['Output1'] = g['Value1'].apply(lambda s: s.shift(fill_value=0).cumsum())
df['Output2'] = g['Value2'].cumsum()
df['Output3'] = g['Value3'].apply(lambda s: s.shift().cumsum())

Output:

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      NaN
8  ID2  Group1       0       2        0        2      NaN
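
The key move in both answers is shifting before the cumulative sum, which makes each row see only strictly-previous values. A minimal standalone illustration:

```python
import pandas as pd

s = pd.Series([2, 3, 4])

# cumsum alone includes the current row in each running total
incl = s.cumsum()            # 2, 5, 9

# shifting first pushes every value down one row, so the running total
# at each position covers only the rows strictly above it
excl = s.shift().cumsum()    # NaN, 2, 5
```

Inside a groupby, the same shift-then-cumsum pattern is applied per group, which is why the first row of every group comes out as NaN (or 0, with fill_value=0).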