Pandas: cumulatively numbering rows within one group and across another

Published on 2025-01-20 18:18:55

Given the following dataframe:

    col_1 col_2 col_3
0     1     A     1
1     1     B     1
2     2     A     3
3     2     A     3
4     2     A     3
5     2     B     3
6     2     B     3
7     2     B     3
8     3     A     2
9     3     A     2
10    3     C     2
11    3     C     2
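
For reproducibility, the frame above can be built with something like:

import pandas as pd

df = pd.DataFrame({
    'col_1': [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    'col_2': ['A', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'C'],
    'col_3': [1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
})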

I need to create a new column in which the rows are numbered cumulatively within each group formed by 'col_1' and 'col_2', while the numbering also continues cumulatively from one 'col_1' group to the next, like this:

    col_1 col_2 col_3  new
0     1     A     1     1
1     1     B     1     1
2     2     A     3     2
3     2     A     3     3
4     2     A     3     4
5     2     B     3     2
6     2     B     3     3
7     2     B     3     4
8     3     A     2     5
9     3     A     2     6
10    3     C     2     5
11    3     C     2     6

I've tried:

df['new'] = df.groupby(['col_1', 'col_2']).cumcount() + 1

But this restarts the count in every group instead of continuing from where the previous 'col_1' group left off.

2 answers

画中仙 2025-01-27 18:18:55

This is a tricky problem. You want to calculate the cumcount within each group, but for all subsequent groups you need to keep track of how much has already been counted so you know the offset to apply. That can be done with a max + cumsum of this cumcount over the previous groups. The only complication here is that you need to determine the relationship between previous and subsequent group labels, in case there isn't a simple +1 increment between the labels of subsequent groups.

# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()

# Determine how many cumcounts were within all previous groups of 'col_1'
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')
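# (Illustration with the example frame: these are the cumulative per-group
#  maxima plus 1, i.e. the value at which the *next* 'col_1' group should
#  start counting)
#        new
# col_1
# 1        2
# 2        5
# 3        7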

# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)
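# (Illustration: df1 now pairs each group's offset with the label of the
#  *next* 'col_1' group, so the merge below attaches the offset to the
#  following group)
#        new  col_1_shift
# col_1
# 1        2          2.0
# 2        5          3.0
# 3        7          NaN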

# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')

# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1.
df['new'] = df['new'].fillna(1, downcast='infer') + s

# Clean up merging column
df = df.drop(columns='col_1_shift')

    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6

最舍不得你 2025-01-27 18:18:55

You can use two consecutive groupby operations: one on the two columns, and a second one on the result, grouping only by col_1:

# classical cumcount per group
count1 = df.groupby(['col_1', 'col_2']).cumcount().add(1)
# max cumcount per group
g = count1.groupby(df['col_1']) # (*) read below
count2 = g.ngroup().map(g.max().cumsum()).fillna(0, downcast='infer')
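# (Illustration with the example data: g.max().cumsum() is indexed by col_1
#  -> {1: 1, 2: 4, 3: 6}, while g.ngroup() labels the groups 0, 1, 2, so
#  mapping group k picks up the total of all *previous* groups; the first
#  group maps to nothing and its NaN is filled with 0)
# count2 -> 0 0 1 1 1 1 1 1 4 4 4 4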
# add the two
df['new'] = count1+count2

### Note (*)
## if df['col_1'] is not of the form 1/2/3...
## use this to group instead:
# group = df['col_1'].ne(df['col_1'].shift()).cumsum()
# g = count1.groupby(group)

output:

    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6