Pandas - group / aggregate rows based on the existence of duplicates and opposites

Published 2025-02-13 21:05:05


I have a DataFrame that sometimes contains two rows for what is, in reality, a single entry. These rows can be identified as follows:

  1. Columns: Not, Strike, Cents, SD, ED are identical
  2. The ExecutionTimestamp values are within a short period of each other (< 2.5 min)
  3. For a (+) in Structure, there exists an offsetting (-)
NotStrikeCentsSDEDStructureStartDateEndDateIndexDisseminationIDExecutionTimestampEventTimestampFloatingRateResetFrequencyPeriod1TruncatedNotionalTDateID
105002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725217362022-06-28 13:41:012022-06-28 13:41:013MFalse1
115002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725308922022-06-28 13:41:012022-06-28 13:41:013MFalse1
122503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725391952022-06-28 14:36:152022-06-28 14:36:153MFalse1
132503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725338652022-06-28 14:37:112022-06-28 14:37:113MFalse1
143802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573842022-06-28 16:40:372022-06-28 16:40:373MFalse1
153802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573862022-06-28 16:40:372022-06-28 16:40:373MFalse1
161303.2237612(+)2023-06-30 00:00:002024-06-30 00:00:00USD3726560232022-06-28 16:41:002022-06-28 16:41:003MFalse1
171303.2237612(+)2023-06-30 00:00:002024-06-30 00:00:00USD3726560332022-06-28 16:41:002022-06-28 16:41:003MFalse1
181303.2237612(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726542002022-06-28 16:41:162022-06-28 16:41:163MFalse1
191303.2237612(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726542032022-06-28 16:41:162022-06-28 16:41:163MFalse1
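The three identification criteria above can be sketched as a self-join of the (+) rows against the (-) rows on the key columns, filtered to a 2.5-minute timestamp window. This is an illustrative snippet with toy data (the values and column subset are made up for the example, not taken from the real frame):

```python
import pandas as pd

# Toy data: rows 0 and 1 form an offsetting (+)/(-) pair within 2.5 minutes;
# row 2 is a lone (+) with no counterpart.
df = pd.DataFrame({
    "Not":    [130, 130, 500],
    "Strike": [3.223, 3.223, 2.9],
    "Cents":  [76, 76, 25],
    "SD":     [1, 1, 182.5],
    "ED":     [12, 12, 23],
    "Structure": ["(+)", "(-)", "(+)"],
    "ExecutionTimestamp": pd.to_datetime(
        ["2022-06-28 16:41:00", "2022-06-28 16:41:16", "2022-06-28 13:41:01"]
    ),
})

key = ["Not", "Strike", "Cents", "SD", "ED"]
plus = df[df["Structure"].eq("(+)")]
minus = df[df["Structure"].eq("(-)")]

# Criterion 1 + 3: join (+) against (-) on identical key columns.
pairs = plus.merge(minus, on=key, suffixes=("_p", "_m"))

# Criterion 2: keep only joins whose execution timestamps are within 2.5 min.
pairs = pairs[
    (pairs["ExecutionTimestamp_p"] - pairs["ExecutionTimestamp_m"]).abs()
    <= pd.Timedelta("2.5min")
]
print(pairs[key])  # only the (130, 3.223, 76, 1, 12) pair matches
```

Note that a plain `merge` can produce a cartesian product if several (+)/(-) rows share the same keys; the accepted answer's `merge_asof` avoids that by matching each (+) row to at most one nearest (-) row.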

I am looking to combine each such pair so that:

  1. Columns Not, Strike, SD, ED stay the same.
  2. Column Cents is added together
  3. Column Structure is replaced with (=)
  4. I don't care which timestamp is returned; whichever is easiest.
NotStrikeCentsSDEDStructureStartDateEndDateIndexDisseminationIDExecutionTimestampEventTimestampFloatingRateResetFrequencyPeriod1TruncatedNotionalTDateID
105002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725217362022-06-28 13:41:012022-06-28 13:41:013MFalse1
115002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725308922022-06-28 13:41:012022-06-28 13:41:013MFalse1
122503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725391952022-06-28 14:36:152022-06-28 14:36:153MFalse1
132503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725338652022-06-28 14:37:112022-06-28 14:37:113MFalse1
143802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573842022-06-28 16:40:372022-06-28 16:40:373MFalse1
153802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573862022-06-28 16:40:372022-06-28 16:40:373MFalse1
161303.22315212(=)2023-06-30 00:00:002024-06-30 00:00:00USD3726560232022-06-28 16:41:002022-06-28 16:41:003MFalse1
171303.22315212(=)2023-06-30 00:00:002024-06-30 00:00:00USD3726560332022-06-28 16:41:002022-06-28 16:41:003MFalse1


歌入人心 2025-02-20 21:05:06


IIUC, you can use a `merge_asof` on the (+) rows against the (-) rows, then insert the lone (-) values back afterwards:

import numpy as np
import pandas as pd

# key columns that must be identical for a (+)/(-) pair
cols = ['A', 'B', 'C', 'D', 'E']

df['ExecutionTimestamp'] = pd.to_datetime(df['ExecutionTimestamp'])

# identify (+) rows
m = df['F'].eq('(+)')

# pair each (+) row with the nearest (-) row: same keys, within 2.5 min
out = (pd
  .merge_asof(df[m].reset_index(), df[~m].reset_index(),
              by=cols, on='ExecutionTimestamp',
              direction='nearest', tolerance=pd.Timedelta('2.5min'),
              suffixes=(None, '_')
             )
  # matched pairs: flip Structure to '(=)' and double the Cents;
  # unmatched (+) rows (F_ is NaN) are left untouched
  .assign(F=lambda d: np.where(d['F_'].isna(), d['F'], '(=)'),
          C=lambda d: np.where(d['F_'].isna(), d['C'], d['C']*2),
         )
 )

# add back the lone (-) rows that were never matched
missing = df.index.difference(out[['index', 'index_']].stack())

out = (pd
       .concat([out.set_index('index'), df.loc[missing]])
       .drop(columns=['index_', 'F_'])
      )

output:

      A      B    C         D        E    F  ExecutionTimestamp
15  130  3.123   77  0.975342  1.97808  (+) 2022-06-28 13:41:00
16  130  3.123  154  0.975342  1.97808  (=) 2022-06-28 16:41:00
17  130  3.223  152  0.975342  1.97808  (=) 2022-06-28 16:41:00
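The pairing in this answer hinges on `merge_asof` with `direction='nearest'` and a `tolerance`: each left row picks the single closest right row within the window, and rows with no match get NaN in the right-hand columns. A minimal illustration on made-up data (the `key`/`ts`/`val` names are just for this example):

```python
import pandas as pd

left = pd.DataFrame({
    "key": ["x", "x", "y"],
    "ts": pd.to_datetime(["2022-01-01 10:00:00",
                          "2022-01-01 12:00:00",
                          "2022-01-01 10:00:00"]),
})
right = pd.DataFrame({
    "key": ["x", "y"],
    "ts": pd.to_datetime(["2022-01-01 10:01:00",
                          "2022-01-01 10:30:00"]),
    "val": [1, 2],
})

# Both frames must be sorted on the 'on' column for merge_asof.
out = pd.merge_asof(left.sort_values("ts"), right.sort_values("ts"),
                    by="key", on="ts",
                    direction="nearest", tolerance=pd.Timedelta("2.5min"))
# 'x' @ 10:00 matches 'x' @ 10:01 (1 min, within tolerance);
# 'x' @ 12:00 has no 'x' row within 2.5 min -> val is NaN;
# 'y' @ 10:00 vs 'y' @ 10:30 is outside tolerance -> val is NaN.
print(out)
```

Those NaN rows are exactly why the answer computes `missing` afterwards: unmatched rows from the right frame never appear in the `merge_asof` result and have to be concatenated back in.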
給妳壹絲溫柔 2025-02-20 21:05:06


This should allow you to get your expected results.

# group on the key columns, sum Cents (column 'C'), keep one timestamp
df = df.groupby(['A', 'B', 'D', 'E']).agg({'C': 'sum', 'ExecutionTimestamp': 'last'}).reset_index()
df['F'] = '(=)'
df[['A', 'B', 'C', 'D', 'E', 'F', 'ExecutionTimestamp']]

I wasn't sure what you meant by "keep either timestamp": whether you wanted to manually choose which of the timestamps to keep, or were willing to accept either. I assumed you would want the 'last' one in my code, but you can change that to 'first' if you would prefer the first value encountered.
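The 'first' vs. 'last' choice only affects which timestamp survives the aggregation; the summed column is identical either way. A toy illustration (column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a"],
    "Cents": [76, 76],
    "ExecutionTimestamp": pd.to_datetime(
        ["2022-06-28 16:41:00", "2022-06-28 16:41:16"]),
})

# Same sum either way; only the surviving timestamp differs.
first = df.groupby("key").agg({"Cents": "sum", "ExecutionTimestamp": "first"})
last = df.groupby("key").agg({"Cents": "sum", "ExecutionTimestamp": "last"})
print(first["ExecutionTimestamp"].iloc[0])  # 2022-06-28 16:41:00
print(last["ExecutionTimestamp"].iloc[0])   # 2022-06-28 16:41:16
```

One caveat with this groupby approach: unlike the `merge_asof` answer, it collapses every group of identical keys, so lone (+) or (-) rows without a counterpart would also be relabelled '(=)'.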
