Pandas - group / aggregate rows based on the existence of duplicates and opposites

Published 2025-02-13 21:05:05


I have a DataFrame that sometimes contains two rows for what is, in reality, a single entry. These rows can be identified as follows:

  1. Columns: Not, Strike, Cents, SD, ED are identical
  2. The ExecutionTimestamp values are within a short period of each other (< 2.5 min)
  3. For a (+) in Structure, there exists an offsetting (-)
NotStrikeCentsSDEDStructureStartDateEndDateIndexDisseminationIDExecutionTimestampEventTimestampFloatingRateResetFrequencyPeriod1TruncatedNotionalTDateID
105002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725217362022-06-28 13:41:012022-06-28 13:41:013MFalse1
115002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725308922022-06-28 13:41:012022-06-28 13:41:013MFalse1
122503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725391952022-06-28 14:36:152022-06-28 14:36:153MFalse1
132503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725338652022-06-28 14:37:112022-06-28 14:37:113MFalse1
143802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573842022-06-28 16:40:372022-06-28 16:40:373MFalse1
153802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573862022-06-28 16:40:372022-06-28 16:40:373MFalse1
161303.2237612(+)2023-06-30 00:00:002024-06-30 00:00:00USD3726560232022-06-28 16:41:002022-06-28 16:41:003MFalse1
171303.2237612(+)2023-06-30 00:00:002024-06-30 00:00:00USD3726560332022-06-28 16:41:002022-06-28 16:41:003MFalse1
181303.2237612(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726542002022-06-28 16:41:162022-06-28 16:41:163MFalse1
191303.2237612(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726542032022-06-28 16:41:162022-06-28 16:41:163MFalse1
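The three identification criteria above can be sketched as a self-join of the (+) rows against the (-) rows on the key columns, filtered to a 2.5-minute timestamp window. This is an illustrative snippet with toy data (the values and column subset are made up for the example, not taken from the real frame):

```python
import pandas as pd

# Toy data: rows 0 and 1 form an offsetting (+)/(-) pair within 2.5 minutes;
# row 2 is a lone (+) with no counterpart.
df = pd.DataFrame({
    "Not":    [130, 130, 500],
    "Strike": [3.223, 3.223, 2.9],
    "Cents":  [76, 76, 25],
    "SD":     [1, 1, 182.5],
    "ED":     [12, 12, 23],
    "Structure": ["(+)", "(-)", "(+)"],
    "ExecutionTimestamp": pd.to_datetime(
        ["2022-06-28 16:41:00", "2022-06-28 16:41:16", "2022-06-28 13:41:01"]
    ),
})

key = ["Not", "Strike", "Cents", "SD", "ED"]
plus = df[df["Structure"].eq("(+)")]
minus = df[df["Structure"].eq("(-)")]

# Criterion 1 + 3: join (+) against (-) on identical key columns.
pairs = plus.merge(minus, on=key, suffixes=("_p", "_m"))

# Criterion 2: keep only joins whose execution timestamps are within 2.5 min.
pairs = pairs[
    (pairs["ExecutionTimestamp_p"] - pairs["ExecutionTimestamp_m"]).abs()
    <= pd.Timedelta("2.5min")
]
print(pairs[key])  # only the (130, 3.223, 76, 1, 12) pair matches
```

Note that a plain `merge` can produce a cartesian product if several (+)/(-) rows share the same keys; the accepted answer's `merge_asof` avoids that by matching each (+) row to at most one nearest (-) row.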

I am looking to combine each such pair so that:

  1. Columns Not, Strike, SD, ED stay the same.
  2. Column Cents is added together
  3. Column Structure is replaced with (=)
  4. I don't care which timestamp is returned; whichever is easiest.
NotStrikeCentsSDEDStructureStartDateEndDateIndexDisseminationIDExecutionTimestampEventTimestampFloatingRateResetFrequencyPeriod1TruncatedNotionalTDateID
105002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725217362022-06-28 13:41:012022-06-28 13:41:013MFalse1
115002.925182.523(+)2024-06-30 00:00:002025-06-30 00:00:00USD3725308922022-06-28 13:41:012022-06-28 13:41:013MFalse1
122503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725391952022-06-28 14:36:152022-06-28 14:36:153MFalse1
132503.255153.512(+)2023-06-30 00:00:002024-06-30 00:00:00USD3725338652022-06-28 14:37:112022-06-28 14:37:113MFalse1
143802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573842022-06-28 16:40:372022-06-28 16:40:373MFalse1
153802.4734312(-)2023-06-30 00:00:002024-06-30 00:00:00USD3726573862022-06-28 16:40:372022-06-28 16:40:373MFalse1
161303.22315212(=)2023-06-30 00:00:002024-06-30 00:00:00USD3726560232022-06-28 16:41:002022-06-28 16:41:003MFalse1
171303.22315212(=)2023-06-30 00:00:002024-06-30 00:00:00USD3726560332022-06-28 16:41:002022-06-28 16:41:003MFalse1


歌入人心 2025-02-20 21:05:06


IIUC, you can use a `merge_asof` on the (+) rows against the (-) rows, then insert the lone (-) values back afterwards:

import numpy as np
import pandas as pd

# key columns that must be identical for a (+)/(-) pair
cols = ['A', 'B', 'C', 'D', 'E']

df['ExecutionTimestamp'] = pd.to_datetime(df['ExecutionTimestamp'])

# identify (+) rows
m = df['F'].eq('(+)')

# pair each (+) row with the nearest (-) row: same keys, within 2.5 min
out = (pd
  .merge_asof(df[m].reset_index(), df[~m].reset_index(),
              by=cols, on='ExecutionTimestamp',
              direction='nearest', tolerance=pd.Timedelta('2.5min'),
              suffixes=(None, '_')
             )
  # matched pairs: flip Structure to '(=)' and double the Cents;
  # unmatched (+) rows (F_ is NaN) are left untouched
  .assign(F=lambda d: np.where(d['F_'].isna(), d['F'], '(=)'),
          C=lambda d: np.where(d['F_'].isna(), d['C'], d['C']*2),
         )
 )

# add back the lone (-) rows that were never matched
missing = df.index.difference(out[['index', 'index_']].stack())

out = (pd
       .concat([out.set_index('index'), df.loc[missing]])
       .drop(columns=['index_', 'F_'])
      )

output:

      A      B    C         D        E    F  ExecutionTimestamp
15  130  3.123   77  0.975342  1.97808  (+) 2022-06-28 13:41:00
16  130  3.123  154  0.975342  1.97808  (=) 2022-06-28 16:41:00
17  130  3.223  152  0.975342  1.97808  (=) 2022-06-28 16:41:00
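The pairing in this answer hinges on `merge_asof` with `direction='nearest'` and a `tolerance`: each left row picks the single closest right row within the window, and rows with no match get NaN in the right-hand columns. A minimal illustration on made-up data (the `key`/`ts`/`val` names are just for this example):

```python
import pandas as pd

left = pd.DataFrame({
    "key": ["x", "x", "y"],
    "ts": pd.to_datetime(["2022-01-01 10:00:00",
                          "2022-01-01 12:00:00",
                          "2022-01-01 10:00:00"]),
})
right = pd.DataFrame({
    "key": ["x", "y"],
    "ts": pd.to_datetime(["2022-01-01 10:01:00",
                          "2022-01-01 10:30:00"]),
    "val": [1, 2],
})

# Both frames must be sorted on the 'on' column for merge_asof.
out = pd.merge_asof(left.sort_values("ts"), right.sort_values("ts"),
                    by="key", on="ts",
                    direction="nearest", tolerance=pd.Timedelta("2.5min"))
# 'x' @ 10:00 matches 'x' @ 10:01 (1 min, within tolerance);
# 'x' @ 12:00 has no 'x' row within 2.5 min -> val is NaN;
# 'y' @ 10:00 vs 'y' @ 10:30 is outside tolerance -> val is NaN.
print(out)
```

Those NaN rows are exactly why the answer computes `missing` afterwards: unmatched rows from the right frame never appear in the `merge_asof` result and have to be concatenated back in.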
給妳壹絲溫柔 2025-02-20 21:05:06


This should allow you to get your expected results.

# group on the key columns, sum Cents (column 'C'), keep one timestamp
df = df.groupby(['A', 'B', 'D', 'E']).agg({'C': 'sum', 'ExecutionTimestamp': 'last'}).reset_index()
df['F'] = '(=)'
df[['A', 'B', 'C', 'D', 'E', 'F', 'ExecutionTimestamp']]

I wasn't sure what you meant by "keep either timestamp": whether you wanted to manually choose which of the timestamps to keep, or were willing to accept either. I assumed you would want the 'last' one in my code, but you can change that to 'first' if you would prefer the first value encountered.
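The 'first' vs. 'last' choice only affects which timestamp survives the aggregation; the summed column is identical either way. A toy illustration (column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a"],
    "Cents": [76, 76],
    "ExecutionTimestamp": pd.to_datetime(
        ["2022-06-28 16:41:00", "2022-06-28 16:41:16"]),
})

# Same sum either way; only the surviving timestamp differs.
first = df.groupby("key").agg({"Cents": "sum", "ExecutionTimestamp": "first"})
last = df.groupby("key").agg({"Cents": "sum", "ExecutionTimestamp": "last"})
print(first["ExecutionTimestamp"].iloc[0])  # 2022-06-28 16:41:00
print(last["ExecutionTimestamp"].iloc[0])   # 2022-06-28 16:41:16
```

One caveat with this groupby approach: unlike the `merge_asof` answer, it collapses every group of identical keys, so lone (+) or (-) rows without a counterpart would also be relabelled '(=)'.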
