Subtract the first timestamp from the last timestamp for each ID in a pandas DataFrame

Posted on 2025-02-12 22:04:47

I have a dataframe (df) with the following structure:

retweet_datetime tweet_id tweet_datetime
2020-04-24 03:33:15 85053699 2020-04-24 02:28:22
2020-04-24 02:43:35 85053699 2020-04-24 02:28:22
2020-04-18 04:24:03 86095361 2020-04-18 00:06:01
2020-04-18 00:19:08 86095361 2020-04-18 00:06:01
2020-04-18 00:18:38 86095361 2020-04-18 00:06:01
2020-04-18 00:07:08 86095361 2020-04-18 00:06:01

Within each tweet_id, retweet_datetime is sorted from the newest to the oldest retweet.
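
To make the question reproducible, the sample frame can be rebuilt as in the sketch below. It assumes both timestamp columns should be real datetime64 values (converted with pd.to_datetime) rather than strings, which the subtraction in the answers requires; the variable name rows is only illustrative.

```python
import pandas as pd

# Sample rows copied from the table above: (retweet_datetime, tweet_id, tweet_datetime)
rows = [
    ("2020-04-24 03:33:15", 85053699, "2020-04-24 02:28:22"),
    ("2020-04-24 02:43:35", 85053699, "2020-04-24 02:28:22"),
    ("2020-04-18 04:24:03", 86095361, "2020-04-18 00:06:01"),
    ("2020-04-18 00:19:08", 86095361, "2020-04-18 00:06:01"),
    ("2020-04-18 00:18:38", 86095361, "2020-04-18 00:06:01"),
    ("2020-04-18 00:07:08", 86095361, "2020-04-18 00:06:01"),
]
df = pd.DataFrame(rows, columns=["retweet_datetime", "tweet_id", "tweet_datetime"])

# Convert the string columns to datetime64 so they can be subtracted
for col in ("retweet_datetime", "tweet_datetime"):
    df[col] = pd.to_datetime(df[col])
```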

I'd like to create two new columns as follows:

  1. tweet_lifetime1: the difference between the last retweet time and the first retweet time, i.e., for each tweet_id: last retweet_datetime - first retweet_datetime
  2. tweet_lifetime2: the difference between the last retweet time and tweet creation time (tweet_datetime)

Update

For example, for the tweet id: "86095361":

  • tweet_lifetime1 = 2020-04-18 04:24:03 - 2020-04-18 00:07:08 (04:16:55)
  • tweet_lifetime2 = 2020-04-18 04:24:03 - 2020-04-18 00:06:01 (04:18:02)

The expected output df:

retweet_datetime tweet_id tweet_datetime lifetime1 lifetime2
2020-04-24 03:33:15 85053699 2020-04-24 02:28:22 00:49:40 01:04:53
2020-04-18 04:24:03 86095361 2020-04-18 00:06:01 04:16:55 04:18:02

I've seen several similar posts, but they mostly subtract consecutive rows. For example, I can compute the difference between consecutive retweet_datetime values for each tweet id as follows:

df2 = df.assign(delta = df.groupby('tweet_id')['retweet_datetime'].diff())

天涯离梦残月幽梦 2025-02-19 22:04:47

Use named aggregation to take the first and last retweet_datetime per tweet_id, then subtract the columns with Series.sub. DataFrame.pop returns the helper column tmp and drops it from df1 in the same step:

df1 = (df.groupby('tweet_id', as_index=False)
         .agg(retweet_datetime=('retweet_datetime','first'),
              tmp = ('retweet_datetime','last'),
              tweet_datetime = ('tweet_datetime','last')))

df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp'))
df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime'])
print (df1)
   tweet_id    retweet_datetime      tweet_datetime       lifetime1  \
0  85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 0 days 00:49:40   
1  86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 0 days 04:16:55   

        lifetime2  
0 0 days 01:04:53  
1 0 days 04:18:02  

If you need the result formatted as HH:MM:SS, replace the two assignments above with:

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))

df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp')).apply(f)
df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime']).apply(f)
print (df1)
   tweet_id    retweet_datetime      tweet_datetime lifetime1 lifetime2
0  85053699 2020-04-24 03:33:15 2020-04-24 02:28:22  00:49:40  01:04:53
1  86095361 2020-04-18 04:24:03 2020-04-18 00:06:01  04:16:55  04:18:02
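
As a possible alternative to calling f row by row with apply, the same HH:MM:SS strings can be built with vectorised Series operations. This is only a sketch: the helper name format_hms is hypothetical, and it assumes non-negative durations and that hours are allowed to exceed 24 instead of rolling over into days.

```python
import pandas as pd

def format_hms(td: pd.Series) -> pd.Series:
    """Format a timedelta Series as HH:MM:SS strings (assumes non-negative values)."""
    total = td.dt.total_seconds().astype(int)   # whole seconds per row
    hours = total // 3600
    minutes = (total % 3600) // 60
    seconds = total % 60
    # Zero-pad each part to two digits and join with colons
    return (hours.astype(str).str.zfill(2) + ":"
            + minutes.astype(str).str.zfill(2) + ":"
            + seconds.astype(str).str.zfill(2))

# Small check on hand-written durations (not taken from the data above)
sample = pd.Series(pd.to_timedelta(["0 days 00:49:40", "0 days 04:18:02", "1 days 01:00:00"]))
formatted = format_hms(sample)
```

With this helper, a timedelta column such as lifetime1 could be converted with format_hms(df1['lifetime1']) after the subtraction has been done.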
じ违心 2025-02-19 22:04:47

Use groupby with first to select the first row per group, then aggregate retweet_datetime with last to select the last value per group. Join the two results and compute lifetime1 and lifetime2 by subtracting the relevant columns:

g = df.groupby('tweet_id')
s = g.first().join(g['retweet_datetime'].agg(['last']))

s['lifetime1'] = s['retweet_datetime'] - s.pop('last')
s['lifetime2'] = s['retweet_datetime'] - s['tweet_datetime']

            retweet_datetime      tweet_datetime       lifetime1       lifetime2
tweet_id                                                                        
85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 0 days 00:49:40 0 days 01:04:53
86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 0 days 04:16:55 0 days 04:18:02
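
Both answers use first and last, so they depend on the rows being sorted from newest to oldest within each tweet_id. If that ordering cannot be guaranteed, a variant with max and min gives the same result regardless of row order. The sketch below is not from either answer; the column names last_retweet and first_retweet are illustrative, and the sample rows are deliberately shuffled.

```python
import pandas as pd

# Sample data from the question, deliberately out of order
df = pd.DataFrame({
    "retweet_datetime": pd.to_datetime([
        "2020-04-18 00:19:08", "2020-04-24 02:43:35", "2020-04-18 04:24:03",
        "2020-04-18 00:07:08", "2020-04-24 03:33:15", "2020-04-18 00:18:38",
    ]),
    "tweet_id": [86095361, 85053699, 86095361, 86095361, 85053699, 86095361],
    "tweet_datetime": pd.to_datetime([
        "2020-04-18 00:06:01", "2020-04-24 02:28:22", "2020-04-18 00:06:01",
        "2020-04-18 00:06:01", "2020-04-24 02:28:22", "2020-04-18 00:06:01",
    ]),
})

# max/min pick the latest and earliest retweet without relying on sort order
out = (df.groupby("tweet_id", as_index=False)
         .agg(last_retweet=("retweet_datetime", "max"),
              first_retweet=("retweet_datetime", "min"),
              tweet_datetime=("tweet_datetime", "first")))

out["lifetime1"] = out["last_retweet"] - out["first_retweet"]
out["lifetime2"] = out["last_retweet"] - out["tweet_datetime"]
```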