如何最好地找到所需事件序列之间的时间差？

发布于 2025-01-13 13:13:40 字数 1442 浏览 1 评论 0原文

假设我有以下数据框：

import pandas as pd
df = pd.DataFrame(columns=['Time', 'Event'])
df.Time = ['06:15.2', '06:15.2', '06:45.3', '43:26.8', '43:26.8', '43:57.9', '27:30.0', '27:31.1', '27:31.1', '27:31.1', '27:31.1', '32:41.0', '10:22.6', '10:22.6', '10:54.7', '11:30.3', '11:30.3']
df.Event = ['login', 'other', 'logout', 'login', 'other', 'logout', 'login', 'other', 'other', 'login', 'logout', 'logout', 'login', 'other', 'logout', 'login', 'other']

我想查找登录和登录后立即注销之间的时间差。也就是说，如果有多次登录，然后一次注销，我想找到最后一次登录和注销之间的时间差。

我的尝试如下：

import numpy as np
import datetime

diff = np.zeros(len(df))

for i in range(len(df)):
    if df.Event[i]=='login':
        for j in range(len(df[i:])):
            if df.Event[j]=='logout':
                diff[j] = (datetime.datetime.strptime(df.Time[j], '%M:%S.%f') - datetime.datetime.strptime(df.Time[i], '%M:%S.%f'))/ datetime.timedelta(milliseconds=1)
                
df['Diff'] = diff
df

但它不会产生正确的结果：

正确的结果应该是：Diff 列中的非零数字应该为 30.1、31.1、 0.0, 0.0 (不知道如何处理这种情况，因为这个注销是在注销之后发生的...考虑到成对登录和注销会增加另一个级别的复杂性), 32.1

是否有任何 Pythonic 方法可以做到这一点？

原文

Say I have the following dataframe:

import pandas as pd
df = pd.DataFrame(columns=['Time', 'Event'])
df.Time = ['06:15.2', '06:15.2', '06:45.3', '43:26.8', '43:26.8', '43:57.9', '27:30.0', '27:31.1', '27:31.1', '27:31.1', '27:31.1', '32:41.0', '10:22.6', '10:22.6', '10:54.7', '11:30.3', '11:30.3']
df.Event = ['login', 'other', 'logout', 'login', 'other', 'logout', 'login', 'other', 'other', 'login', 'logout', 'logout', 'login', 'other', 'logout', 'login', 'other']

I want to find the time difference between a login and the logout that comes immediately after the login. That is, if there are multiple logins and then one logout, I want to find the time difference between the last login and the logout.

My attempt to do this is as follows:

import numpy as np
import datetime

diff = np.zeros(len(df))

for i in range(len(df)):
    if df.Event[i]=='login':
        for j in range(len(df[i:])):
            if df.Event[j]=='logout':
                diff[j] = (datetime.datetime.strptime(df.Time[j], '%M:%S.%f') - datetime.datetime.strptime(df.Time[i], '%M:%S.%f'))/ datetime.timedelta(milliseconds=1)
                
df['Diff'] = diff
df

But it doesn't produce the right result:

The correct result should be: the non-zero numbers in the Diff column should be 30.1, 31.1, 0.0, 0.0 (not sure how to handle this case, as this logout comes after a logout...considering pairwise logins and logouts would be adding another level of complexity), 32.1

Is there any Pythonic way of doing this?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

困倦 2025-01-20 13:13:40

我有一些非常长且低效的代码，但它有效。第一个函数稍后用于将时间转换为您想要的字符串格式。 Times 和 Events 列将转换为列表，然后进行列表理解以将时间字符串转换为分钟。

# str is in HH:MM:SS.MS format. this_time is in minutes.
def convert_time_to_str(this_time):

    seconds = (this_time % 1)*60
    ms = int(round((seconds % 1) * 1000, 0))
    seconds = int(seconds // 1)
    
    if this_time < 1:
        return str(seconds) + "." + str(ms)
        
    minutes = this_time % 1    
    hours = int(minutes // 60)
    minutes = int(minutes - hours*60)
    
    if minutes < 60:
        return str(minutes) + ":" + str(seconds) + "." + str(ms)
    else:
        return str(hours) + ":" + str(minutes) + ":" + str(seconds) + "." + str(ms)

times_list = df["Time"].tolist()
events = df["Event"].tolist()
times = []

# turning times_list into numbers (in terms of minutes).
for i in range(len(times_list)):
        
    this_time_str = times_list[i]
    this_time_split = this_time_str.split(":")
    
    s, ms = this_time_split[-1].split(".")
    
    if len(this_time_split) == 2:
        h = 0
        m = int(this_time_split[0])
    else:
        h = int(this_time_split[0])
        m = int(this_time_split[1])
         
    # times are in terms of minutes, since that seems to be the relevant scale in your case.
    this_time = int(h)*60 + int(m) + int(s)/60 + int(ms)/(60*1000)
    times.append(this_time)

# Finding time differences between login and logout.
final_time_diffs = [0]*len(events)
for i in range(1, len(events)):
    
    # iterate through rows corresponding to a logout.
    if events[i] == "logout":
        
        # j is how many indices before i (the logout row) the closest login is.
        last_login_idx = ""
        j = 1
        found_login = False
        
        while last_login_idx == "" and j <= i:
            previous_event = events[i-j]

            # If previous event was a logout, ignore this entire logout row.
            if previous_event == "logout":
                break
            elif previous_event == "login":
                last_login_idx = i - j
                found_login = True
                break
            j += 1
        
        # Find the time_diff between logout and closest previous login.
        if found_login:
            this_time_diff = times[i] - times[last_login_idx]
            this_time_diff_str = convert_time_to_str(this_time_diff)
            final_time_diffs[i] = this_time_diff_str
        else:
            final_time_diffs[i] = 0

# Name this column whatever you want; main thing is that it'll contain the time differences between login and logout. If there are 2 logouts in a row, the 2nd logout will have time_diff 0.
df["TimeDiffs"] = final_time_diffs

我还没有测试代码是否可以忍受 Times 列中的小时值，但理论上是可以的。它无法处理滚动到第二天的时间，所以要注意！另外，如果小时值非常大，则此代码将不必要地消耗内存来存储较大的时间，因为时间以分钟为单位；在这种情况下，我建议分块工作，并从所有相关时间中减去一定的小时数。

I've got some very long and inefficient code, but it works. The first function is used later to turn times into the string format you want. The Times and Events columns are turned into lists and then list comprehension is done to turn the time strings into minutes.

# str is in HH:MM:SS.MS format. this_time is in minutes.
def convert_time_to_str(this_time):

    seconds = (this_time % 1)*60
    ms = int(round((seconds % 1) * 1000, 0))
    seconds = int(seconds // 1)
    
    if this_time < 1:
        return str(seconds) + "." + str(ms)
        
    minutes = this_time % 1    
    hours = int(minutes // 60)
    minutes = int(minutes - hours*60)
    
    if minutes < 60:
        return str(minutes) + ":" + str(seconds) + "." + str(ms)
    else:
        return str(hours) + ":" + str(minutes) + ":" + str(seconds) + "." + str(ms)

times_list = df["Time"].tolist()
events = df["Event"].tolist()
times = []

# turning times_list into numbers (in terms of minutes).
for i in range(len(times_list)):
        
    this_time_str = times_list[i]
    this_time_split = this_time_str.split(":")
    
    s, ms = this_time_split[-1].split(".")
    
    if len(this_time_split) == 2:
        h = 0
        m = int(this_time_split[0])
    else:
        h = int(this_time_split[0])
        m = int(this_time_split[1])
         
    # times are in terms of minutes, since that seems to be the relevant scale in your case.
    this_time = int(h)*60 + int(m) + int(s)/60 + int(ms)/(60*1000)
    times.append(this_time)

# Finding time differences between login and logout.
final_time_diffs = [0]*len(events)
for i in range(1, len(events)):
    
    # iterate through rows corresponding to a logout.
    if events[i] == "logout":
        
        # j is how many indices before i (the logout row) the closest login is.
        last_login_idx = ""
        j = 1
        found_login = False
        
        while last_login_idx == "" and j <= i:
            previous_event = events[i-j]

            # If previous event was a logout, ignore this entire logout row.
            if previous_event == "logout":
                break
            elif previous_event == "login":
                last_login_idx = i - j
                found_login = True
                break
            j += 1
        
        # Find the time_diff between logout and closest previous login.
        if found_login:
            this_time_diff = times[i] - times[last_login_idx]
            this_time_diff_str = convert_time_to_str(this_time_diff)
            final_time_diffs[i] = this_time_diff_str
        else:
            final_time_diffs[i] = 0

# Name this column whatever you want; main thing is that it'll contain the time differences between login and logout. If there are 2 logouts in a row, the 2nd logout will have time_diff 0.
df["TimeDiffs"] = final_time_diffs

I haven't tested whether the code can put up with hours values in the Times column, but theoretically, it should. It cannot deal with times rolling over to the next day, so be aware! Also, if the hour values are very large, this code will unnecessarily eat up memory just to store the large times, since times are in terms of minutes; in that case, I'd advise working in chunks and subtracting a certain number of hours from all the times in question.

回复收藏 0 原文

疏忽 2025-01-20 13:13:40

import pandas as pd
df = pd.DataFrame(columns=['Time', 'Event'])
df.Time = ['06:15.2', '06:15.2', '06:45.3', '43:26.8', '43:26.8', '43:57.9', '27:30.0', '27:31.1', '27:31.1', '27:31.1', '27:31.1', '32:41.0', '10:22.6', '10:22.6', '10:54.7', '11:30.3', '11:30.3']
df.Event = ['login', 'other', 'logout', 'login', 'other', 'logout', 'login', 'other', 'other', 'login', 'logout', 'logout', 'login', 'other', 'logout', 'login', 'other']

df["Time"] = df["Time"].apply(lambda x : int(x[0:2])*60 + float(x[3:]))
df["Time1"] = df["Time"].shift(1).fillna(0)
df1 = df[df["Event"].isin(["logout","login"])]
df1["Event1"] = df1["Event"].shift(1).fillna(0)

df2 = df1[(df1["Event"] == "logout") & (df1["Event1"] == "login")].assign( Duration = df1["Time"] - df1["Time1"])
df.merge(df2, how = "left").drop(["Time1","Event1"], axis = 1).fillna(0)

这是输出：

Out[53]: 
      Time   Event  Duration
0    375.2   login       0.0
1    375.2   other       0.0
2    405.3  logout      30.1
3   2606.8   login       0.0
4   2606.8   other       0.0
5   2637.9  logout      31.1
6   1650.0   login       0.0
7   1651.1   other       0.0
8   1651.1   other       0.0
9   1651.1   login       0.0
10  1651.1  logout       0.0
11  1961.0  logout       0.0
12   622.6   login       0.0
13   622.6   other       0.0
14   654.7  logout      32.1
15   690.3   login       0.0
16   690.3   other       0.0

import pandas as pd
df = pd.DataFrame(columns=['Time', 'Event'])
df.Time = ['06:15.2', '06:15.2', '06:45.3', '43:26.8', '43:26.8', '43:57.9', '27:30.0', '27:31.1', '27:31.1', '27:31.1', '27:31.1', '32:41.0', '10:22.6', '10:22.6', '10:54.7', '11:30.3', '11:30.3']
df.Event = ['login', 'other', 'logout', 'login', 'other', 'logout', 'login', 'other', 'other', 'login', 'logout', 'logout', 'login', 'other', 'logout', 'login', 'other']

df["Time"] = df["Time"].apply(lambda x : int(x[0:2])*60 + float(x[3:]))
df["Time1"] = df["Time"].shift(1).fillna(0)
df1 = df[df["Event"].isin(["logout","login"])]
df1["Event1"] = df1["Event"].shift(1).fillna(0)

df2 = df1[(df1["Event"] == "logout") & (df1["Event1"] == "login")].assign( Duration = df1["Time"] - df1["Time1"])
df.merge(df2, how = "left").drop(["Time1","Event1"], axis = 1).fillna(0)

Here is the output :

Out[53]: 
      Time   Event  Duration
0    375.2   login       0.0
1    375.2   other       0.0
2    405.3  logout      30.1
3   2606.8   login       0.0
4   2606.8   other       0.0
5   2637.9  logout      31.1
6   1650.0   login       0.0
7   1651.1   other       0.0
8   1651.1   other       0.0
9   1651.1   login       0.0
10  1651.1  logout       0.0
11  1961.0  logout       0.0
12   622.6   login       0.0
13   622.6   other       0.0
14   654.7  logout      32.1
15   690.3   login       0.0
16   690.3   other       0.0

回复收藏 0 原文

~没有更多了~