基于大熊猫数据框中的ID列的多个日期之间的差异

发布于 2025-01-21 22:22:13 字数 502 浏览 2 评论 0原文

我的实际情况是,在一家商店中,我想知道第一次访问和第二次访问之间的访问期(在第二次访问,第二次和第三次访问之间,...

我有一个带有2列的Python数据集(访问IDS)对于每个客户和访问日期)

data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}

”“在此处输入图像描述”

我的问题:第一次访问与第二次访问之间有几天的时间?第二次访问和第三次访问之间的同样问题...

My practical case is the following, in a store I would like to know visit period (in day) between the 1st visit and the 2nd visit, the 2nd and the 3rd,...

I have a python dataset with 2 columns (visit IDs for each customer and a date of visit)

data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}

enter image description here

My question : How many days are there between the 1st visit and the 2nd visit for customers who have come 4 times? same question between the 2nd and the 3rd visit...

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

我一直都在从未离去 2025-01-28 22:22:13

您期望的输出尚不清楚,但是让我们以客户为索引来计算一个2D表,访问为列:

# convert to datetime (assuming DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# setting up group
g = df.groupby('Id')['Date']

df2 = (df
 .sort_values(by='Id')      # ensure the dates are ordered
 .assign(visit=g.cumcount().add(1),      # visit number
         nb_visits=g.transform('count'), # total number of visits
         diff=g.diff()                   # difference between successive visits
         )
 .query('nb_visits >= 4')   # filter to keep customers with at least 4 visits
 .pivot(index='Id', columns='visit', values='diff')  # reshape to 2D
)

输出:

这给出了自上次访问以来每个客户(如果更多3个访问)以来的天数

visit   1      2      3      4
Id                            
A     NaT 4 days 4 days 2 days

nb 。您可以在此处删除第一列,因为它始终是未定义的(NAT)

The output you expect is unclear, but let's compute a 2D table with the customers as index and the visits as columns:

# convert to datetime (assuming DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# setting up group
g = df.groupby('Id')['Date']

df2 = (df
 .sort_values(by='Id')      # ensure the dates are ordered
 .assign(visit=g.cumcount().add(1),      # visit number
         nb_visits=g.transform('count'), # total number of visits
         diff=g.diff()                   # difference between successive visits
         )
 .query('nb_visits >= 4')   # filter to keep customers with at least 4 visits
 .pivot(index='Id', columns='visit', values='diff')  # reshape to 2D
)

output:

This gives the number of days since the previous visit for each customer (if more that 3 visits)

visit   1      2      3      4
Id                            
A     NaT 4 days 4 days 2 days

NB. you could remove the first column here as it will always be undefined (NaT)

萌无敌 2025-01-28 22:22:13

您可以首先将它们施放到pd.timestamp对象,然后对ID和日期进行排序

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Id', 'Date'])

是同一用户:

df['Date Since Last Visit'] = (df['Date'] - df['Date'].shift(1)) * (df['Id'] == df['Id'].shift())
df['Date Since Last Visit'] = df['Date Since Last Visit'].fillna(pd.Timedelta(0))

,您可以简单地计算时间戳中的差异,乘以布尔式蒙版,是否 将为您提供序列对象,您可以看到已经有多少天了。

您的输出将是:


    Id  Date        Date Since Last Visit
0   A   2022-03-01  0 days
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
1   B   2022-03-03  0 days
3   B   2022-03-07  4 days

或者仅过滤出第一次访问:

df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)
df = df[df['Id'] == df['Id'].shift()]

它给您:

    Id  Date        Date Since Last Visit
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
3   B   2022-03-07  4 days

You can first cast them to pd.Timestamp objects and sort the ids and dates by

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Id', 'Date'])

Then, you can simply calculate the difference in the timestamps, multiplied by the boolean mask of whether it is the same user:

df['Date Since Last Visit'] = (df['Date'] - df['Date'].shift(1)) * (df['Id'] == df['Id'].shift())
df['Date Since Last Visit'] = df['Date Since Last Visit'].fillna(pd.Timedelta(0))

Which will give you the timedelta objects that you can see how many days it has been.

Your output will be:


    Id  Date        Date Since Last Visit
0   A   2022-03-01  0 days
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
1   B   2022-03-03  0 days
3   B   2022-03-07  4 days

Or just filter out the first visits after:

df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)
df = df[df['Id'] == df['Id'].shift()]

Which gives you:

    Id  Date        Date Since Last Visit
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
3   B   2022-03-07  4 days
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文