根据多个参数查找多个事件之间的时间差(以天为单位)
当电子邮件发送到特定客户时,我有 2 个 DataFrame,其中包含 Date
- 特定 Territory
上的 Account:External ID
df 名称是 emails
帐户:外部 ID
- 只是一个客户 ID(客户的编码名称)。两个 DataFrame 中可能存在或不存在相同的 ID。
emails
Territory Account: External ID Date type
5 1 1 2021-01-25 email
29 1 1 2021-05-29 email
15 1 2 2021-03-15 email
18 1 2 2021-04-02 email
24 1 2 2021-04-29 email
0 1 3 2021-01-01 email
2 1 3 2021-01-10 email
6 1 3 2021-01-25 email
11 1 3 2021-02-19 email
13 1 4 2021-03-03 email
21 1 4 2021-04-12 email
4 1 5 2021-01-22 email
26 1 5 2021-05-10 email
8 2 2 2021-01-30 email
9 2 2 2021-02-07 email
10 2 2 2021-02-10 email
20 2 3 2021-04-08 email
7 2 4 2021-01-28 email
17 2 5 2021-03-27 email
28 2 5 2021-05-22 email
25 3 1 2021-05-07 email
27 3 1 2021-05-12 email
22 3 2 2021-04-14 email
23 3 4 2021-04-23 email
3 3 5 2021-01-15 email
1 4 1 2021-01-05 email
12 4 3 2021-02-23 email
16 4 4 2021-03-20 email
14 4 5 2021-03-06 email
19 4 5 2021-04-07 email
第二个 df 具有完全相同的字段,但包含有关与客户交互的信息 对于第一个 DataFrame 中的每一行,我需要找到电子邮件
Territory Account: External ID Date type
11 1 2 2021-06-01 visit
0 2 2 2021-01-02 visit
8 3 1 2021-04-11 visit
10 3 1 2021-05-17 visit
5 3 3 2021-02-26 visit
1 3 4 2021-01-14 visit
4 3 4 2021-02-20 visit
2 4 1 2021-01-30 visit
3 4 2 2021-02-07 visit
7 4 2 2021-03-31 visit
9 4 2 2021-04-28 visit
6 4 4 2021-03-12 visit
发送和交互之间的天数。 我想分别查看交互之前或之后的天数。
应该有更优雅的方式
我设法编写了完美运行的代码,但我相信我的代码
#concant and sort 2 DataFrame
emails_visits=pd.concat([emails,visits],ignore_index=True)
emails_visits = emails_visits.sort_values(by=["Territory", "Account: External ID ", "Date"])
#Intermidiate step 1
emails_visits["date_after"] = emails_visits["Date"].where(emails_visits["type"]=="visit")
emails_visits["date_before"] = emails_visits["Date"].where(emails_visits["type"]=="visit")
#Fill columns with date
emails_visits["date_after"] = emails_visits.groupby(["Territory", "Account: External ID"])["date_after"].ffill()
emails_visits["date_before"] = emails_visits.groupby(["Territory", "Account: External ID"])["date_before"].bfill()
#calcate Difference in number of days
emails_visits["date_after"]=emails_visits.groupby(["Territory","Account: External ID ",])["date_after"].bfill()
emails_visits["date_before"]=emails_visits.groupby(["Territory","Account: External ID ",])["date_before"].ffill()
emails_visits["days_before_visit"]=emails_visits["date_before"]-emails_visits["Date"]
emails_visits["days_after_visit"]=emails_visits["date_after"]-emails_visits["Date"]
:输出
Territory Account: External ID Date type date_after date_before days_before_visit days_after_visit
0 1 1 2021-01-25 email NaT NaT NaT NaT
1 1 1 2021-05-29 email NaT NaT NaT NaT
2 1 2 2021-03-15 email 2021-06-01 NaT NaT 78 days
3 1 2 2021-04-02 email 2021-06-01 NaT NaT 60 days
4 1 2 2021-04-29 email 2021-06-01 NaT NaT 33 days
30 1 2 2021-06-01 visit 2021-06-01 2021-06-01 0 days 0 days
5 1 3 2021-01-01 email NaT NaT NaT NaT
6 1 3 2021-01-10 email NaT NaT NaT NaT
7 1 3 2021-01-25 email NaT NaT NaT NaT
8 1 3 2021-02-19 email NaT NaT NaT NaT
9 1 4 2021-03-03 email NaT NaT NaT NaT
10 1 4 2021-04-12 email NaT NaT NaT NaT
11 1 5 2021-01-22 email NaT NaT NaT NaT
12 1 5 2021-05-10 email NaT NaT NaT NaT
31 2 2 2021-01-02 visit 2021-01-02 2021-01-02 0 days 0 days
13 2 2 2021-01-30 email NaT 2021-01-02 -28 days NaT
14 2 2 2021-02-07 email NaT 2021-01-02 -36 days NaT
15 2 2 2021-02-10 email NaT 2021-01-02 -39 days NaT
16 2 3 2021-04-08 email NaT NaT NaT NaT
17 2 4 2021-01-28 email NaT NaT NaT NaT
18 2 5 2021-03-27 email NaT NaT NaT NaT
19 2 5 2021-05-22 email NaT NaT NaT NaT
32 3 1 2021-04-11 visit 2021-04-11 2021-04-11 0 days 0 days
20 3 1 2021-05-07 email 2021-05-17 2021-04-11 -26 days 10 days
21 3 1 2021-05-12 email 2021-05-17 2021-04-11 -31 days 5 days
33 3 1 2021-05-17 visit 2021-05-17 2021-05-17 0 days 0 days
22 3 2 2021-04-14 email NaT NaT NaT NaT
34 3 3 2021-02-26 visit 2021-02-26 2021-02-26 0 days 0 days
35 3 4 2021-01-14 visit 2021-01-14 2021-01-14 0 days 0 days
36 3 4 2021-02-20 visit 2021-02-20 2021-02-20 0 days 0 days
23 3 4 2021-04-23 email NaT 2021-02-20 -62 days NaT
24 3 5 2021-01-15 email NaT NaT NaT NaT
25 4 1 2021-01-05 email 2021-01-30 NaT NaT 25 days
37 4 1 2021-01-30 visit 2021-01-30 2021-01-30 0 days 0 days
38 4 2 2021-02-07 visit 2021-02-07 2021-02-07 0 days 0 days
39 4 2 2021-03-31 visit 2021-03-31 2021-03-31 0 days 0 days
40 4 2 2021-04-28 visit 2021-04-28 2021-04-28 0 days 0 days
26 4 3 2021-02-23 email NaT NaT NaT NaT
41 4 4 2021-03-12 visit 2021-03-12 2021-03-12 0 days 0 days
27 4 4 2021-03-20 email NaT 2021-03-12 -8 days NaT
28 4 5 2021-03-06 email NaT NaT NaT NaT
29 4 5 2021-04-07 email NaT NaT NaT NaT
这正是我期望的输出,但我希望它可以做得更简单。
I have 2 DataFrames first with Date
when email was sent to certain client - Account: External ID
on certain Territory
df name is emails
Account: External ID
- is just a customer ID (coded name of a client). Same ID could be present or not in both DataFrames.
emails
Territory Account: External ID Date type
5 1 1 2021-01-25 email
29 1 1 2021-05-29 email
15 1 2 2021-03-15 email
18 1 2 2021-04-02 email
24 1 2 2021-04-29 email
0 1 3 2021-01-01 email
2 1 3 2021-01-10 email
6 1 3 2021-01-25 email
11 1 3 2021-02-19 email
13 1 4 2021-03-03 email
21 1 4 2021-04-12 email
4 1 5 2021-01-22 email
26 1 5 2021-05-10 email
8 2 2 2021-01-30 email
9 2 2 2021-02-07 email
10 2 2 2021-02-10 email
20 2 3 2021-04-08 email
7 2 4 2021-01-28 email
17 2 5 2021-03-27 email
28 2 5 2021-05-22 email
25 3 1 2021-05-07 email
27 3 1 2021-05-12 email
22 3 2 2021-04-14 email
23 3 4 2021-04-23 email
3 3 5 2021-01-15 email
1 4 1 2021-01-05 email
12 4 3 2021-02-23 email
16 4 4 2021-03-20 email
14 4 5 2021-03-06 email
19 4 5 2021-04-07 email
Second df with exactly same fields but with info about interaction with customervisits
Territory Account: External ID Date type
11 1 2 2021-06-01 visit
0 2 2 2021-01-02 visit
8 3 1 2021-04-11 visit
10 3 1 2021-05-17 visit
5 3 3 2021-02-26 visit
1 3 4 2021-01-14 visit
4 3 4 2021-02-20 visit
2 4 1 2021-01-30 visit
3 4 2 2021-02-07 visit
7 4 2 2021-03-31 visit
9 4 2 2021-04-28 visit
6 4 4 2021-03-12 visit
For each row in first DataFrame I need to find number of days between email send and interaction.
I want separately see number of days before or after interaction.
I managed to write code that works perfectly but I believe there should be more elegant way
my code is here:
#concant and sort 2 DataFrame
emails_visits=pd.concat([emails,visits],ignore_index=True)
emails_visits = emails_visits.sort_values(by=["Territory", "Account: External ID ", "Date"])
#Intermidiate step 1
emails_visits["date_after"] = emails_visits["Date"].where(emails_visits["type"]=="visit")
emails_visits["date_before"] = emails_visits["Date"].where(emails_visits["type"]=="visit")
#Fill columns with date
emails_visits["date_after"] = emails_visits.groupby(["Territory", "Account: External ID"])["date_after"].ffill()
emails_visits["date_before"] = emails_visits.groupby(["Territory", "Account: External ID"])["date_before"].bfill()
#calcate Difference in number of days
emails_visits["date_after"]=emails_visits.groupby(["Territory","Account: External ID ",])["date_after"].bfill()
emails_visits["date_before"]=emails_visits.groupby(["Territory","Account: External ID ",])["date_before"].ffill()
emails_visits["days_before_visit"]=emails_visits["date_before"]-emails_visits["Date"]
emails_visits["days_after_visit"]=emails_visits["date_after"]-emails_visits["Date"]
output
Territory Account: External ID Date type date_after date_before days_before_visit days_after_visit
0 1 1 2021-01-25 email NaT NaT NaT NaT
1 1 1 2021-05-29 email NaT NaT NaT NaT
2 1 2 2021-03-15 email 2021-06-01 NaT NaT 78 days
3 1 2 2021-04-02 email 2021-06-01 NaT NaT 60 days
4 1 2 2021-04-29 email 2021-06-01 NaT NaT 33 days
30 1 2 2021-06-01 visit 2021-06-01 2021-06-01 0 days 0 days
5 1 3 2021-01-01 email NaT NaT NaT NaT
6 1 3 2021-01-10 email NaT NaT NaT NaT
7 1 3 2021-01-25 email NaT NaT NaT NaT
8 1 3 2021-02-19 email NaT NaT NaT NaT
9 1 4 2021-03-03 email NaT NaT NaT NaT
10 1 4 2021-04-12 email NaT NaT NaT NaT
11 1 5 2021-01-22 email NaT NaT NaT NaT
12 1 5 2021-05-10 email NaT NaT NaT NaT
31 2 2 2021-01-02 visit 2021-01-02 2021-01-02 0 days 0 days
13 2 2 2021-01-30 email NaT 2021-01-02 -28 days NaT
14 2 2 2021-02-07 email NaT 2021-01-02 -36 days NaT
15 2 2 2021-02-10 email NaT 2021-01-02 -39 days NaT
16 2 3 2021-04-08 email NaT NaT NaT NaT
17 2 4 2021-01-28 email NaT NaT NaT NaT
18 2 5 2021-03-27 email NaT NaT NaT NaT
19 2 5 2021-05-22 email NaT NaT NaT NaT
32 3 1 2021-04-11 visit 2021-04-11 2021-04-11 0 days 0 days
20 3 1 2021-05-07 email 2021-05-17 2021-04-11 -26 days 10 days
21 3 1 2021-05-12 email 2021-05-17 2021-04-11 -31 days 5 days
33 3 1 2021-05-17 visit 2021-05-17 2021-05-17 0 days 0 days
22 3 2 2021-04-14 email NaT NaT NaT NaT
34 3 3 2021-02-26 visit 2021-02-26 2021-02-26 0 days 0 days
35 3 4 2021-01-14 visit 2021-01-14 2021-01-14 0 days 0 days
36 3 4 2021-02-20 visit 2021-02-20 2021-02-20 0 days 0 days
23 3 4 2021-04-23 email NaT 2021-02-20 -62 days NaT
24 3 5 2021-01-15 email NaT NaT NaT NaT
25 4 1 2021-01-05 email 2021-01-30 NaT NaT 25 days
37 4 1 2021-01-30 visit 2021-01-30 2021-01-30 0 days 0 days
38 4 2 2021-02-07 visit 2021-02-07 2021-02-07 0 days 0 days
39 4 2 2021-03-31 visit 2021-03-31 2021-03-31 0 days 0 days
40 4 2 2021-04-28 visit 2021-04-28 2021-04-28 0 days 0 days
26 4 3 2021-02-23 email NaT NaT NaT NaT
41 4 4 2021-03-12 visit 2021-03-12 2021-03-12 0 days 0 days
27 4 4 2021-03-20 email NaT 2021-03-12 -8 days NaT
28 4 5 2021-03-06 email NaT NaT NaT NaT
29 4 5 2021-04-07 email NaT NaT NaT NaT
this is exactly output I expect but I hope it could be done simpler.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论