在Pyspark上拿到客户购买的重新度
我正在处理的数据集示例:
data = [(1, "2021-11-08"), (1, "2021-11-06"), (1, "2021-10-08"), (2, "2021-11-01"), (2, "2021-10-20"),
(2, "2021-08-05"), (3, "2021-08-02"), (3, "2021-05-08"), (3, "2021-03-01")]
df = spark.createDataFrame(data=data,schema=columns)
我想拿起客户的最新购买(客户的date.max())和先前的最高购买(倒数第二次购买)并进行两者之间的区别(我在所有购买中都假设相同的产品)。我仍然没有在Pyspark中找到这样做的东西。我的想法的一个例子是在组中进行此操作,例如以下最低日期和最高日期。
df1 = df.withColumn('data',to_date(df.data))
dados_agrupados_item = df1.groupBy(["cliente"]).agg(max("data"), \
min("data"), \
)
输出为:
对于我的问题,输出将是该客户的最大日期和倒数第二个购买日期。输出应为:
另一种方法也可以直接提供这两个日期之间的差异。但是我不知道如何实施它。
非常感谢您。
The sample of the dataset I am working on:
data = [(1, "2021-11-08"), (1, "2021-11-06"), (1, "2021-10-08"), (2, "2021-11-01"), (2, "2021-10-20"),
(2, "2021-08-05"), (3, "2021-08-02"), (3, "2021-05-08"), (3, "2021-03-01")]
df = spark.createDataFrame(data=data,schema=columns)
I'd like to take the customer's most recent purchase (the customer's date.max()) and the previous maximum purchase (the penultimate purchase) and take the difference between the two (I'm assuming the same product on all purchases). I still haven't found something in pyspark that does this. One example of my idea was to do this in a groupby, like a below with the minimum date and maximum date.
df1 = df.withColumn('data',to_date(df.data))
dados_agrupados_item = df1.groupBy(["cliente"]).agg(max("data"), \
min("data"), \
)
The output is:
For my problem, the output would be the maximum date and the penultimate purchase date from that customer. The output should be:
Another method could also be directly delivering the difference between these two dates. But I have no idea how to implement it.
Thank you very much in advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
使用collect_list组的组来收集每个组的所有日期,
使用反向/array_sort来强制数组的降序。
参考第一和第二购买。 (我们希望他们已经购买了两次,或者我们需要更复杂的逻辑来处理它。)
Use group by with collect_list to collect all dates per group
Use reverse/array_sort to enforce descending order of the array.
Reference the first and second purchases. (We hope they have purchased twice or we'd need more complex logic to handle it.)