在Pyspark上拿到客户购买的重新度

发布于 2025-02-01 09:09:15 字数 1228 浏览 2 评论 0原文

我正在处理的数据集示例：

data = [(1, "2021-11-08"), (1, "2021-11-06"), (1, "2021-10-08"), (2, "2021-11-01"), (2, "2021-10-20"), 
        (2, "2021-08-05"), (3, "2021-08-02"), (3, "2021-05-08"), (3, "2021-03-01")]

df = spark.createDataFrame(data=data,schema=columns)

我想拿起客户的最新购买（客户的date.max（））和先前的最高购买（倒数第二次购买）并进行两者之间的区别（我在所有购买中都假设相同的产品）。我仍然没有在Pyspark中找到这样做的东西。我的想法的一个例子是在组中进行此操作，例如以下最低日期和最高日期。



df1 = df.withColumn('data',to_date(df.data))

dados_agrupados_item = df1.groupBy(["cliente"]).agg(max("data"), \
         min("data"), \
     )

输出为：

对于我的问题，输出将是该客户的最大日期和倒数第二个购买日期。输出应为：

另一种方法也可以直接提供这两个日期之间的差异。但是我不知道如何实施它。

非常感谢您。

原文

The sample of the dataset I am working on:

data = [(1, "2021-11-08"), (1, "2021-11-06"), (1, "2021-10-08"), (2, "2021-11-01"), (2, "2021-10-20"), 
        (2, "2021-08-05"), (3, "2021-08-02"), (3, "2021-05-08"), (3, "2021-03-01")]

df = spark.createDataFrame(data=data,schema=columns)

I'd like to take the customer's most recent purchase (the customer's date.max()) and the previous maximum purchase (the penultimate purchase) and take the difference between the two (I'm assuming the same product on all purchases). I still haven't found something in pyspark that does this. One example of my idea was to do this in a groupby, like a below with the minimum date and maximum date.



df1 = df.withColumn('data',to_date(df.data))

dados_agrupados_item = df1.groupBy(["cliente"]).agg(max("data"), \
         min("data"), \
     )

The output is:

For my problem, the output would be the maximum date and the penultimate purchase date from that customer. The output should be:

Another method could also be directly delivering the difference between these two dates. But I have no idea how to implement it.

Thank you very much in advance.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小苏打饼 2025-02-08 09:09:16

使用collect_list组的组来收集每个组的所有日期，

使用反向/array_sort来强制数组的降序。

参考第一和第二购买。（我们希望他们已经购买了两次，或者我们需要更复杂的逻辑来处理它。）

df1.groupBy(["Client"])\
.agg( 
  reverse( # collect all the dates in an array in descending order from most recent to oldest.
    array_sort(
      collect_list( df1.data ))).alias("dates") )\
.select( 
  col("Client"), 
  col("dates")[0].alias("last_purchase"), # get last purchase
  col("dates")[1].alias("penultimate")   )\ #get last purchase + 1
.show()
+------+-------------+-----------+
|Client|last_purchase|penultimate|
+------+-------------+-----------+
|     1|   2021-11-08| 2021-11-06|
|     3|   2021-08-02| 2021-05-08|
|     2|   2021-11-01| 2021-10-20|
+------+-------------+-----------+

Use group by with collect_list to collect all dates per group

Use reverse/array_sort to enforce descending order of the array.

Reference the first and second purchases. (We hope they have purchased twice or we'd need more complex logic to handle it.)

df1.groupBy(["Client"])\
.agg( 
  reverse( # collect all the dates in an array in descending order from most recent to oldest.
    array_sort(
      collect_list( df1.data ))).alias("dates") )\
.select( 
  col("Client"), 
  col("dates")[0].alias("last_purchase"), # get last purchase
  col("dates")[1].alias("penultimate")   )\ #get last purchase + 1
.show()
+------+-------------+-----------+
|Client|last_purchase|penultimate|
+------+-------------+-----------+
|     1|   2021-11-08| 2021-11-06|
|     3|   2021-08-02| 2021-05-08|
|     2|   2021-11-01| 2021-10-20|
+------+-------------+-----------+

回复收藏 0 原文

~没有更多了~