How do I count, in SQL, the orders among a customer's last two orders whose value is less than the current order's value?

Posted on 2025-02-13 01:54:45


Consider that I have a table

Customer  Day  Amount
A         4    96
A         22   63
A         32   19
A         50   27
A         57   99
A         72   93
B         69   97
B         82   22
B         87   64
C         22   60
C         30   22
C         48   74
C         49   68
C         55   11
C         85   79

I need to calculate, for every order of a customer, how many of his past two orders have an amount less than the value of the present order. For example, for customer A on day 50 (amount 27), the two preceding amounts are 63 and 19; only 19 is smaller, so the count is 1. I.e.

Customer  Day  Amount  Count
A         4    96      0
A         22   63      0
A         32   19      0
A         50   27      1
A         57   99      2
A         72   93      1
B         69   97      0
B         82   22      0
B         87   64      1
C         22   60      0
C         30   22      0
C         48   74      2
C         49   68      1
C         55   11      0
C         85   79      2
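
To make the rule concrete, here is a minimal pandas sketch of the Count I expect (the inline DataFrame and the count_smaller_in_last_two helper are only for illustration, not part of my actual setup):

import pandas as pd

df = pd.DataFrame({
    'Customer': ['A']*6 + ['B']*3 + ['C']*6,
    'Day':      [4, 22, 32, 50, 57, 72, 69, 82, 87, 22, 30, 48, 49, 55, 85],
    'Amount':   [96, 63, 19, 27, 99, 93, 97, 22, 64, 60, 22, 74, 68, 11, 79],
})

def count_smaller_in_last_two(amounts):
    # For each order, count how many of the previous two amounts are smaller than it.
    return pd.Series(
        [sum(v < cur for v in amounts.iloc[max(0, i - 2):i]) for i, cur in enumerate(amounts)],
        index=amounts.index,
    )

df = df.sort_values(['Customer', 'Day'])
df['Count'] = df.groupby('Customer')['Amount'].transform(count_smaller_in_last_two)
print(df)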

3 Answers

香橙ぽ 2025-02-20 01:54:46


Since you only need the 2 PRECEDING rows, you can pre-compute the LAG values and utilise a CASE statement to generate the required logic. (For the first rows of each customer the LAG values are NULL, so both comparisons evaluate to false and the count falls through to 0.)

Otherwise, if you want it to be more dynamic and flexible in nature, you can utilise Shubham Sharma's answer.

Data Preparation

from io import StringIO

import pandas as pd

s = StringIO("""
Customer    Day Amount
A   4   96
A   22  63
A   32  19
A   50  27
A   57  99
A   72  93
B   69  97
B   82  22
B   87  64
C   22  60
C   30  22
C   48  74
C   49  68
C   55  11
C   85  79
"""
)

# whitespace-separated sample data; `sql` is the active SparkSession (SQLContext on older Spark)
df = pd.read_csv(s, sep=r'\s+')

sparkDF = sql.createDataFrame(df)

sparkDF.show()

+--------+---+------+
|Customer|Day|Amount|
+--------+---+------+
|       A|  4|    96|
|       A| 22|    63|
|       A| 32|    19|
|       A| 50|    27|
|       A| 57|    99|
|       A| 72|    93|
|       B| 69|    97|
|       B| 82|    22|
|       B| 87|    64|
|       C| 22|    60|
|       C| 30|    22|
|       C| 48|    74|
|       C| 49|    68|
|       C| 55|    11|
|       C| 85|    79|
+--------+---+------+

SparkSQL

# registerTempTable is deprecated on Spark 2.0+; createOrReplaceTempView("TB1") is the current equivalent
sparkDF.registerTempTable("TB1")

sql.sql("""
SELECT
    CUSTOMER,
    DAY,
    AMOUNT,
    CASE 
        WHEN AMOUNT > LAG_1 AND AMOUNT > LAG_2 THEN 2
        WHEN AMOUNT > LAG_1 OR AMOUNT > LAG_2 THEN 1
        ELSE 0
    END as Count
FROM (
    SELECT
        CUSTOMER,
        DAY,
        AMOUNT,
        LAG(AMOUNT,1) OVER(PARTITION BY CUSTOMER ORDER BY DAY) as lag_1,
        LAG(AMOUNT,2) OVER(PARTITION BY CUSTOMER ORDER BY DAY) as lag_2
    FROM TB1
)
ORDER BY 1,2
;
""").show()

+--------+---+------+-----+
|CUSTOMER|DAY|AMOUNT|Count|
+--------+---+------+-----+
|       A|  4|    96|    0|
|       A| 22|    63|    0|
|       A| 32|    19|    0|
|       A| 50|    27|    1|
|       A| 57|    99|    2|
|       A| 72|    93|    1|
|       B| 69|    97|    0|
|       B| 82|    22|    0|
|       B| 87|    64|    1|
|       C| 22|    60|    0|
|       C| 30|    22|    0|
|       C| 48|    74|    2|
|       C| 49|    68|    1|
|       C| 55|    11|    0|
|       C| 85|    79|    2|
+--------+---+------+-----+

宁静的妩媚 2025-02-20 01:54:46


We can do this with when/otherwise and lag:

from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col, when
schema="Customer string,day int,Amount int"

data=[('A',4,96),('A',22,63),('A',32,19),('A',50,27),('A',57,99),('A',72,93),('B',69,97),('B',82,22),('B',87,64),('C',22,60),('C',30,22),('C',48,74),('C',49,68),('C',55,11),('C',85,79)]
dql=spark.createDataFrame(data,schema)
# amounts of the previous two orders per customer, ordered by day
w=Window.partitionBy("Customer").orderBy("day")
prev1=F.lag("Amount",1).over(w)
prev2=F.lag("Amount",2).over(w)
dql.withColumn("count",when((col('Amount')>prev2) & (col('Amount')>prev1),2).when((col('Amount')>prev2) | (col('Amount')>prev1),1).otherwise(0)).show()

#output
+--------+---+------+-----+
|Customer|day|Amount|count|
+--------+---+------+-----+
|       A|  4|    96|    0|
|       A| 22|    63|    0|
|       A| 32|    19|    0|
|       A| 50|    27|    1|
|       A| 57|    99|    2|
|       A| 72|    93|    1|
|       B| 69|    97|    0|
|       B| 82|    22|    0|
|       B| 87|    64|    1|
|       C| 22|    60|    0|
|       C| 30|    22|    0|
|       C| 48|    74|    2|
|       C| 49|    68|    1|
|       C| 55|    11|    0|
|       C| 85|    79|    2|
+--------+---+------+-----+

暖伴 2025-02-20 01:54:46


Let us use collect_list over a Window to collect the amounts of the past two orders, then aggregate to count the values in the list which are less than the Amount of the current row:

from pyspark.sql import functions as F
from pyspark.sql import Window

# frame covering the previous two orders of the same customer, ordered by Day
w = Window.partitionBy('Customer').orderBy('Day').rowsBetween(-2, -1)

# F.aggregate requires Spark >= 3.1
df.withColumn(
    'Count',
    F.aggregate(
        F.collect_list('Amount').over(w), F.lit(0),
        lambda acc, x: acc + (x < F.col('Amount')).cast('int')
    )
).show()

+--------+---+------+-----+
|Customer|Day|Amount|Count|
+--------+---+------+-----+
|       A|  4|    96|    0|
|       A| 22|    63|    0|
|       A| 32|    19|    0|
|       A| 50|    27|    1|
|       A| 57|    99|    2|
|       A| 72|    93|    1|
|       B| 69|    97|    0|
|       B| 82|    22|    0|
|       B| 87|    64|    1|
|       C| 22|    60|    0|
|       C| 30|    22|    0|
|       C| 48|    74|    2|
|       C| 49|    68|    1|
|       C| 55|    11|    0|
|       C| 85|    79|    2|
+--------+---+------+-----+
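
As a variant of the same idea (a sketch assuming the same df, the same window w, F imported as above, and Spark >= 3.1), you could also filter the collected list down to the amounts below the current one and take its size:

# Alternative sketch: keep only the collected amounts smaller than the current row's Amount,
# then count them with size(). Reuses `df` and `w` from the snippet above.
df.withColumn(
    'Count',
    F.size(
        F.filter(
            F.collect_list('Amount').over(w),
            lambda x: x < F.col('Amount')
        )
    )
).show()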