Using window functions in PySpark
I have a PySpark DataFrame, and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know whether any of the first three months has an amount greater than 0. If so, the Flag column should be 1 for every row of that group; otherwise it should be 0.
I will include an example to clarify.
Initial PySpark Dataframe:
Group | Month | Amount |
---|---|---|
A | 1 | 0 |
A | 2 | 0 |
A | 3 | 35 |
A | 4 | 0 |
A | 5 | 0 |
B | 1 | 0 |
B | 2 | 0 |
C | 1 | 0 |
C | 2 | 0 |
C | 3 | 0 |
C | 4 | 13 |
D | 1 | 0 |
D | 2 | 24 |
D | 3 | 0 |
Final PySpark Dataframe:
Group | Month | Amount | Flag |
---|---|---|---|
A | 1 | 0 | 1 |
A | 2 | 0 | 1 |
A | 3 | 35 | 1 |
A | 4 | 0 | 1 |
A | 5 | 0 | 1 |
B | 1 | 0 | 0 |
B | 2 | 0 | 0 |
C | 1 | 0 | 0 |
C | 2 | 0 | 0 |
C | 3 | 0 | 0 |
C | 4 | 13 | 0 |
D | 1 | 0 | 1 |
D | 2 | 24 | 1 |
D | 3 | 0 | 1 |
In other words, for each group I want to sum the amounts of the first 3 months. If that sum is greater than 0, the flag is 1 for all rows of the group; otherwise it is 0.
Comments (2)
You can create the `flag` column by applying a `Window` function. Create a pseudo-column that becomes 1 if the criterion is met, then sum over the pseudo-column. If the sum is greater than 0, at least one row met the criterion, so set `flag` to 1.
You can use a Window function with `count` and `when`.