Conditional formatting on the final DataFrame after joining two DataFrames
PySpark DataFrame Scenario:
- There is a DataFrame called DF. Its two main columns are ID and Date.
- Each ID has on average 40+ unique Dates (the dates are not continuous).
- There is a second DataFrame called DF_date, which has a single column named Date. The dates in DF_date range between the minimum and maximum of Date in DF.
- The goal is to fill DF with every date between the start and end date of each unique ID (the missing dates are filled in with a left join between DF_date and DF).
DF
+-------------+-------------+----------------+
|         Date|          Val|              ID|
+-------------+-------------+----------------+
|   2021-07-01|     81119.73|         Ax3838J|
|   2021-07-04|     81289.62|         Ax3838J|
|   2021-07-05|     81385.62|         Ax3838J|
|   2021-07-02|     81249.76|         Bz3838J|
|   2021-07-05|     81324.28|         Bz3838J|
|   2021-07-06|     81329.28|         Bz3838J|
+-------------+-------------+----------------+
DF_date
+-------------+
|         Date|
+-------------+
|   2021-07-01|
|   2021-07-02|
|   2021-07-03|
|   2021-07-04|
|   2021-07-05|
|   2021-07-06|
+-------------+
Expected Final Output:
+-------------+-------------+----------------+
|         Date|          Val|              ID|
+-------------+-------------+----------------+
|   2021-07-01|     81119.73|         Ax3838J|
|   2021-07-02|     81119.73|         Ax3838J|
|   2021-07-03|     81119.73|         Ax3838J|
|   2021-07-04|     81289.62|         Ax3838J|
|   2021-07-05|     81385.62|         Ax3838J|
|   2021-07-02|     81249.76|         Bz3838J|
|   2021-07-03|     81249.76|         Bz3838J|
|   2021-07-04|     81249.76|         Bz3838J|
|   2021-07-05|     81324.28|         Bz3838J|
|   2021-07-06|     81329.28|         Bz3838J|
+-------------+-------------+----------------+
Comments (2)
Your question doesn't make sense. Why have a DF_date dataframe with start and end dates, use it to fill in the dates, and then resort to using the DF start and end dates? Why not just fill in the missing dates using the DF min and max date for each group?
Anyway, this is how you fill in missing dates based on DF_date.
Following your comments, see my edits.
In the above question, I later realised that, as @wwnde suggested, there is no need to create a separate DF for the dates.
The code provided below serves the purpose too: