AWS Glue does not give consistent results for pyspark orderBy
When running pyspark locally I get correct results, with the list ordered by BOOK_ID, but when the job is deployed on AWS Glue the books do not appear to be ordered.
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- element: struct
 |    |    |-- BOOK_ID: integer
 |    |    |-- BOOK_NAME: string
from pyspark.sql import functions as F

# Sort all rows by BOOK_ID descending, then collect each author's
# books into a single BOOK_LIST column.
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .orderBy(F.col("BOOK_ID").desc())
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
          )
Note: I'm using pyspark 3.2.1 and Glue 2.0.
Any suggestions, please?
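Not part of the original thread, but relevant background: Spark only guarantees the ordering produced by orderBy within the resulting partitions, and the shuffle triggered by groupBy does not preserve it, which is why a single-machine run can look sorted while a multi-worker Glue run does not. One deterministic alternative is to sort the collected array itself with sort_array (it orders structs by their first field, BOOK_ID here). A minimal sketch, assuming the question's df_authors and df_books:

from pyspark.sql import functions as F

# Collect first, then sort the array itself; sorting after collection
# does not depend on row order surviving the shuffle.
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.sort_array(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")),
                            asc=False).alias("BOOK_LIST"))
          )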
Comments (2)
Supposition
Although I managed to run the job on Glue 3.0, which supports Spark 3.1 (see Migrating from AWS Glue 2.0 to AWS Glue 3.0), orderBy was still giving wrong results.
The workaround that seems to give a good result is to reduce the number of workers to 2, which is the minimum allowed number of workers.
Suggested solution
Apply .coalesce(1) to each dataframe before the join. This gets the right result, but in this case we lose performance.
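A minimal sketch of that suggestion (not verbatim from the answer), assuming the question's df_authors and df_books:

from pyspark.sql import functions as F

# Per the answer: coalesce both inputs to a single partition before
# the join. The answer reports this yields correctly ordered lists, at
# the cost of parallelism; note that Spark itself does not guarantee
# collect_list order after a shuffle.
result = (df_authors.coalesce(1)
          .join(df_books.coalesce(1), on=["AUTHOR_ID"], how="left")
          .orderBy(F.col("BOOK_ID").desc())
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
          )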
I'm trying to simplify the issue, work with me:
Let's create a sample dataframe and run the same aggregation; I'm getting exactly what you are allegedly asking for, as sketched below.
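The answer's code blocks did not survive in this copy; a minimal reconstruction consistent with its description, with hypothetical in-line sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample standing in for the answer's lost snippet.
df = spark.createDataFrame(
    [(1, "Alice", 10, "B10"),
     (1, "Alice", 20, "B20"),
     (2, "Bob", 5, "B05")],
    ["AUTHOR_ID", "NAME", "BOOK_ID", "BOOK_NAME"],
)

# Same pattern as the question: order first, then collect.
result = (df.orderBy(F.col("BOOK_ID").desc())
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
          )
result.show(truncate=False)
# On a small single-partition sample the lists come out ordered by
# BOOK_ID descending, matching the asker's local behaviour.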