Spark repartition, sortWithinPartitions, and partitionBy mess up my sort order during write
I have Scala Spark code that writes a DataFrame to CSV files. The code is shown below:

dataframe
  .select("path", "id", "top_path")
  .repartition(1, col("top_path"))
  .sortWithinPartitions("path")
  .write
  .partitionBy("top_path")
  .option("delimiter", "\t")
  .mode("overwrite")
  .csv(outputPath)

Even though I call sortWithinPartitions on the path column, some of the output still isn't sorted as expected. Does anyone know why this happens and how it can be fixed? I have tried sortWithinPartitions("top_path", "path"), but that still didn't sort by path properly when writing. I expect the output to be sorted in ascending order by path. For example, in some cases I am seeing output like

path1 1
path1/subpath1 2
path1 3
path1/subpath2 4

instead of

path1 1
path1 3
path1/subpath1 2
path1/subpath2 4
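
To make "sorted as expected" concrete, here is a minimal sketch of a check that could be run over the lines of a written part file. It is plain Scala with no Spark dependency. The names PathOrderCheck and isSortedByPath are hypothetical, and it assumes tab-delimited lines with path as the first field and that plain String comparison matches the intended ordering (reasonable for ASCII paths like the ones above).

```scala
object PathOrderCheck {
  // Hypothetical helper: true when the first tab-separated field (assumed to be
  // the path column) never decreases from one line to the next.
  def isSortedByPath(lines: Seq[String]): Boolean = {
    val paths = lines.map(_.split("\t", -1)(0))
    paths.zip(paths.drop(1)).forall { case (a, b) => a.compareTo(b) <= 0 }
  }

  def main(args: Array[String]): Unit = {
    // The two orderings from the question, with a tab between path and id.
    val observed = Seq("path1\t1", "path1/subpath1\t2", "path1\t3", "path1/subpath2\t4")
    val expected = Seq("path1\t1", "path1\t3", "path1/subpath1\t2", "path1/subpath2\t4")

    println(isSortedByPath(observed)) // false: "path1" follows "path1/subpath1"
    println(isSortedByPath(expected)) // true
  }
}
```

Rows with the same path (path1 with ids 1 and 3) are allowed in any relative order by this check, since the question only asks for ordering by path.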
Comments (1)
My guess would be that partitionBy resets any order that you had before. Try partitionBy and then sortBy.
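
A rough sketch of what "partitionBy and then sortBy" could look like on the writer side. As far as I know, DataFrameWriter.sortBy only works together with bucketBy and saveAsTable, so it cannot simply be chained before .csv(outputPath). The sketch below therefore writes a CSV-format table instead of a bare path. The object name, the helper writeSortedByPath, the table name passed in, and the choice of a single bucket on path are all assumptions for illustration, not something from the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object BucketedSortedWrite {
  // Hypothetical helper: writes df as a CSV-format table, partitioned by
  // top_path, with rows in each bucket file sorted by path.
  // Assumption: one bucket on "path", so the bucket id does not split rows further.
  def writeSortedByPath(df: DataFrame, tableName: String): Unit =
    df.select("path", "id", "top_path")
      .write
      .partitionBy("top_path")
      .bucketBy(1, "path")   // sortBy requires bucketBy
      .sortBy("path")        // sort order inside each bucket file
      .format("csv")
      .option("delimiter", "\t")
      .mode(SaveMode.Overwrite)
      .saveAsTable(tableName) // bucketed writes go through saveAsTable, not save/csv
}
```

Usage would be something like BucketedSortedWrite.writeSortedByPath(dataframe, "paths_sorted"). Note that the files land under the table's warehouse location rather than outputPath, and each write task still produces its own file per partition directory, so the repartition step from the question may still be wanted upstream.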