Spark range partitioning for an unbalanced DataFrame
I have a DataFrame with the following schema:
provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]
I need to save this DataFrame partitioned by provider_id and quadkey5.
In general, the data field is quite similar in size across all providers.
However, some providers (certain provider_id values) have data arrays 100x-1000x bigger than the others.
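One way to put numbers on that skew is to aggregate the array lengths per provider. The following is only a sketch: the object name SkewReport and the output column names are made up for illustration, and it assumes the DataFrame has the provider_id and data columns from the schema above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, count, size, sum}

object SkewReport {
  // Hypothetical helper: summarises how large the `data` arrays are per provider.
  // Assumes `df` has the columns provider_id (Int) and data (Array[...]).
  def arrayLengthByProvider(df: DataFrame): DataFrame =
    df.groupBy(col("provider_id"))
      .agg(
        count("*").as("records"),                    // rows per provider
        avg(size(col("data"))).as("avg_array_len"),  // mean array length per row
        sum(size(col("data"))).as("total_elements")  // total array elements
      )
      .orderBy(col("avg_array_len").desc)            // heaviest providers first
}
```

Calling SkewReport.arrayLengthByProvider(df).show() would list the "fat" providers at the top. Array length is only a proxy for bytes on disk, so the real file sizes may differ.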
I am trying to get a balanced dataset with the following code:
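import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{substring, when}
import spark.implicits._ // enables the $"col" syntax; assumes a SparkSession named spark

df
  // Finer range key (14 chars) for the heavy provider, coarser (10 chars) for the rest
  .withColumn("qk_for_range",
    when($"provider_id" === high_freq_provider_id, substring($"quadkey18", 1, 14))
      .otherwise(substring($"quadkey18", 1, 10)))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .repartitionByRange(nrPartitions, $"provider_id", $"qk_for_range")
  .drop("qk_for_range")
  .write
  .partitionBy("provider_id", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", maxCountInPartition.toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)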
However, I get really huge Parquet files per partition (~1 GB), whereas I want smaller ones (~200 MB).
I can decrease the "maxRecordsPerFile" option, but then I would get a lot of small files for all the "light" providers (those with a small data array per record).
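To make that trade-off concrete, here is a rough back-of-the-envelope sketch in plain Scala. The record sizes are hypothetical (a 2 KB "light" record and a "heavy" record 500x bigger, which sits inside the 100x-1000x range mentioned above); they are not measured from my data.

```scala
object FileSizeTradeoff {
  // Hypothetical average on-disk record sizes in bytes (assumptions, not measurements).
  val lightRecordBytes: Long = 2L * 1024              // ~2 KB per "light" record
  val heavyRecordBytes: Long = lightRecordBytes * 500 // 500x bigger "heavy" record

  val targetFileBytes: Long = 200L * 1024 * 1024      // desired file size, ~200 MB

  // Record cap that would produce a file of roughly the target size.
  def recordsForTarget(recordBytes: Long): Long = targetFileBytes / recordBytes

  // Approximate file size when a file is filled up to the record cap.
  def fileBytes(maxRecords: Long, recordBytes: Long): Long = maxRecords * recordBytes

  def main(args: Array[String]): Unit = {
    val capForLight = recordsForTarget(lightRecordBytes)
    val capForHeavy = recordsForTarget(heavyRecordBytes)
    println(s"cap tuned for light records: $capForLight")
    println(s"cap tuned for heavy records: $capForHeavy")
    // A cap tuned for heavy records leaves light files tiny ...
    println(s"light file at heavy cap: ${fileBytes(capForHeavy, lightRecordBytes)} bytes")
    // ... and a cap tuned for light records lets heavy files grow far past the target.
    println(s"heavy file at light cap: ${fileBytes(capForLight, heavyRecordBytes)} bytes")
  }
}
```

Under these assumed sizes, the cap that gives ~200 MB files for the heavy provider (204 records) yields light files of about 408 KB, while the cap that suits light providers (102400 records) would allow heavy files of tens of gigabytes if a partition held that many records. A single global "maxRecordsPerFile" value cannot fit both.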
My question is: how can I break down the "fat" partitions?