Spark range partitioning for an imbalanced DataFrame

Posted on 2025-01-21 06:52:09

I have a DataFrame with the following schema:

provider_id: Int,
quadkey18: String,
data: Array[ComplexObject]

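I need to save this DataFrame partitioned by provider_id and quadkey5 (the first 5 characters of quadkey18). In general, the data field is quite similar in size across all providers. However, some providers (certain provider_id values) have a data array 100x-1000x larger than the others.

I am trying to get a balanced dataset with the following code: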

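import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{substring, when}
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark

// high_freq_provider_id, nrPartitions, maxCountInPartition and exportUrl are defined elsewhere
df
  // finer range key (14 chars) for the heavy provider, coarser (10 chars) for the rest
  .withColumn("qk_for_range",
    when($"provider_id" === high_freq_provider_id, substring($"quadkey18", 1, 14))
      .otherwise(substring($"quadkey18", 1, 10)))
  .withColumn("quadkey5", substring($"quadkey18", 1, 5))
  .repartitionByRange(nrPartitions, $"provider_id", $"qk_for_range")
  .drop("qk_for_range")
  .write
  .partitionBy("provider_id", "quadkey5")
  .format("parquet")
  .option("compression", "gzip")
  .option("maxRecordsPerFile", maxCountInPartition.toInt)
  .mode(SaveMode.Overwrite)
  .save(exportUrl)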

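However, I get very large Parquet files (~1 GB) when I want smaller ones (~200 MB). I could lower the "maxRecordsPerFile" option, but then I would get a lot of small files for all the "light" providers (those that have a small data array per record).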
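
To make the trade-off with a single record cap concrete, here is a rough back-of-the-envelope sketch in plain Scala. The record sizes (2 KiB for a "light" record, 500x that for a "heavy" one) are assumed purely for illustration and are not measured from my data, and the names RecordCapTradeoff, fileSizeBytes and capForTarget are hypothetical. The sketch also ignores Parquet encoding and gzip compression, so real file sizes would differ.

```scala
// Illustrative arithmetic only: all sizes below are assumptions, not measurements.
object RecordCapTradeoff {
  val MB: Long = 1024L * 1024L

  // Approximate upper bound on a file's size when the writer stops at the record cap.
  def fileSizeBytes(maxRecordsPerFile: Long, avgRecordBytes: Long): Long =
    maxRecordsPerFile * avgRecordBytes

  // Record cap that would land near a target file size for a given average record size.
  def capForTarget(targetBytes: Long, avgRecordBytes: Long): Long =
    math.max(1L, targetBytes / avgRecordBytes)

  def main(args: Array[String]): Unit = {
    val lightRecordBytes = 2L * 1024L              // assumed size of a "light" record
    val heavyRecordBytes = lightRecordBytes * 500L // assumed: 500x larger (within 100x-1000x)
    val target = 200L * MB                         // desired file size

    val capForLight = capForTarget(target, lightRecordBytes)
    val capForHeavy = capForTarget(target, heavyRecordBytes)

    // A cap tuned for light providers lets heavy files grow far past the target.
    println(s"cap tuned for light records: $capForLight")
    println(s"  heavy file bound (MB): ${fileSizeBytes(capForLight, heavyRecordBytes) / MB}")

    // A cap tuned for heavy providers fragments light providers into tiny files.
    println(s"cap tuned for heavy records: $capForHeavy")
    println(s"  light file bound (bytes): ${fileSizeBytes(capForHeavy, lightRecordBytes)}")
  }
}
```

Under these assumed numbers, no single value of "maxRecordsPerFile" fits both kinds of provider, because the cap counts records rather than bytes.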

My question is: how can I break down the "fat" partitions?

