Difference between <spark dataframe>.write.parquet(<directory>) and <spark dataframe>.write.parquet(<filename>.parquet)

Posted on 2025-02-03 22:42:38


I've finally been introduced to parquet and am trying to understand it better. I realize that when running Spark it is best to have at least as many parquet files (partitions) as you have cores to utilize Spark to its fullest. However, are there any advantages/disadvantages to making one large parquet file vs. several smaller parquet files to store the data?

As a test I'm using this dataset:
https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.parquet

This is the code I'm testing with:

import pyspark
from pyspark.sql import SparkSession

# Build a local SparkSession that uses all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Read the trip data, then write it back out two ways
df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')
df.write.parquet('test.parquet')  # path that ends in .parquet
df.write.parquet('./test')        # plain directory path

When I ls -lh the files I see that the test.parquet file is 4.0K, and the two files created by writing to a directory are 2.5K and 189M.

When I read these back into different dataframes they have the same count.
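For reference, a minimal sketch of that read-back check, reusing the spark session and output paths from the snippet above (the variable names df_file and df_dir are just illustrative):

# Read both outputs back and compare their row counts
df_file = spark.read.parquet('test.parquet')
df_dir = spark.read.parquet('./test')

print(df_file.count(), df_dir.count())                                 # same row count
print(df_file.rdd.getNumPartitions(), df_dir.rdd.getNumPartitions())   # partitions after reading back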


When is it best practice to do one over the other? What is the best practice for balancing file sizes when writing to a directory, and should you balance them at all? Any guidance/rules of thumb for writing/reading parquet files would be greatly appreciated.


Comments (1)

救赎№ 2025-02-10 22:42:38


In Spark you can use repartition to break the data into nearly equal chunks. As suggested in Databricks training, you can pick the number of cores and use that number to repartition your file, since the default shuffle partition setting is 200, which is a bit high unless a lot of data is present.

One specific gotcha with repartition is when your dataframe has complex data types whose values vary widely in size; for that you can refer to this question on Stack Overflow.
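For illustration, here is a minimal sketch of the repartition-to-cores approach described above, using the dataset from the question. The output path, the num_cores variable, and the choice to also lower spark.sql.shuffle.partitions are placeholders/assumptions, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('repartition-example') \
    .getOrCreate()

# Use the parallelism of the local session (roughly the number of cores)
# instead of relying on the default of 200 shuffle partitions.
num_cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", num_cores)

df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')

# Write roughly equal-sized files, one per core (output path is hypothetical)
df.repartition(num_cores).write.mode('overwrite').parquet('./test_repartitioned')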
