How can I best append data to a Parquet file using PySpark?

Posted 2025-01-31 09:02:02

I have a parquet file called customerActions. Every day I add 1000 lines there using this syntax:

spark.sql('select * from customerActions').write.mode('append').parquet("/Staging/Mind/customerActions/")

Now I'm facing the following problem: reading this data takes a long time because the directory contains many small files, since every day I append only a small amount of data to "/Staging/Mind/customerActions/".

How can I make reading the file "/Staging/Mind/customerActions/" faster?

Comments (2)

柒七 2025-02-07 09:02:02

One way to improve this speed is to coalesce the small files into larger ones: read the existing path into a DataFrame, run repartition on it, and write it back, which should ideally improve read performance.
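
A minimal sketch of such a compaction job, assuming the path from the question; the target of 8 output files and the "/Staging/Mind/customerActions_compacted/" directory are illustrative choices, not part of the original answer. Spark cannot safely overwrite a Parquet path it is still reading from, so the compacted copy goes to a separate directory and the directories are swapped afterwards (e.g. with hdfs dfs -mv):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compactCustomerActions").getOrCreate()

    # Read the fragmented dataset made of many small Parquet files.
    df = spark.read.parquet("/Staging/Mind/customerActions/")

    # Collapse it into a handful of larger files; 8 is an arbitrary example,
    # tune it so each output file lands roughly in the 128 MB - 1 GB range.
    compacted = df.repartition(8)

    # Write the compacted copy to a new directory, then swap it with the
    # original out of band, since Spark cannot overwrite its own input path.
    compacted.write.mode("overwrite").parquet("/Staging/Mind/customerActions_compacted/")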

夜灵血窟げ 2025-02-07 09:02:02

If you are OK with having multiple folders, you could use DataFrameWriter.partitionBy to group customerActions from, e.g., a certain week into one directory. I usually partition data by year/month/day, but any other criterion is also possible (see this example).

When you want to read the files, you could read just a subset of the data and/or parallelize reading, both of which should make it faster.
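
A sketch of what the partitioned write and a partition-pruned read could look like, assuming the paths from the question; the "eventDate" column and the year/month filter are made-up examples for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partitionedCustomerActions").getOrCreate()

    df = spark.sql("select * from customerActions")

    # Derive partition columns from a date column ("eventDate" is assumed here).
    df = (df
          .withColumn("year", F.year("eventDate"))
          .withColumn("month", F.month("eventDate"))
          .withColumn("day", F.dayofmonth("eventDate")))

    # Each daily append now lands in its own year=/month=/day= subdirectory.
    (df.write
       .mode("append")
       .partitionBy("year", "month", "day")
       .parquet("/Staging/Mind/customerActions/"))

    # A read that filters on the partition columns only scans the matching folders.
    january = (spark.read.parquet("/Staging/Mind/customerActions/")
                    .where("year = 2025 AND month = 1"))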
