How does Dask use RAM when writing Parquet files?

Posted on 2025-02-12 11:31:08


I am using Dask for two reasons:

(1) To reduce RAM usage compared with loading 100 million rows of data into a pandas DataFrame.
(2) To be able to analyze data that is larger than RAM. (Currently, I have 50 GB of RAM.)

I also need to save these 100 million rows of data to Parquet files.

Does Dask load the entire DataFrame into memory in order to write it to a Parquet file?
How memory-efficient is writing to a Parquet file?

Thanks in advance.

Here is the code I use to write a compressed Parquet file.

path_out = "/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1_from_DASK2.gzip"
dask_df.to_parquet(path_out, compression='gzip', write_metadata_file=False)
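
For context on the memory question, here is a rough back-of-envelope sketch rather than a measurement. It rests on the general behaviour that Dask's to_parquet writes one file per partition and processes partitions as separate tasks, so the data held in memory at any moment is roughly the partitions being written concurrently, not the whole DataFrame. The function name estimate_write_memory and the figures used (200 bytes per row, 200 partitions, 8 concurrent tasks) are hypothetical assumptions for illustration. The estimate ignores compression buffers, scheduler overhead, and any upstream data that is already held in memory (for example, a DataFrame built from an in-memory pandas object).

```python
import math


def estimate_write_memory(total_rows, bytes_per_row, npartitions, concurrent_tasks):
    """Rough in-memory size (bytes) of the partitions being written at once.

    Assumes evenly sized partitions and that each concurrently running
    task holds exactly one uncompressed partition in memory.
    """
    rows_per_partition = math.ceil(total_rows / npartitions)
    partition_bytes = rows_per_partition * bytes_per_row
    # No more partitions can be in flight than actually exist.
    in_flight = min(concurrent_tasks, npartitions)
    return partition_bytes * in_flight


# Hypothetical numbers: 100 million rows, ~200 bytes per row,
# 200 partitions, 8 tasks running at the same time.
peak = estimate_write_memory(100_000_000, 200, 200, 8)
print(f"{peak / 1e9:.1f} GB")  # prints "0.8 GB"
```

Under these assumptions, more and smaller partitions lower the per-task footprint, at the cost of producing more output files.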
