Dask Array.compute() peak memory in JupyterLab



I am working with Dask on a distributed cluster, and I noticed a peak in memory consumption when getting the results back to the local process.

My minimal example consists of instantiating the cluster and creating a simple array of ~1.6 GB with dask.array.arange.

I expected the memory consumption to be around the array size, but I observed a memory peak of around 3.2 GB.

Does Dask make any copy during the computation? Or does JupyterLab need to make a copy?

import dask.array
import dask_jobqueue
import distributed

cluster_conf = {
    "cores": 1,
    "log_directory": "/work/scratch/chevrir/dask-workspace",
    "walltime": '06:00:00',
    "memory": "5GB"
}

cluster = dask_jobqueue.PBSCluster(**cluster_conf)
cluster.scale(n=1)
client = distributed.Client(cluster)
client

# ~1.6 GB in memory (2e8 float64 elements × 8 bytes)
a = dask.array.arange(2e8)

%load_ext memory_profiler
%memit a.compute()
# peak memory: 3219.02 MiB, increment: 3064.36 MiB
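
As the answer below explains, the peak comes from the serialized chunk bytes and the assembled output array coexisting on the client. If the goal is simply to keep the client-side peak closer to the array size, one possible workaround (a sketch, not Dask's recommended pattern) is to pull the result back one chunk at a time into a preallocated NumPy array, at the cost of extra scheduler round-trips:

import numpy as np

# Fetch the result chunk by chunk instead of all at once, so the
# client holds the output array plus only one chunk's worth of
# transferred bytes at any time.
out = np.empty(a.shape, dtype=a.dtype)   # preallocate the ~1.6 GB output once
offset = 0
for i in range(a.numblocks[0]):          # iterate over the 1-D array's chunks
    piece = a.blocks[i].compute()        # only this chunk is in flight
    out[offset:offset + piece.shape[0]] = piece
    offset += piece.shape[0]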


Answer by 妄司 (2025-01-18 17:08:01):


What happens when you do compute():

  • the graph of your computation is constructed (this is small) and sent to the scheduler,
  • the scheduler gets workers to produce the pieces of the array, which should total about 1.6 GB on the workers,
  • the client constructs an empty array for the output you are asking for, knowing its type and size,
  • the client receives bunches of bytes across the network or IPC from each worker that holds pieces of the output; these are copied into the client's output array,
  • the complete array is returned to you.

You can see that the penultimate step here necessarily requires duplicating the data. The original byte buffers may eventually be garbage collected later.
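
To make the duplication concrete, here is a toy model of that penultimate step (the chunk list and its layout are made up for illustration; this is not Dask's actual protocol code). The byte payloads received from the workers and the preallocated output array are both alive while the copy happens, which is why the client briefly holds roughly twice the array size:

import numpy as np

# Toy model of the client-side assembly step: simulate two workers
# each sending the raw bytes of one chunk of the result.
received_chunks = [
    (0, 4, np.arange(0, 4, dtype="float64").tobytes()),
    (4, 8, np.arange(4, 8, dtype="float64").tobytes()),
]

out = np.empty(8, dtype="float64")  # the preallocated output array
for start, stop, payload in received_chunks:
    chunk = np.frombuffer(payload, dtype="float64")  # zero-copy view of the bytes
    out[start:stop] = chunk  # the actual copy: payload and out coexist here
# Only after this can the payload byte strings be garbage collected,
# letting memory drop back toward the array size.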
