Dask Array.compute() peak memory in JupyterLab
I am working with Dask on a distributed cluster, and I noticed a peak in memory consumption when getting the results back to the local process.
My minimal example consists of instantiating the cluster and creating a simple array of ~1.6 GB with dask.array.arange.
I expected the memory consumption to be around the array size, but I observed a memory peak of about 3.2 GB.
Is any copy made by Dask during the computation? Or does JupyterLab need to make a copy?
import dask.array
import dask_jobqueue
import distributed

# One single-core PBS worker with 5 GB of memory
cluster_conf = {
    "cores": 1,
    "log_directory": "/work/scratch/chevrir/dask-workspace",
    "walltime": "06:00:00",
    "memory": "5GB",
}
cluster = dask_jobqueue.PBSCluster(**cluster_conf)
cluster.scale(n=1)
client = distributed.Client(cluster)
client  # display the client / dashboard link in the notebook

# 2e8 float64 elements * 8 bytes ~= 1.6 GB in memory
a = dask.array.arange(2e8)

%load_ext memory_profiler
%memit a.compute()
# peak memory: 3219.02 MiB, increment: 3064.36 MiB
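For reference, the expected in-memory size of the result can be checked from the array's metadata before computing; a minimal check, assuming the same session as above:

# 2e8 float64 elements * 8 bytes each ~= 1.6 GB
print(a.dtype, a.nbytes / 1e9)  # float64 1.6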
What happens when you do compute(): the task graph is sent to the scheduler, the workers build the chunks of the array (about 1.6 GB in total), the chunks are serialised and shipped back to the client process as bytes, and the client then deserialises them and concatenates them into the single output array. That final assembly step necessarily requires duplication of data: for a moment the received byte buffers (~1.6 GB) and the output array (~1.6 GB) coexist in the client process, which is the ~3.2 GB peak you observe. The original bytes buffers may eventually be garbage collected later.
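The duplication can be reproduced locally without a cluster. The sketch below mimics the client-side assembly step described above; the chunking, names and sizes are illustrative assumptions, not Dask's actual transfer code:

import numpy as np

n_chunks = 10
chunk_len = int(2e8) // n_chunks

# Stand-ins for the serialised chunk payloads received from the workers
# (~1.6 GB of bytes in total; running this needs roughly 3.2 GB of RAM).
received = [
    np.arange(i * chunk_len, (i + 1) * chunk_len, dtype="float64").tobytes()
    for i in range(n_chunks)
]

# Rebuilding the chunk arrays is cheap (np.frombuffer returns a view on the
# bytes), but concatenating them allocates a fresh ~1.6 GB output array and
# copies the data into it, so buffers and result briefly coexist.
pieces = [np.frombuffer(buf, dtype="float64") for buf in received]
result = np.concatenate(pieces)

# Only after the references to the buffers are dropped can they be freed.
del received, pieces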