Why does a Dask dataframe take a long time to compute, regardless of the size of the dataframe?
How can I avoid this from happening, and what is the reason behind it?
EDIT:
I'm currently working on AWS SageMaker with an ml.c5.2xlarge instance type, and the data is in an S3 bucket.
I did not connect to a client because I was not able to. I got this error when I ran the client through a local cluster: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
So I proceeded without connecting to anything specific, and it is now on the default scheduler. (Cluster, Workers: 4, Cores: 8, Memory: 16.22 GB)
shape = df.shape
nrows = shape[0].compute()  # shape[0] is lazy; .compute() executes the full task graph
print("nrows", nrows)
print(df.npartitions)
I tried to perform compute on 24,700,000 records (~25M), with 23 partitions, and the time taken to execute was:
CPU times: user 4min 48s, sys: 12.9 s, total: 5min 1s
Wall time: 4min 46s
For nrows 5,120,000 (~5M), with 23 partitions, the time taken to execute was:
CPU times: user 4min 50s, sys: 12 s, total: 5min 2s
Wall time: 4min 46s
For nrows 7,697,351 (~7.7M), with 1 partition, the time taken was:
CPU times: user 5min 4s, sys: 10.6 s, total: 5min 14s
Wall time: 4min 52s
I performed the same operation in Pandas with 7,690,000 (~7.7M) rows, and the time taken to execute was:
CPU times: user 502 µs, sys: 0 ns, total: 502 µs
Wall time: 402 µs
The number of columns remains 5 in all the above cases.
I'm just trying to find the shape of the data, but in Dask, regardless of the type of operation, one compute operation takes roughly the same amount of time.
May I know the reason behind this, and what I need to do to avoid it and optimize the compute time?
In general, a given computation will have parts that can be distributed (parallelised) and parts that have to be done sequentially; see Amdahl's law. If a given algorithm has a large serial component, then the gains from distributing/scaling are going to be small.
Without knowing the specifics of your task graph, it's hard to say exactly what is causing the bottleneck, but more broadly there are several possible reasons for slow performance even with relatively small inputs: for example, re-reading the data from remote storage on every compute, scheduler and task-graph overhead, or serialisation costs between workers.
Reasons like these will still scale with the data, so this is not exactly an answer to your question ("regardless of the size of the dataframe"), but it might help.
To resolve (or avoid) this problem, one typically has to examine the algorithm/code, identify the performance bottlenecks, and figure out whether they can be parallelised or whether they are inherently serial.
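To make the Amdahl's law point concrete, here is a small numeric sketch; amdahl_speedup is an illustrative helper made up for this example, not a library function. The best possible speedup on n workers is 1 / (s + (1 - s)/n), where s is the serial fraction of the runtime.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when serial_fraction of the runtime cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# If 80% of the runtime is serial (e.g. a single-threaded read), 8 workers barely help:
print(round(amdahl_speedup(0.8, 8), 2))  # 1.21
# If only 10% is serial, the same 8 workers give a much larger gain:
print(round(amdahl_speedup(0.1, 8), 2))  # 4.71
```

This is why adding workers (or partitions) does little when the bottleneck is a serial step such as reading the input.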
The reason a Dask dataframe takes a long time to compute (shape or any other operation) is that when a compute operation is called, Dask executes every operation from the creation of the current dataframe (or its ancestors) up to the point where compute() is called.
In the scenario presented in the question, Dask reads the data from an S3 bucket, which takes a reasonably long time. So when compute is called (to find the shape, or for any other operation), Dask performs all the operations starting from reading the CSV file from S3, and that dominates the execution time.
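This replay-from-the-start behaviour can be sketched with dask.delayed, using an in-memory counter as a stand-in for the expensive S3 read (load_data and row_count are hypothetical names for illustration):

```python
import dask

calls = {"load": 0}

@dask.delayed
def load_data():
    calls["load"] += 1        # stand-in for the slow read_csv from S3
    return list(range(1000))

@dask.delayed
def row_count(data):
    return len(data)

n = row_count(load_data())

n.compute()           # executes load_data AND row_count
n.compute()           # executes the whole graph again, including the "S3 read"
print(calls["load"])  # 2 -- the load step ran once per compute()
```

Every call to compute() re-runs the full graph from its roots, so the expensive load is paid each time.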
compute() should be used sparingly, but in cases where compute operations have to be performed over and over again on the current dataframe or its descendants, persisting the dataframe helps. persist() stores the data in distributed memory, so Dask does not re-execute all the ancestor operations; it resumes from the point where the data was persisted.