Why does a Dask dataframe take a long time to compute, regardless of the size of the dataframe?
How can I avoid this from happening, and what is the reason behind it?
EDIT:
I'm currently working on AWS SageMaker with an ml.c5.2xlarge instance type, and the data is in an S3 bucket.
I did not connect to a client because I was not able to. I got this error when I ran the client through a local cluster: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
So I proceeded without connecting to anything specific, and it is now on the default scheduler. (Cluster, Workers: 4, Cores: 8, Memory: 16.22 GB)
shape = df.shape
nrows = shape[0].compute()  # shape[0] is lazy; .compute() executes the full task graph
print("nrows", nrows)
print(df.npartitions)
I tried to perform compute on 24,700,000 records (~25M), with 23 partitions, and the time taken to execute was:
CPU times: user 4min 48s, sys: 12.9 s, total: 5min 1s
Wall time: 4min 46s
For nrows 5,120,000 (~5M), with 23 partitions, the time taken to execute was:
CPU times: user 4min 50s, sys: 12 s, total: 5min 2s
Wall time: 4min 46s
For nrows 7,697,351 (~7.7M), with 1 partition, the time taken was:
CPU times: user 5min 4s, sys: 10.6 s, total: 5min 14s
Wall time: 4min 52s
I performed the same operation in Pandas with 7,690,000 (~7.7M) rows, and the time taken to execute was:
CPU times: user 502 µs, sys: 0 ns, total: 502 µs
Wall time: 402 µs
The number of columns remains 5 in all the above cases.
I'm just trying to find the shape of the data, but in Dask, regardless of the type of operation, one compute operation takes roughly the same amount of time.
May I know the reason behind this, and what I need to do to avoid it and optimize the compute time?
In general, a given computation will have parts that can be distributed (parallelised) and parts that have to be done sequentially; see Amdahl's law. If a given algorithm has a large serial component, then the gains from distributing/scaling are going to be small.
Without knowing the specifics of your task graph, it's hard to say exactly what is causing the bottleneck, but more broadly there are several possible reasons for slow performance even with relatively small inputs: for example, re-reading the data from remote storage on every compute, scheduler and task-graph overhead, or serialisation costs between workers.
Reasons like these will still scale with the data, so this is not exactly an answer to your question ("regardless of the size of the dataframe"), but it might help.
To resolve (or avoid) this problem, one typically has to examine the algorithm/code, identify the performance bottlenecks, and figure out whether they can be parallelised or whether they are inherently serial.
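To make the Amdahl's law point concrete, here is a small numeric sketch; amdahl_speedup is an illustrative helper made up for this example, not a library function. The best possible speedup on n workers is 1 / (s + (1 - s)/n), where s is the serial fraction of the runtime.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when serial_fraction of the runtime cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# If 80% of the runtime is serial (e.g. a single-threaded read), 8 workers barely help:
print(round(amdahl_speedup(0.8, 8), 2))  # 1.21
# If only 10% is serial, the same 8 workers give a much larger gain:
print(round(amdahl_speedup(0.1, 8), 2))  # 4.71
```

This is why adding workers (or partitions) does little when the bottleneck is a serial step such as reading the input.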
The reason a Dask dataframe takes a long time to compute (shape or any other operation) is that when a compute operation is called, Dask executes every operation from the creation of the current dataframe (or its ancestors) up to the point where compute() is called.
In the scenario presented in the question, Dask reads the data from an S3 bucket, which takes a reasonably long time. So when compute is called (to find the shape, or for any other operation), Dask performs all the operations starting from reading the CSV file from S3, and that dominates the execution time.
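This replay-from-the-start behaviour can be sketched with dask.delayed, using an in-memory counter as a stand-in for the expensive S3 read (load_data and row_count are hypothetical names for illustration):

```python
import dask

calls = {"load": 0}

@dask.delayed
def load_data():
    calls["load"] += 1        # stand-in for the slow read_csv from S3
    return list(range(1000))

@dask.delayed
def row_count(data):
    return len(data)

n = row_count(load_data())

n.compute()           # executes load_data AND row_count
n.compute()           # executes the whole graph again, including the "S3 read"
print(calls["load"])  # 2 -- the load step ran once per compute()
```

Every call to compute() re-runs the full graph from its roots, so the expensive load is paid each time.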
compute() should be used sparingly, but in cases where compute operations have to be performed over and over again on the current dataframe or its descendants, persisting the dataframe helps. persist() stores the data in distributed memory, so Dask does not re-execute all the ancestor operations; it resumes from the point where the data was persisted.