Does Dask compute store results?

Posted on 2025-01-18 13:57:49

Consider the following code

import dask
import dask.dataframe as dd
import pandas as pd

data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]}
df_pd     = pd.DataFrame(data_dict) 
df_dask   = dd.from_pandas(df_pd,npartitions=2)

df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute()

print('-'*80)
print(df_dask['data1x2'])
print('-'*80)
print(df_dask['data1x2'].compute())
print('-'*80)

What I can't figure out is: why is there a difference between the output of the first and second print? After all, I called compute when I applied the function and stored the result in df_dask['data1x2'].

Comments (2)

变身佩奇 2025-01-25 13:57:50

The first print will only show the lazy version of the dask series, df_dask["data1x2"]:

Dask Series Structure:
npartitions=2
0    int64
5      ...
9      ...
Name: data1x2, dtype: int64
Dask Name: getitem, 15 tasks

This shows the number of partitions, the index values (if known), the number of tasks that must run to produce the final result, and some other information. At this stage, dask has not computed the actual series, so the values inside this dask series are not known. Calling .compute launches the 15 tasks needed to get the actual values, and that is what is printed the second time.

仙女山的月亮 2025-01-25 13:57:50

Dask does store results in memory on the workers or scheduler. But that’s not what’s driving the differences in displayed results. The two are displayed differently because they are different types of objects.

df_dask['data1x2'] is a dask.dataframe.Series, which will only ever display a preview of the data structure and information about the number of tasks involved in calculating the values. Displaying any data requires at least moving data to the main thread, if not computation and possibly I/O, so dask will never do this unless explicitly asked to, e.g. with df.head().

df_dask['data1x2'].compute() is a pandas.Series. It no longer has anything to do with dask and is by definition in-memory. Since all pandas data structures are in memory, the data is displayed by default.

When you call compute on a dask object, the result is no longer a dask object. In this case, the first compute returns a pandas series. When you assign a pandas series to a dask data frame, dask partitions it and sends the data to the workers, and then can no longer display the whole series. So you have to call compute again if you'd like to see the series displayed.

Imagine how useful this would be if your whole data frame were too large to fit into memory, e.g. if you had 1000 columns and 10 million rows. This is what dask is designed for.
