Dask map_blocks runs earlier than expected, with bad results for overlap and nested procedures

Posted on 2025-01-21 19:47:44

I'm using Dask to create a simple data manipulation pipeline. I'm basically using 3 functions. The first two use a simple map_blocks, and the third one also uses map_blocks, but on overlapped data.

For some reason, the third map_blocks is executing earlier than I want. See the code and the IPython output below (produced without executing run()):

import numpy as np
import dask.array as da

data = np.arange(2000)

data_da = da.from_array(data, chunks=(500,))

def func1(block, block_info=None):
    return block + 1

def func2(block, block_info=None):
    return block * 2

def func3(block, block_info=None):
    print("func3", block_info)
    return block

data_da_1 = data_da.map_blocks(func1)

data_da_2 = data_da_1.map_blocks(func2)

data_da_over = da.overlap.overlap(data_da_2, depth=(1), boundary='periodic')

data_da_map = data_da_over.map_blocks(func3)

data_da_3 = da.overlap.trim_internal(data_da_map, {0: 1})

The output is:

func3 None
func3 None

It also does not respect the number of blocks, which is 4 here.

I really don't know what is wrong with this code, especially because if I use visualize() to see the task graph, it builds exactly the sequence I want.

Initially, I thought that overlap requires a compute() beforehand, like rechunk does, but I've already tested that too.

Comments (1)

深爱不及久伴 2025-01-28 19:47:44

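Like many Dask operations, da.overlap operations can either be passed a meta argument specifying the output type and dimensions, or Dask will execute the function on a small (or zero-length) subset of the data to infer them. The early func3 output is most likely that inference call rather than a real block being processed.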

From the dask.array.map_overlap docs:

Note that this function will attempt to automatically determine the output array type before computing it, please refer to the meta keyword argument in map_blocks if you expect that the function will not succeed when operating on 0-d arrays.
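As a sketch of how that advice could be applied to the pipeline in the question (assuming func3 returns an array of the same dtype and dimensionality as its input, and using a call counter added only for illustration), passing meta to map_blocks should skip the inference call, so func3 runs only on the real blocks at compute time:

```python
import numpy as np
import dask.array as da

calls = []  # illustrative counter: one entry per func3 invocation


def func3(block):
    calls.append(block.shape)
    return block


data_da = da.from_array(np.arange(2000), chunks=(500,))
data_da_over = da.overlap.overlap(data_da, depth=1, boundary="periodic")

# Assumption: the output looks like the input, so an empty array of the
# same dtype and number of dimensions is a valid meta.
meta = np.empty((0,), dtype=data_da_over.dtype)
data_da_map = data_da_over.map_blocks(func3, meta=meta)
calls_at_build_time = len(calls)

# The single-threaded scheduler keeps the call order deterministic.
result = data_da_map.compute(scheduler="synchronous")
calls_after_compute = len(calls)

print(calls_at_build_time, calls_after_compute)
```

If func3 changes the dtype or the number of dimensions, the meta array has to describe the output instead of the input.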
