Dask map_blocks runs earlier than expected, with bad results for overlap and nested procedures
I'm using Dask to build a simple data-manipulation pipeline with three functions. The first two use a plain map_blocks, and the third also uses map_blocks, but on overlapped data.
For some reason, the third map_blocks executes earlier than I want. See the code and the IPython output below (without executing run()):
import numpy as np
import dask.array as da

data = np.arange(2000)
data_da = da.from_array(data, chunks=(500,))  # 4 blocks of 500 elements

def func1(block, block_info=None):
    return block + 1

def func2(block, block_info=None):
    return block * 2

def func3(block, block_info=None):
    # Prints as soon as the function is called, even during graph construction
    print("func3", block_info)
    return block

data_da_1 = data_da.map_blocks(func1)
data_da_2 = data_da_1.map_blocks(func2)
data_da_over = da.overlap.overlap(data_da_2, depth=(1), boundary='periodic')
data_da_map = data_da_over.map_blocks(func3)
data_da_3 = da.overlap.trim_internal(data_da_map, {0: 1})
The output is:
func3 None
func3 None
It also does not respect the number of blocks, which is 4 here.
I really don't know what is wrong with this code, especially because when I use visualize() to inspect the task graph, it shows exactly the sequence I want.
Initially, I thought that overlap requires a compute() beforehand, as rechunk does, but I've already tested that too.
Comments (1)
Like many Dask operations, da.overlap operations can either be passed a meta argument specifying the output type and dimensions, or Dask will execute the function on a small (or zero-length) subset of the data. The dask.array.map_overlap docs describe this behavior.
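As an illustration of the meta option mentioned above (a sketch, not a quotation from the docs), the snippet below passes an empty NumPy array as meta to map_blocks and counts how often func3 is called. The calls list and the two counter variables are hypothetical names added for the demonstration, and the sketch assumes a Dask version whose map_blocks accepts the meta keyword.

```python
import numpy as np
import dask.array as da

calls = []  # hypothetical helper: records every block_info that func3 receives


def func3(block, block_info=None):
    calls.append(block_info)
    return block


data_da = da.from_array(np.arange(2000), chunks=(500,))
data_da_over = da.overlap.overlap(data_da, depth=1, boundary="periodic")

# Assumption: with an explicit `meta`, Dask has no need to call func3 on an
# empty array to infer the output type while the graph is being built.
meta = np.array((), dtype=data_da_over.dtype)
data_da_map = data_da_over.map_blocks(func3, meta=meta)
calls_at_build = len(calls)

# func3 now runs only here, once per block of the overlapped array.
data_da_map.compute()
calls_after_compute = len(calls)

print("calls while building the graph:", calls_at_build)
print("calls after compute():", calls_after_compute)
```

If passing meta is not convenient, another option that should work is to have func3 return immediately when block_info is None, so that the side effect (here, the print) only happens during the real computation.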