Iterating through the chunks of a dask array

Posted on 2025-01-24 14:11:46

I am trying to manually iterate through the chunks of a dask array, one by one, and apply my computation to each. I understand that a benefit of dask is that it can do the iteration for me, but my computation is failing (for reasons that I don't think are related to dask) and I want to iterate manually for the purpose of debugging. How would I do that?

I am imagining something like:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to, 
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)

Here the chunk that I get back would carry some information about which chunk I am in, and there would also be a way to get the data for that chunk.

Comments (3)

﹉夏雨初晴づ 2025-01-31 14:11:46

Access the data within each chunk using the arr.blocks property. The BlockView object has an array-like interface, but accessing an element in the BlockView array returns the selected chunk(s) in the original array:

In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>

In [13]: data.blocks.shape
Out[13]: (1, 10, 10)

In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14,  5, 24, ..., 25, 20,  6],
        [17, 12,  2, ..., 27, 13, 18],
        [13, 25,  2, ...,  7,  5, 22],
        ...,
        [12, 22, 26, ..., 15,  4, 11],
        [ 0, 26, 28, ..., 22, 14,  4],
        [ 9, 21, 14, ..., 15, 18, 21]],

       ...,

       [[ 3,  2, 20, ..., 27,  0, 12],
        [21, 17,  7, ..., 23,  3, 23],
        [24, 13,  0, ..., 26,  1,  0],
        ...,
        [ 5, 25,  6, ..., 22,  6, 16],
        [16, 25, 21, ..., 22, 14, 15],
        [ 8, 20, 17, ..., 29, 13,  1]]])

So in your case, you could loop through all blocks with the following:

In [33]: import itertools

In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
    ...:     chunk = data.blocks[inds]  # still a lazy dask array; call .compute() for NumPy data
    ...:     my_function(chunk)

This will be slow, but I think it does what you're looking for.

月牙弯弯 2025-01-31 14:11:46

Try using data.chunks instead of data.iterchunks().
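
Note that data.chunks is a tuple holding the block sizes along each axis, not an iterator over block data, so it has to be turned into slices before it can drive a loop. A minimal sketch of that conversion follows; chunk_slices is a hypothetical helper written for this example, not part of dask.

```python
import itertools

import dask.array as da

# Smaller than the array in the question, but chunked the same way.
data = da.random.randint(0, 30, size=(20, 40, 40), chunks=(-1, 10, 10))


def chunk_slices(chunks):
    """Yield (block index, tuple of slices) for every block described by `chunks`."""
    per_axis = []
    for sizes in chunks:
        # Running totals of the block sizes give the stop offset of each block.
        stops = list(itertools.accumulate(sizes))
        starts = [stop - size for stop, size in zip(stops, sizes)]
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])
    for index in itertools.product(*(range(len(axis)) for axis in per_axis)):
        yield index, tuple(axis[i] for axis, i in zip(per_axis, index))


visited = []
for index, slices in chunk_slices(data.chunks):
    # Slicing on block boundaries selects exactly one block.
    chunk_data = data[slices].compute()
    visited.append((index, slices, chunk_data.shape))
```

Each entry of visited records the block index, where the block sits in the full array, and the shape of the data that was computed for it.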

滥情哥ㄟ 2025-01-31 14:11:46

You can use da.map_blocks and avoid the for-loop:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
mapped_data = da.map_blocks(my_function, data)
# This is equivalent
mapped_data = data.map_blocks(my_function)
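
If the goal is debugging, map_blocks can still tell the function which block it is processing: a function that accepts a block_info keyword receives a dictionary describing the current block. The sketch below assumes a hypothetical my_function, passes dtype and meta explicitly so dask does not need to call the function to infer them, and runs on the synchronous scheduler so blocks are processed one at a time in the calling thread.

```python
import dask.array as da
import numpy as np

seen_locations = []


def my_function(block, block_info=None):
    # Hypothetical stand-in for the computation being debugged.
    # block_info[0] describes the first input array; the guard skips any
    # call in which dask did not supply a real block description.
    if isinstance(block_info, dict):
        seen_locations.append(block_info[0]["chunk-location"])
    return block + 1


# Smaller than the array in the question, but chunked the same way.
data = da.random.randint(0, 30, size=(20, 40, 40), chunks=(-1, 10, 10))

mapped_data = data.map_blocks(
    my_function,
    dtype=data.dtype,
    meta=np.array((), dtype=data.dtype),
)

# The synchronous scheduler runs every task in the current thread, which
# makes tracebacks and debuggers easier to work with.
result = mapped_data.compute(scheduler="synchronous")
```

After the compute call, seen_locations holds one block index per processed block, which can be logged or inspected to find the block that triggers the failure.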