Optimizing GPU allocation/transfer of matrix tiles

Posted 2025-01-22 11:40:39

I am working with very large matrices (>1GB) but imagine that I have the following matrix:

A = [1 1 2 2;
     1 1 2 2;
     3 3 4 4;
     3 3 4 4]

I need to pin each tile of this matrix so that the tiles can be transferred to the GPU asynchronously (using the CUDA.jl package).

The following code allocates the space of each tile in the GPU and it is working:

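using CUDA

function allocGPU!(gpu_buf, m, n)
    # Allocate raw device memory for an m-by-n Float64 tile
    dev_buf = CUDA.Mem.alloc(CUDA.Mem.DeviceBuffer, m*n*sizeof(Float64))
    dev_ptr = convert(CuPtr{Float64}, dev_buf)
    push!(gpu_buf, dev_buf)  # keep a reference to the raw buffer

    # Wrap the device pointer as a CuArray without copying
    tile_gpu = unsafe_wrap(CuArray{Float64}, dev_ptr, (m, n))

    return tile_gpu
end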

A_coor = [(1:2,1:2) (1:2, 3:4);
          (3:4,1:2) (3:4,3:4)]

m, n = 2, 2  # tile dimensions (2x2 in this example)
A_tiles = [A[A_coor[i,j][1], A_coor[i,j][2]] for i=1:size(A_coor,1), j=1:size(A_coor,2)]
gpu_buf = []
A_tiles_gpu = [allocGPU!(gpu_buf, m, n) for i=1:size(A_tiles,1), j=1:size(A_tiles,2)]

But this copies each tile into a new object, which takes more time than I would like. Is there any way to wrap a 2x2 Array around each tile in order to reduce the number of allocations?

I also tried this line:

A_tiles = [unsafe_wrap(Array{Float64}, pointer(A[A_coor[i,j][1], A_coor[i,j][2]]), (m,n)) for i=1:size(A_coor,1), j=1:size(A_coor,2)]

I also thought of pinning matrix A and then transferring each tile to the GPU with:

copyto!(tile_gpu, A[1:2,1:2])

but I'm guessing Julia will copy A[1:2,1:2] into a new object and then transfer the tile, yielding the same result as the first method.

Edit:

As I suspected, the call:

copyto!(tile_gpu, A[1:2,1:2])

creates a new object in a different memory location. I also tried the @view macro; although it works on the CPU, it doesn't seem to work with copyto! to GPU memory.
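
For reference, the copy-versus-view behaviour can be checked on the CPU alone. This is a minimal sketch, assuming a small Float64 stand-in for A; the names tile_copy, tile_view, same_memory and view_strides are made up for illustration, and no GPU transfer is attempted. It shows that A[1:2, 3:4] allocates a separate array, while @view refers to A's own memory but with A's column stride, so the tile is not one contiguous block. That non-contiguity may be related to why a plain copyto! to device memory does not accept the view, but that part is an assumption and is not verified here.

```julia
# Small Float64 stand-in for the large matrix in the question (assumption).
A = Float64[1 1 2 2;
            1 1 2 2;
            3 3 4 4;
            3 3 4 4]

tile_copy = A[1:2, 3:4]        # indexing with ranges allocates a new 2x2 Array
tile_view = @view A[1:2, 3:4]  # SubArray that refers to A's own memory

# Writing to the copy leaves A untouched; writing through the view changes A.
tile_copy[1, 1] = -1.0
tile_view[2, 2] = 9.0

# The view starts at the address of A[1, 3] ...
same_memory = pointer(tile_view) == pointer(A, LinearIndices(A)[1, 3])

# ... and keeps the parent's column stride (4 elements), not the tile's (2),
# so its two columns are not adjacent in memory.
view_strides = strides(tile_view)
```

After running this, A[1, 3] is still 2.0 (the copy was modified, not A) and A[2, 4] is 9.0 (the write went through the view).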
