PyCUDA - Passing a matrix by reference from Python to C++ CUDA code

Posted 2024-11-28 01:41:24

I have to write a PyCUDA function that takes two matrices, Nx3 and Mx3, and returns an NxM matrix, but I can't figure out how to pass a matrix by reference without knowing its number of columns.

My code is basically something like this:

#kernel declaration
mod = SourceModule("""
__global__ void distance(int N, int M, float d1[][3], float d2[][3], float res[][M])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    float x, y, z;
    x = d2[j][0]-d1[i][0];
    y = d2[j][1]-d1[i][1];
    z = d2[j][2]-d1[i][2];
    res[i][j] = x*x + y*y + z*z;
}
""")

#load data
data1 = numpy.loadtxt("data1.txt").astype(numpy.float32) # Nx3 matrix
data2 = numpy.loadtxt("data2.txt").astype(numpy.float32) # Mx3 matrix
N=data1.shape[0]
M=data2.shape[0]
res = numpy.zeros([N,M]).astype(numpy.float32) # NxM matrix

#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(cuda.In(numpy.int32(N)), cuda.In(numpy.int32(M)), cuda.In(data1), cuda.In(data2), cuda.Out(res), block=(N,M,1))

#save data
numpy.savetxt("results.txt", res)

Compiling this, I receive an error:

kernel.cu(3): error: a parameter is not allowed

That is, I cannot use M as the number of columns for res[][] in the declaration of the function, presumably because an array dimension in a function parameter list must be a compile-time constant, while M is only known at runtime. Nor can I leave the number of columns undeclared...

I need an NxM matrix as output, but I can't figure out how to do this. Can you help me?


Answered by 静若繁花, 2024-12-05 01:41:24

You should use pitched linear memory access inside the kernel; that is how ndarray and gpuarray store data internally, and PyCUDA will pass a pointer to the data in GPU memory allocated for a gpuarray when it is supplied as an argument to a PyCUDA kernel. So (if I understand what you are trying to do) your kernel should be written something like this:

__device__ unsigned int idx2d(int i, int j, int lda)
{
    return j + i*lda;
}

__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float x, y, z;
    x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
    y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
    z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];

    res[idx2d(i,j,M)] = x*x + y*y + z*z;  // res is NxM row major, so the leading dimension is M
}

Here I have assumed numpy's default row-major ordering in defining the idx2d helper function, so the leading dimension lda passed for the NxM result is the number of columns, M. There are still problems with the Python side of the code you posted, but I guess you know that already.
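
As a quick sanity check on that ordering assumption, here is a small numpy snippet (my addition, not part of the original code) confirming that a C-contiguous array flattens exactly the way idx2d indexes it:

import numpy as np

# Hypothetical 3x4 array purely for illustration.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
lda = a.shape[1]                          # leading dimension = number of columns
assert a.flags['C_CONTIGUOUS']            # numpy default: row-major storage
assert a[2, 1] == a.ravel()[1 + 2 * lda]  # matches idx2d(i, j, lda) = j + i*lda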


EDIT: Here is a complete working repro case based on the code posted in your question. Note that it only uses a single block (like the original), so be mindful of block and grid dimensions when trying to run it on anything other than trivially small cases; a multi-block sketch follows the listing.

import numpy as np
from pycuda import compiler, driver
from pycuda import autoinit

#kernel declaration
mod = compiler.SourceModule("""
__device__ unsigned int idx2d(int i, int j, int lda)
{
    return j + i*lda;
}

__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float x, y, z;
    x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
    y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
    z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];

    res[idx2d(i,j,M)] = x*x + y*y + z*z;  // res is NxM row major, so the leading dimension is M
}
""")

#make data
data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1,3)
data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1,3)
N=data1.shape[0]
M=data2.shape[0]
res = np.zeros([N,M]).astype(np.float32) # NxM matrix

#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2), \
        driver.Out(res), block=(N,M,1), grid=(1,1))

print(res)
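
For larger inputs, one reasonable way to extend this is shown below. This is a minimal sketch of my own, not the answer's code: it assumes a fixed 16x16 block, and the in-kernel bounds guard and the ceil-division grid arithmetic are additions that do not appear above.

import numpy as np
from pycuda import compiler, driver
import pycuda.autoinit  # creates the CUDA context as a side effect

mod = compiler.SourceModule("""
__device__ unsigned int idx2d(int i, int j, int lda)
{
    return j + i*lda;
}

__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if (i >= N || j >= M) return;  // guard: partial blocks at the grid edge
    float x = d2[idx2d(j,0,3)] - d1[idx2d(i,0,3)];
    float y = d2[idx2d(j,1,3)] - d1[idx2d(i,1,3)];
    float z = d2[idx2d(j,2,3)] - d1[idx2d(i,2,3)];
    res[idx2d(i,j,M)] = x*x + y*y + z*z;
}
""")

# Sizes chosen arbitrarily for illustration; any N and M work now.
data1 = np.random.uniform(size=(1000, 3)).astype(np.float32)
data2 = np.random.uniform(size=(500, 3)).astype(np.float32)
N, M = data1.shape[0], data2.shape[0]
res = np.zeros((N, M), dtype=np.float32)

# Fixed block; round the grid up so every (i, j) pair is covered.
block = (16, 16, 1)
grid = ((N + block[0] - 1) // block[0], (M + block[1] - 1) // block[1])

dist_gpu = mod.get_function("distance")
dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2),
         driver.Out(res), block=block, grid=grid)

# Cross-check against a numpy reference computed on the CPU.
ref = ((data1[:, None, :] - data2[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(res, ref, atol=1e-5)

Rounding the grid up with ceil division plus the in-kernel guard is the standard pattern for covering sizes that are not multiples of the block dimensions.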