PyCUDA: C/C++ includes?



Something that isn't really mentioned anywhere (at least that I can see) is what library functions are exposed to inline CUDA kernels.

Specifically, I'm doing small / stupid matrix multiplications that don't deserve to be individually offloaded to the GPU, but I am offloading a larger section of the algorithm which includes this multiplication. No one ever liked using their own linalg functions, since someone has always done it better.

TL;DR: What libraries can I play with in inline kernels under PyCUDA?
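
For context, a minimal sketch of the kind of inline kernel I mean, using PyCUDA's SourceModule (the kernel name and its contents are just placeholders):

import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The device code is passed as a plain string and compiled by nvcc at runtime.
mod = SourceModule("""
__global__ void step(float *data)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // ... larger section of the algorithm, including small matrix products ...
    data[i] *= 2.0f;
}
""")

step = mod.get_function("step")
data = np.arange(256, dtype=np.float32)
step(drv.InOut(data), block=(256, 1, 1), grid=(1, 1))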


牵你的手,一向走下去 2024-11-07 17:41:00


I don't know of any, and I always thought it would be useful to have.

For the size of problems that I usually work with (small matrices and tensors that arise in the finite element method), I just wrote C++ templates to do the operations. Templating the functions allows the compiler to know the trip counts at compile time, and it can unroll loops and keep results or intermediate results in register, which tends to be very efficient for kernel throughput. So the matrix-matrix product gets declared as

// idx2c is not defined in this answer; assumed here to be the usual
// column-major linear-index helper:
#define idx2c(i, j, ld) (((j) * (ld)) + (i))

template < typename Real, unsigned int l, unsigned int m, unsigned int n >
__device__ __host__
void matmul(const Real *a,
            const Real *b,
                  Real *c)
{
    // c (l x n) = a (l x m) * b (m x n); all extents are template
    // parameters, so every trip count is known at compile time
    for(int i=0; i<l; i++) {
        for(int j=0; j<n; j++) {
            Real dotprod = Real(0);
            for(int k=0; k<m; k++) {
                dotprod += a[idx2c(i,k,l)] * b[idx2c(k,j,m)];
            }
            c[idx2c(i,j,l)] = dotprod;
        }
    }
}

For the sort of sizes that crop up in my kernels (2x2, 3x3, 4x4, 8x8, 9x9), doing the above and letting the compiler work things out seems to be as good as any other approach I have tried. Because at the thread level CUDA is effectively scalar, there aren't any vector primitives or anything similar that can be used to accelerate these sorts of small operations.
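
A sketch of how such a template might be wired into PyCUDA (matmul3x3, the column-major idx2c definition, and the sizes are illustrative assumptions): SourceModule is given no_extern_c=True because templates cannot be declared with C linkage, so only the kernel itself is wrapped in extern "C" to keep its name unmangled for get_function.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
// assumed column-major indexing helper (not shown in the original answer)
#define idx2c(i, j, ld) (((j) * (ld)) + (i))

template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ __host__
void matmul(const Real *a, const Real *b, Real *c)
{
    for (int i = 0; i < l; i++)
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++)
                dotprod += a[idx2c(i, k, l)] * b[idx2c(k, j, m)];
            c[idx2c(i, j, l)] = dotprod;
        }
}

// only the kernel gets C linkage, so get_function sees an unmangled name
extern "C" __global__ void matmul3x3(const float *a, const float *b, float *c)
{
    // one independent 3x3 product per thread; trip counts are compile-time
    // constants, so the loops can be fully unrolled
    int t = threadIdx.x + blockIdx.x * blockDim.x;
    matmul<float, 3, 3, 3>(a + 9 * t, b + 9 * t, c + 9 * t);
}
""", no_extern_c=True)

matmul3x3 = mod.get_function("matmul3x3")

n = 256                                  # number of independent 3x3 products
# each row holds one 3x3 matrix flattened in column-major order, matching idx2c
a = np.random.rand(n, 9).astype(np.float32)
b = np.random.rand(n, 9).astype(np.float32)
c = np.empty_like(a)
matmul3x3(drv.In(a), drv.In(b), drv.Out(c), block=(n, 1, 1), grid=(1, 1))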
