cuda-内核优化

发布于 2024-11-28 20:50:19 字数 906 浏览 2 评论 0原文

我创建了一个简单的粒子系统。我有一台计算能力为 2.1 的设备。我可以改变什么来优化内核?

我假设变量 tPos 和 tVel 存储在寄存器中。

__global__ void particles_kernel(float4 *vbo, float4 *pos, float4 *vel)
{
     int tid = blockIdx.x * blockDim.x + threadIdx.x;

     float4 tPos = pos[tid];
     float4 tVel = vel[tid];

     tPos.x += tVel.x;
     tPos.y += tVel.y;
     tPos.z += tVel.z;

     if(tPos.x < -2.0f)
     {
         tVel.x = -tVel.x;
     }
     else if(tPos.x > 2.0f)
     {
         tVel.x = -tVel.x;
     }


     if(tPos.y < -2.0f)
     {
         tVel.y = -tVel.y;
     }
     else if(tPos.y > 2.0f)
     {
         tVel.y = -tVel.y;
     }


     if(tPos.z < -2.0f)
     {
         tVel.z = -tVel.z;
     }
     else if(tPos.z > 2.0f)
     {
         tVel.z = -tVel.z;
     }


     pos[tid] = tPos;
     vel[tid] = tVel;


     vbo[tid] = make_float4(tPos.x, tPos.y, tPos.z, tPos.w);
}

I created a simple particle system. I have a device with compute capability 2.1. What could I change to optimize the kernel?

I assume that variables tPos and tVel are stored in the registers.

__global__ void particles_kernel(float4 *vbo, float4 *pos, float4 *vel)
{
     int tid = blockIdx.x * blockDim.x + threadIdx.x;

     float4 tPos = pos[tid];
     float4 tVel = vel[tid];

     tPos.x += tVel.x;
     tPos.y += tVel.y;
     tPos.z += tVel.z;

     if(tPos.x < -2.0f)
     {
         tVel.x = -tVel.x;
     }
     else if(tPos.x > 2.0f)
     {
         tVel.x = -tVel.x;
     }


     if(tPos.y < -2.0f)
     {
         tVel.y = -tVel.y;
     }
     else if(tPos.y > 2.0f)
     {
         tVel.y = -tVel.y;
     }


     if(tPos.z < -2.0f)
     {
         tVel.z = -tVel.z;
     }
     else if(tPos.z > 2.0f)
     {
         tVel.z = -tVel.z;
     }


     pos[tid] = tPos;
     vel[tid] = tVel;


     vbo[tid] = make_float4(tPos.x, tPos.y, tPos.z, tPos.w);
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

萌面超妹 2024-12-05 20:50:20

除非我遗漏了一些东西,否则您的钳位代码可以像这样简化:

if (fabsf(tVel.x) > 2.0f) tVel.x = -tVel.x;
if (fabsf(tVel.y) > 2.0f) tVel.y = -tVel.y;
if (fabsf(tVel.z) > 2.0f) tVel.z = -tVel.z;

但是,考虑到计算量相对较小,这种更改可能不会提高性能,因为代码似乎受内存限制(您正在流式传输数据)。也许您的应用程序中的其他地方有额外的计算,您可以将其与此计算结合起来以增加计算密度?

Unless I am missing something, your clamping code can be simplified like this:

if (fabsf(tVel.x) > 2.0f) tVel.x = -tVel.x;
if (fabsf(tVel.y) > 2.0f) tVel.y = -tVel.y;
if (fabsf(tVel.z) > 2.0f) tVel.z = -tVel.z;

However given the relatively small amont of computation, this change will probably not improve performance as the code appears to be memory bound (you are streaming through the data). Maybe there is additional computation elsewhere in your app that you could combine with this computation to increase the computational density?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文