How to structure data in a CUDA application for best speed

Asked 2024-08-20 09:11:44


I am attempting to write a simple particle system that leverages CUDA to do the updating of the particle positions. Right now I am defining a particle as an object with a position defined by three float values, and a velocity also defined by three float values. When updating the particles, I am adding a constant value to the Y component of the velocity to simulate gravity, then adding the velocity to the current position to come up with the new position. In terms of memory management, is it better to maintain two separate arrays of floats to store the data, or to structure it in an object-oriented way? Something like this:

struct Vector
{
    float x, y, z;
};

struct Particle
{
    Vector position;
    Vector velocity;
};
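
To make the update concrete, here is a rough sketch of the update described above as a kernel with this layout (GRAVITY is just a placeholder constant):

__global__ void updateParticles(Particle* particles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float GRAVITY = -0.01f;         // placeholder constant simulating gravity
    particles[i].velocity.y += GRAVITY;   // add a constant value to the Y velocity
    particles[i].position.x += particles[i].velocity.x;   // then add velocity to position
    particles[i].position.y += particles[i].velocity.y;
    particles[i].position.z += particles[i].velocity.z;
}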

It seems like the size of the data is the same with either method (4 bytes per float, 3 floats per Vector, 2 Vectors per Particle, for 24 bytes total). It also seems like the OO approach would allow more efficient data transfer between the CPU and GPU, because I could use a single memory copy statement instead of 2 (and in the long run more, as there are a few other bits of information about particles that will become relevant, like age, lifetime, weight/mass, temperature, etc.). And then there's also just the simple readability of the code and ease of dealing with it that makes me inclined toward the OO approach. But the examples I have seen don't utilize structured data, so it makes me wonder if there's a reason.
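
For example, with the struct layout a single allocation and copy would move everything at once (just a sketch; h_particles and n stand for the host-side array and particle count):

Particle* d_particles;
cudaMalloc((void**)&d_particles, n * sizeof(Particle));
cudaMemcpy(d_particles, h_particles, n * sizeof(Particle), cudaMemcpyHostToDevice);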

So the question is which is better: individual arrays of data or structured objects?


Comments (1)

泅渡 2024-08-27 09:11:44


It's common in data parallel programming to talk about "Struct of Arrays" (SOA) versus "Array of Structs" (AOS): the struct-based layout in your example is AOS, while keeping a separate array for each component is SOA. Many parallel programming paradigms, in particular SIMD-style paradigms, will prefer SOA.

In GPU programming, the reason that SOA is typically preferred is to optimise the accesses to the global memory. You can view the recorded presentation on Advanced CUDA C from GTC last year for a detailed description of how the GPU accesses memory.

The main point is that memory transactions have a minimum size of 32 bytes and you want to maximise the efficiency of each transaction.

With AOS:

position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
//  ^ write to every third address                    ^ read from every third address
//                           ^ read from every third address

With SOA:

position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
//  ^ write to consecutive addresses                  ^ read from consecutive addresses
//                           ^ read from consecutive addresses

In the second case, reading from consecutive addresses means that you have 100% efficiency versus 33% in the first case. Note that on older GPUs (compute capability 1.0 and 1.1) the situation is much worse (13% efficiency).
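
For reference, one way to express an SOA layout for this particle data is a struct of per-component arrays (just a sketch along those lines, not code from the presentation):

// Sketch of an SOA layout: one array per component (names are illustrative).
struct ParticleSoA
{
    float *x, *y, *z;     // position components
    float *vx, *vy, *vz;  // velocity components
};

__global__ void updateSoA(ParticleSoA p, int n, float dt)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    p.vy[tid] += -9.8f * dt;      // placeholder gravity
    p.x[tid] += p.vx[tid] * dt;   // consecutive threads read/write consecutive addresses
    p.y[tid] += p.vy[tid] * dt;
    p.z[tid] += p.vz[tid] * dt;
}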

There is one other possibility - if you had two or four floats in the struct then you could read the AOS with 100% efficiency:

float4 lpos;
float4 lvel;
lpos = position[base + tid];
lvel = velocity[base + tid];
lpos.x += lvel.x * dt;
//...
position[base + tid] = lpos;

Again, check out the Advanced CUDA C presentation for the details.
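
One related detail: the built-in float4 type is 16-byte aligned, which is what allows each thread to load or store a whole element in a single vectorized transaction; a hand-rolled four-float struct needs the same alignment to get that behaviour (the struct below is purely illustrative):

// Illustrative only: a user-defined four-float struct needs explicit 16-byte
// alignment to be loaded/stored as a single vector transaction like float4.
struct __align__(16) Float4Like
{
    float x, y, z, w;   // e.g. position plus a spare slot for age or mass
};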
