CUDA linear interpolation using textures

Posted on 2024-12-21 12:54:05

I have a curve as follows:

float points[] = {1, 4, 6, 9, 14, 25, 69};
float images[] = {0.3, 0.4, 0.7, 0.9, 1, 2.5, 5.3};

To interpolate, say, f(3), I use linear interpolation between the points 1 and 4.
To interpolate, say, f(15), I run a binary search on the points array, get the lower bound (which is 25), and interpolate within the interval [14, 25], and so on.

I have found that this method makes my device function very slow. I've heard I can use texture memory and tex1D for this. Is that possible even if points[] is not uniform (that is, not incremented by a constant step)?

Any ideas?

Comments (1)

顾铮苏瑾 2024-12-28 12:54:05

It looks like this problem can be broken into two parts:

  1. Use the points array to convert the x value in f(x) to a floating point index between 0 and 6, since the arrays have 7 entries (requires a binary search on points[])
  2. Use that floating point index to get a linearly interpolated value from the images array
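The two steps above can be sketched as a pair of small functions. This is only a minimal sketch, not the asker's code: the names fractionalIndex, lerpAt and interpolate are hypothetical, points[] is assumed to be sorted in strictly increasing order, and queries outside the range are clamped to the end values (an assumption, since the question does not say what should happen there). The macro guard is only there so the same source builds with nvcc or with a plain host compiler.

```cpp
#include <cassert>

// Let the same source build with nvcc (device code) or a plain host compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Step 1: map x to a fractional index into points[].
// Assumes points[] is strictly increasing and n >= 2.
// The integer part is the left neighbour, the fraction is the position inside the interval.
__host__ __device__ inline float fractionalIndex(const float* points, int n, float x)
{
    assert(n >= 2);
    if (x <= points[0])     return 0.0f;          // clamp below the range (assumption)
    if (x >= points[n - 1]) return float(n - 1);  // clamp above the range (assumption)

    int lo = 0, hi = n - 1;                       // invariant: points[lo] <= x < points[hi]
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (points[mid] <= x) lo = mid; else hi = mid;
    }
    return lo + (x - points[lo]) / (points[lo + 1] - points[lo]);
}

// Step 2: linearly interpolate images[] at a fractional index in [0, n - 1].
__host__ __device__ inline float lerpAt(const float* images, int n, float idx)
{
    int i = (int)idx;
    if (i >= n - 1) return images[n - 1];
    float t = idx - i;
    return images[i] + t * (images[i + 1] - images[i]);
}

// Both steps together: f(x) for the curve described by points[] and images[].
__host__ __device__ inline float interpolate(const float* points, const float* images,
                                             int n, float x)
{
    return lerpAt(images, n, fractionalIndex(points, n, x));
}
```

With the arrays from the question, interpolate(points, images, 7, 15.0f) searches down to the interval [14, 25] and interpolates between images[4] and images[5]. Only lerpAt is something a texture fetch could replace; fractionalIndex is still a search.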

CUDA texture memory can make step 2 very fast. I am guessing, however, that most of the time in your kernel is spent on step 1, and I don't think texture memory can help you there.
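For what it's worth, step 2 on its own could look roughly like the sketch below, which uses the texture object API of the CUDA runtime. The names sampleImages and makeLinearTexture are hypothetical, error checking is left out, and the fractional indices are assumed to have been computed already by step 1. Two details are easy to miss: with linear filtering and unnormalized coordinates, element k sits at coordinate k + 0.5, and the hardware computes the interpolation weight in low-precision fixed point (8 fractional bits), so the result is less accurate than doing the lerp in float yourself.

```cuda
#include <cuda_runtime.h>

// Step 2 only: fracIdx[] holds fractional indices already produced by step 1.
__global__ void sampleImages(cudaTextureObject_t tex, const float* fracIdx,
                             float* out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // Texel k is centered at k + 0.5 when coordinates are unnormalized.
        out[i] = tex1D<float>(tex, fracIdx[i] + 0.5f);
    }
}

// Hypothetical helper: copies images[] into a 1D CUDA array and wraps it in a
// texture object with linear filtering. Linear filtering needs a CUDA array,
// it is not available for textures backed by plain linear memory.
cudaTextureObject_t makeLinearTexture(const float* hostImages, int n,
                                      cudaArray_t* arrayOut)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(arrayOut, &desc, n, 0);            // height 0 means a 1D array
    cudaMemcpy2DToArray(*arrayOut, 0, 0, hostImages,
                        n * sizeof(float),             // source pitch in bytes
                        n * sizeof(float), 1,          // width in bytes, one row
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arrayOut;

    cudaTextureDesc td = {};
    td.addressMode[0]   = cudaAddressModeClamp;        // clamp outside [0, n)
    td.filterMode       = cudaFilterModeLinear;        // hardware linear interpolation
    td.readMode         = cudaReadModeElementType;     // return the stored floats as-is
    td.normalizedCoords = 0;                           // coordinates are in texels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

A caller would build the texture once, launch something like sampleImages<<<blocks, threads>>>(tex, dFracIdx, dOut, count), and release the resources afterwards with cudaDestroyTextureObject and cudaFreeArray.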

If you aren't already taking advantage of shared memory, moving your arrays to shared memory will give you a much bigger speedup than using texture memory. There is 48 KB of shared memory on recent hardware, so if each of your arrays is smaller than 24 KB (6k float elements), both should fit in shared memory. Step 1 can benefit greatly from shared memory because the binary search requires non-contiguous reads of points[], which are very slow in global memory.
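A rough sketch of the shared-memory version is below. The kernel name interpolateShared and its parameters are hypothetical, points[] is assumed to be strictly increasing with n >= 2, both arrays are assumed to fit in the shared memory available to one block, and queries outside the range are clamped to the end points (an assumption).

```cuda
#include <cuda_runtime.h>

// Each block first stages points[] and images[] in shared memory, then its
// threads interpolate the queries xs[] in a grid-stride loop.
// Launch with 2 * n * sizeof(float) bytes of dynamic shared memory.
__global__ void interpolateShared(const float* points, const float* images, int n,
                                  const float* xs, float* ys, int count)
{
    extern __shared__ float smem[];
    float* sPoints = smem;        // first n floats
    float* sImages = smem + n;    // next n floats

    // Cooperative, coalesced copy from global to shared memory.
    for (int k = threadIdx.x; k < n; k += blockDim.x) {
        sPoints[k] = points[k];
        sImages[k] = images[k];
    }
    __syncthreads();

    for (int q = blockIdx.x * blockDim.x + threadIdx.x; q < count;
         q += gridDim.x * blockDim.x) {
        // Clamp the query into the covered range (assumption).
        float x = fminf(fmaxf(xs[q], sPoints[0]), sPoints[n - 1]);

        // Step 1: binary search, now reading only shared memory.
        int lo = 0, hi = n - 1;   // invariant: sPoints[lo] <= x <= sPoints[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (sPoints[mid] <= x) lo = mid; else hi = mid;
        }

        // Step 2: linear interpolation inside [sPoints[lo], sPoints[lo + 1]].
        float t = (x - sPoints[lo]) / (sPoints[lo + 1] - sPoints[lo]);
        ys[q] = sImages[lo] + t * (sImages[lo + 1] - sImages[lo]);
    }
}
```

A launch would look like interpolateShared<<<blocks, threads, 2 * n * sizeof(float)>>>(dPoints, dImages, n, dXs, dYs, count). The staging copy is repeated by every block, so this pays off when each block handles many queries.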

If your arrays don't fit in shared memory, you should break them up into equally sized pieces of 6k elements each and assign each piece to a block. Have each block read through all of the points you are interpolating, and have it ignore any point that does not fall within the portion of the points[] array stored in its shared memory.
