Variation of cache misses on a GPU
I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does something with the values, and stores the result back to an 8th global memory buffer. As I observed, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot as the input size increases, and I can't find the source of this variation. The input size here means the number of global work-items (a power of 2, and a multiple of the work-group size). The work-group size stays at 256.
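For context, this is roughly how such a sweep would be driven from the host: the work-group size is pinned at 256 and only the global size changes. A minimal sketch in C, assuming queue and kernel are a valid command queue and the built "histogram" kernel; the helper name run_sweep and the doubling sweep bounds are my assumptions, not from the original post:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Sweep the global work size at a fixed work-group size of 256.
       queue and kernel are assumed to be created and set up elsewhere. */
    static void run_sweep(cl_command_queue queue, cl_kernel kernel)
    {
        const size_t local = 256;  /* fixed work-group size */
        for (size_t global = 4096; global <= 2097152; global *= 2) {
            cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                                &global, &local, 0, NULL, NULL);
            if (err != CL_SUCCESS) {
                fprintf(stderr, "enqueue failed: %d\n", err);
                return;
            }
            clFinish(queue);  /* drain so the profiler attributes counters to one size */
        }
    }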
These are the results; they show the L1 cache miss ratio, starting from 4096 work-items (16 work-groups):
0.677125
0.55946875
0.345994792
0.054078125
0.436167969
0.431871745
0.938546224
0.959258789
0.952941406
0.955016479
The profiler says the kernel uses 18 registers per thread.
Here is the code (the function TTsum() is supposed to do just a bunch of dependent transcendental operations, so I guess it has nothing to do with caches):
float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
    float temp = 0;
    for (int j = 0; j < 2; j++)
        temp = temp + x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
    temp = sqrt(temp);
    temp = exp(temp);
    temp = temp / x1;
    temp = temp / (float)x2;
    for (int j = 0; j < 20; j++) temp = sqrt(temp);
    return temp;
}
__kernel void histogram(__global float* x1,
                        __global int*   x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
    int id = get_global_id(0);
    for (int j = 0; j < 1000; j++)
        y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}
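For scale: each work-item reads seven 4-byte values and writes one, so a launch of N work-items touches roughly 32·N bytes of distinct global data per pass of the j loop, i.e. 128 KB at 4096 work-items, and 64 MB at the largest size if the sweep doubles up to 2097152 work-items.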
Can someone explain the cache behavior? The experiments were done on a GTX 580.
Comments (1)
It's quite hard to compute histograms in CUDA. I believe the random access on y[] may very well be the cause of the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf
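The approach in the linked paper is to accumulate per-work-group sub-histograms in on-chip local memory and merge them into global memory once at the end, so the scattered updates never go straight to DRAM. A minimal OpenCL sketch of that idea (requires OpenCL 1.1 atomics); the kernel name, NUM_BINS, and the argument names are illustrative, not taken from the paper:

    #define NUM_BINS 256

    __kernel void local_histogram(__global const uchar* data,
                                  __global uint* global_hist,
                                  uint n)
    {
        __local uint partial_hist[NUM_BINS];  /* one sub-histogram per work-group */
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        /* zero the local bins cooperatively */
        for (int b = lid; b < NUM_BINS; b += lsz)
            partial_hist[b] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* scatter into fast local memory instead of global memory */
        for (uint i = get_global_id(0); i < n; i += get_global_size(0))
            atomic_inc(&partial_hist[data[i]]);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* merge this work-group's bins into the global result */
        for (int b = lid; b < NUM_BINS; b += lsz)
            atomic_add(&global_hist[b], partial_hist[b]);
    }

The point of the design is that the contended updates stay on-chip; only NUM_BINS global atomics per work-group remain at the end.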