Trying to intentionally lower the L1 d-cache hit ratio

Posted 2025-02-08 05:17:42

I'm trying to write code that intentionally has a very low L1 d-cache hit rate. Here is what I have:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define S 16*1024*1024
int largedata[S];

int main()
{
    struct timeval tv1;
    struct timeval tv2;

    for (int j = 0; j < 10; j++)
    {
        gettimeofday(&tv1, NULL);
        int total = 0;
        for (int i = 1; i < S; i++) {
            largedata[i] = largedata[rand()%S] + 1;
            total += largedata[i];
        }
        gettimeofday(&tv2, NULL);
        int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
        printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

    return 0;
}

It allocates a 64 MB buffer and accesses it at random. My machine's L1 d-cache is 32 KB, so I expected a very low cache hit ratio, but what I see is a miss rate of only 2.57%:

# gcc test.c &&    perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores  ./a.out
Round 0 elapsed 399706 us --> 26671607
Round 1 elapsed 344664 us --> 57210118
Round 2 elapsed 342444 us --> 79296375
Round 3 elapsed 344605 us --> 92293029
Round 4 elapsed 342173 us --> 93904234
Round 5 elapsed 346295 us --> 76478386
Round 6 elapsed 343390 us --> 98878844
Round 7 elapsed 347893 us --> 107286968
Round 8 elapsed 355442 us --> 87289283
Round 9 elapsed 362253 us --> 101320374

 Performance counter stats for './a.out':

     104128257      L1-dcache-load-misses     #    2.57% of all L1-dcache accesses
       2630683      L1-icache-load-misses
    4047961891      L1-dcache-loads
    2192632892      L1-dcache-stores

   3.539076619 seconds time elapsed

   3.479520000 seconds user
   0.032746000 seconds sys

I'd expect a much higher miss rate (something like 20~30%). Can anyone explain how this x86 (Xeon) CPU manages this?

BTW, my CPU: Intel(R) Xeon(R) Gold 6230

UPDATE

The xorshift generator and the optimization worked like a charm!

    /* x, y, z, w: uint32_t xorshift128 state (needs <stdint.h>);
       their declaration and seeding are not shown in this excerpt */
    for (int j = 0; j < 10; j++)
    {
        gettimeofday(&tv1, NULL);
        int total = 0;
        for (int i = 1; i < S; i++) {
            /* one xorshift128 step; the new random value ends up in w */
            uint32_t t = x;
            t ^= t << 11U;
            t ^= t >> 8U;
            x = y; y = z; z = w;
            w ^= w >> 19U;
            w ^= t;

            largedata[i] = largedata[w%S] + 1;
            total += largedata[i];
        }
        gettimeofday(&tv2, NULL);
        int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
        printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

This results in a 47% L1 d-cache load miss rate. I guess each call to rand() takes about 30 memory accesses, as Brendan mentioned.

 Performance counter stats for './a.out':

      87715381      L1-dcache-load-misses     #   47.35% of all L1-dcache accesses
       1064131      L1-icache-load-misses
     185235862      L1-dcache-loads
     177250983      L1-dcache-stores

   0.715463900 seconds time elapsed

   0.664886000 seconds user
   0.028505000 seconds sys
