试图故意降低L1-D-CACHE命中率

发布于 2025-02-08 05:17:42 字数 3074 浏览 1 评论 0原文

我正在尝试编写一个有意的代码，该代码的L1 D-CACHE命中率很低，它进行了：

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define S 16*1024*1024
int largedata[S];

int main()
{
    struct timeval tv1;
    struct timeval tv2;

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {
                    largedata[i] = largedata[rand()%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

    return 0;
}

它的作用是，它具有64MB的缓冲区和随机访问。我的机器的L1 D-CACHE尺寸为32KB，所以我预计缓存比率很低，但是我看到的是2.57％的MISS率，如下：

# gcc test.c &&    perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores  ./a.out
Round 0 elapsed 399706 us --> 26671607
Round 1 elapsed 344664 us --> 57210118
Round 2 elapsed 342444 us --> 79296375
Round 3 elapsed 344605 us --> 92293029
Round 4 elapsed 342173 us --> 93904234
Round 5 elapsed 346295 us --> 76478386
Round 6 elapsed 343390 us --> 98878844
Round 7 elapsed 347893 us --> 107286968
Round 8 elapsed 355442 us --> 87289283
Round 9 elapsed 362253 us --> 101320374

 Performance counter stats for './a.out':

     104128257      L1-dcache-load-misses     #    2.57% of all L1-dcache accesses
       2630683      L1-icache-load-misses
    4047961891      L1-dcache-loads
    2192632892      L1-dcache-stores

   3.539076619 seconds time elapsed

   3.479520000 seconds user
   0.032746000 seconds sys

我希望更高（例如20〜30％），任何人都可以说明这个X86（Xeon）如何做到这一点？

顺便说一句，我的CPU：Intel（R）Xeon（R）Gold 6230

更新

Xorshift和优化的效果就像魅力！

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {

                    uint32_t t = x;
                    t ^= t << 11U;
                    t ^= t >> 8U;
                    x = y; y = z; z = w;
                    w ^= w >> 19U;
                    w ^= t;

                    largedata[i] = largedata[w%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

这导致47％的高速缓存失误。我想打电话给兰德大约需要大约30个内存访问，就像布伦丹提到的那样。

 Performance counter stats for './a.out':

      87715381      L1-dcache-load-misses     #   47.35% of all L1-dcache accesses
       1064131      L1-icache-load-misses
     185235862      L1-dcache-loads
     177250983      L1-dcache-stores

   0.715463900 seconds time elapsed

   0.664886000 seconds user
   0.028505000 seconds sys

原文

I'm trying to write an intentional code that has very low L1 d-cache hit rate and here it goes:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define S 16*1024*1024
int largedata[S];

int main()
{
    struct timeval tv1;
    struct timeval tv2;

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {
                    largedata[i] = largedata[rand()%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

    return 0;
}

What it does is, it has 64MB of buffer and randomly accesses. The L1 D-cache size of my machine is 32KB so I expect very low cache-hit ratio but what I see is 2.57% miss-rate as following:

# gcc test.c &&    perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores  ./a.out
Round 0 elapsed 399706 us --> 26671607
Round 1 elapsed 344664 us --> 57210118
Round 2 elapsed 342444 us --> 79296375
Round 3 elapsed 344605 us --> 92293029
Round 4 elapsed 342173 us --> 93904234
Round 5 elapsed 346295 us --> 76478386
Round 6 elapsed 343390 us --> 98878844
Round 7 elapsed 347893 us --> 107286968
Round 8 elapsed 355442 us --> 87289283
Round 9 elapsed 362253 us --> 101320374

 Performance counter stats for './a.out':

     104128257      L1-dcache-load-misses     #    2.57% of all L1-dcache accesses
       2630683      L1-icache-load-misses
    4047961891      L1-dcache-loads
    2192632892      L1-dcache-stores

   3.539076619 seconds time elapsed

   3.479520000 seconds user
   0.032746000 seconds sys

I'd expect way more higher (like 20~30%) and can anyone explain how this X86 (xeon) do this magic?

BTW, my CPU: Intel(R) Xeon(R) Gold 6230

UPDATE

That xorshift and the optimization worked like a charm!

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {

                    uint32_t t = x;
                    t ^= t << 11U;
                    t ^= t >> 8U;
                    x = y; y = z; z = w;
                    w ^= w >> 19U;
                    w ^= t;

                    largedata[i] = largedata[w%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

Which results in 47% of cache miss. I guess calling rand takes about ~30 memory accesses like Brendan mentioned.

 Performance counter stats for './a.out':

      87715381      L1-dcache-load-misses     #   47.35% of all L1-dcache accesses
       1064131      L1-icache-load-misses
     185235862      L1-dcache-loads
     177250983      L1-dcache-stores

   0.715463900 seconds time elapsed

   0.664886000 seconds user
   0.028505000 seconds sys

分享到QQ

分享到微博