Trying to deliberately lower the L1 d-cache hit rate
I'm trying to write code that intentionally has a very low L1 d-cache hit rate. Here it is:
    #include <stdio.h>
    #include <sys/time.h>
    #include <stdlib.h>

    /* 16M ints = 64 MB. Not parenthesized, so rand()%S parses as (rand()%16)*1024*1024. */
    #define S 16*1024*1024

    int largedata[S];

    int main()
    {
        struct timeval tv1;
        struct timeval tv2;

        for (int j = 0; j < 10; j++)
        {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {
                /* load from a rand()-derived index, store to the current element */
                largedata[i] = largedata[rand()%S] + 1;
                total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
        }
        return 0;
    }
The program has a 64 MB buffer and accesses it randomly. The L1 d-cache of my machine is 32 KB, so I expected a very low cache hit ratio, but what I see is a 2.57% miss rate:
    # gcc test.c && perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores ./a.out
    Round 0 elapsed 399706 us --> 26671607
    Round 1 elapsed 344664 us --> 57210118
    Round 2 elapsed 342444 us --> 79296375
    Round 3 elapsed 344605 us --> 92293029
    Round 4 elapsed 342173 us --> 93904234
    Round 5 elapsed 346295 us --> 76478386
    Round 6 elapsed 343390 us --> 98878844
    Round 7 elapsed 347893 us --> 107286968
    Round 8 elapsed 355442 us --> 87289283
    Round 9 elapsed 362253 us --> 101320374

    Performance counter stats for './a.out':

         104128257      L1-dcache-load-misses     #  2.57% of all L1-dcache accesses
           2630683      L1-icache-load-misses
        4047961891      L1-dcache-loads
        2192632892      L1-dcache-stores

       3.539076619 seconds time elapsed

       3.479520000 seconds user
       0.032746000 seconds sys
I'd expect a much higher miss rate (like 20~30%). Can anyone explain how this x86 (Xeon) CPU does this magic?
BTW, my CPU: Intel(R) Xeon(R) Gold 6230
UPDATE
That xorshift and the optimization worked like a charm!
    /* x, y, z, w: uint32_t xorshift128 state, declared and seeded outside this fragment (needs <stdint.h>) */
    for (int j = 0; j < 10; j++)
    {
        gettimeofday(&tv1, NULL);
        int total = 0;
        for (int i = 1; i < S; i++) {
            /* one xorshift128 step; the new value of w is the random number */
            uint32_t t = x;
            t ^= t << 11U;
            t ^= t >> 8U;
            x = y; y = z; z = w;
            w ^= w >> 19U;
            w ^= t;
            largedata[i] = largedata[w%S] + 1;
            total += largedata[i];
        }
        gettimeofday(&tv2, NULL);
        int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
        printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }
The xorshift version results in a 47% L1 d-cache miss rate. I guess each call to rand takes roughly 30 memory accesses, as Brendan mentioned.
    Performance counter stats for './a.out':

          87715381      L1-dcache-load-misses     #  47.35% of all L1-dcache accesses
           1064131      L1-icache-load-misses
         185235862      L1-dcache-loads
         177250983      L1-dcache-stores

       0.715463900 seconds time elapsed

       0.664886000 seconds user
       0.028505000 seconds sys