Why does _mm_stream_ps produce L1/LL cache misses?
I'm trying to optimize a computation-intensive algorithm and am kind of stuck at a cache problem. I have a huge buffer which is written occasionally, at random positions, and read only once at the end of the application. Obviously, writing into the buffer produces lots of cache misses and, besides, pollutes the caches, which are afterwards needed again for computation. I tried to use non-temporal move intrinsics, but the cache misses (reported by valgrind and supported by runtime measurements) still occur. However, to further investigate non-temporal moves, I wrote the little test program you can see below. Sequential access, large buffer, only writes.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>

void tim(const char *name, void (*func)()) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_REALTIME, &t1);
    func();
    clock_gettime(CLOCK_REALTIME, &t2);
    printf("%s : %f s.\n", name, (t2.tv_sec - t1.tv_sec) + (float) (t2.tv_nsec - t1.tv_nsec) / 1000000000);
}

const int CACHE_LINE = 64;
const int FACTOR = 1024;

float *arr;
int length;

void func1() {
    for (int i = 0; i < length; i++) {
        arr[i] = 5.0f;
    }
}

void func2() {
    for (int i = 0; i < length; i += 4) {
        arr[i] = 5.0f;
        arr[i+1] = 5.0f;
        arr[i+2] = 5.0f;
        arr[i+3] = 5.0f;
    }
}

void func3() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for (int i = 0; i < length; i += 4) {
        _mm_stream_ps(&arr[i], buf);
    }
}

void func4() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for (int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i], buf);
        _mm_stream_ps(&arr[4], buf);
        _mm_stream_ps(&arr[8], buf);
        _mm_stream_ps(&arr[12], buf);
    }
}

int main() {
    length = CACHE_LINE * FACTOR * FACTOR;
    arr = malloc(length * sizeof(float));
    tim("func1", func1);
    free(arr);
    arr = malloc(length * sizeof(float));
    tim("func2", func2);
    free(arr);
    arr = malloc(length * sizeof(float));
    tim("func3", func3);
    free(arr);
    arr = malloc(length * sizeof(float));
    tim("func4", func4);
    free(arr);
    return 0;
}
Function 1 is the naive approach; function 2 uses loop unrolling. Function 3 uses movntps, which in fact was emitted in the assembly, at least when I checked at -O0. In function 4 I tried to issue several movntps instructions at once to help the CPU do its write combining. I compiled the code with gcc -g -lrt -std=gnu99 -OX -msse4.1 test.c, where X is one of [0..3]. The results are interesting, to say the least:
-O0
func1 : 0.407794 s.
func2 : 0.320891 s.
func3 : 0.161100 s.
func4 : 0.401755 s.
-O1
func1 : 0.194339 s.
func2 : 0.182536 s.
func3 : 0.101712 s.
func4 : 0.383367 s.
-O2
func1 : 0.108488 s.
func2 : 0.088826 s.
func3 : 0.101377 s.
func4 : 0.384106 s.
-O3
func1 : 0.078406 s.
func2 : 0.084927 s.
func3 : 0.102301 s.
func4 : 0.383366 s.
As you can see, _mm_stream_ps is a little faster than the others when the program is not optimized by gcc, but it clearly fails its purpose once gcc optimization is turned on. Valgrind still reports lots of cache write misses.
So, the questions are: Why do those (L1+LL) cache misses still occur even though I'm using NTA streaming instructions? And why is func4 in particular so slow? Can someone explain/speculate what is happening here?
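(One confound worth ruling out in a benchmark like this: on Linux, malloc typically only reserves address space, and the physical pages are faulted in on first touch, i.e. inside the timed function. A minimal sketch of pre-faulting the buffer before timing; the helper name is made up and not part of the original program:)

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: touch every page once so that first-touch
   page faults are paid before the timed run starts. */
static float *alloc_prefaulted(size_t n) {
    float *p = malloc(n * sizeof(float));
    if (p)
        memset(p, 0, n * sizeof(float)); /* faults in all pages */
    return p;
}
```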
A few issues with your benchmark:

- Memory pages are allocated not in malloc, but on first touch, inside your func* functions. The OS may also do some memory shuffling after a large amount of memory is allocated, so any benchmark performed just after the allocation may not be reliable.
- The compiler reloads the arr value from memory instead of using a register. This may cost some performance. The easiest way to avoid this aliasing is to copy arr and length into local variables and use only the locals to fill the array. There is plenty of well-known advice to avoid global variables; aliasing is one of the reasons.
- _mm_stream_ps works better if the array is aligned to 64 bytes. In your code no alignment is guaranteed (actually, malloc aligns it to 16 bytes). This optimization is noticeable only for short arrays.
- It is a good idea to call _mm_mfence after you have finished with _mm_stream_ps. This is needed for correctness, not for performance.
Shouldn't func4 be this:
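(Completing the truncated snippet: the original func4 stores to the fixed addresses &arr[4], &arr[8], &arr[12] on every iteration instead of offsetting them by i, so presumably the intended version is the following; the surrounding globals are the same as in the question's test program.)

```c
#include <smmintrin.h>

/* same globals as the question's test program */
float *arr;
int length;

void func4(void) {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for (int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i],      buf);
        _mm_stream_ps(&arr[i + 4],  buf);  /* was &arr[4]  */
        _mm_stream_ps(&arr[i + 8],  buf);  /* was &arr[8]  */
        _mm_stream_ps(&arr[i + 12], buf);  /* was &arr[12] */
    }
}
```

With the fixed offsets, func4 writes the whole buffer exactly once, like func3, instead of rewriting the first cache lines on every iteration while leaving most of the buffer untouched.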