Memory throughput of gather memory accesses

Posted on 2025-02-10 17:09:16

I am measuring memory throughput and runtime using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:

for (int i = 0; i < len; i+=8) {
    const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
    __m256i index_reg = _mm256_loadu_si256(indexes_2);
    __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
    sum = _mm256_add_epi32(sum, values);
}
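
As a reading aid, here is a scalar sketch of what the loop above computes. It is not the code that was measured: the names gather8_scalar and gather_sum_scalar are hypothetical, and the sketch assumes data_ptr and indexes_ptr point to 32-bit integers and that len is a multiple of 8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar model of one _mm256_i32gather_epi32(base, idx, 4) call:
// lane j reads the 32-bit value at byte offset idx[j] * 4 from base.
std::array<std::int32_t, 8> gather8_scalar(const std::int32_t* base,
                                           const std::int32_t* idx) {
    std::array<std::int32_t, 8> out{};
    const char* bytes = reinterpret_cast<const char*>(base);
    for (int lane = 0; lane < 8; ++lane) {
        const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(idx[lane]) * 4;
        std::memcpy(&out[lane], bytes + offset, sizeof(std::int32_t));
    }
    return out;
}

// Scalar model of the whole test loop: one running sum per lane,
// wrapping on overflow like _mm256_add_epi32.
std::array<std::uint32_t, 8> gather_sum_scalar(const std::int32_t* data_ptr,
                                               const std::int32_t* indexes_ptr,
                                               int len) {
    assert(len % 8 == 0);  // the original loop implicitly assumes this
    std::array<std::uint32_t, 8> sum{};
    for (int i = 0; i < len; i += 8) {
        const auto values = gather8_scalar(data_ptr, indexes_ptr + i);
        for (int lane = 0; lane < 8; ++lane) {
            sum[lane] += static_cast<std::uint32_t>(values[lane]);
        }
    }
    return sum;
}
```

Each iteration therefore issues eight independent 4-byte loads whose addresses are determined entirely by the index array.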

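I use the index array (specified through indexes_ptr) to change the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so every access misses the caches. Here are the possible values for indexes_ptr: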

sequential - 0, 1, 2, 3, etc.
stride 4 - 0, 4, 8, 12, etc.
stride 16 - 0, 16, 32, 48, etc.
stride 32
stride 64
stride 128
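
The post does not show how the index arrays are built, so the following is only one plausible sketch. The helper make_strided_indexes and its parameters are hypothetical, and wrapping with a modulo is an assumption made here to keep every index inside the data array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Builds len indexes 0, stride, 2*stride, ... (stride 1 is the sequential case).
// ASSUMPTION: indexes wrap around at data_len; the post does not say
// how out-of-range indexes are avoided.
std::vector<std::int32_t> make_strided_indexes(std::size_t len,
                                               std::size_t stride,
                                               std::size_t data_len) {
    assert(stride > 0 && data_len > 0);
    // Gather indexes are signed 32-bit, so every index must fit in int32_t.
    assert(data_len <=
           static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
    std::vector<std::int32_t> indexes(len);
    for (std::size_t i = 0; i < len; ++i) {
        indexes[i] = static_cast<std::int32_t>((i * stride) % data_len);
    }
    return indexes;
}
```

If 256 MB is read as 256 MiB of 4-byte values, data_len would be 67,108,864 elements.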

So, each _mm256_i32gather_epi32 loads 8 values. On my system, a cache line is 64 bytes, so:

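sequential touches one cache line
stride 4 touches two cache lines
stride 16 touches eight cache lines
stride 64 touches eight cache lines
stride 128 touches eight cache lines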
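
The cache-line counts above can be checked with a small calculation. This is a sketch under stated assumptions: the function name cache_lines_touched is hypothetical, elements are 4 bytes, and the data array starts on a cache-line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct cache lines hit by one 8-lane gather whose lanes
// read indexes first_index, first_index + stride, ..., first_index + 7 * stride.
// ASSUMPTION: the array base is aligned to line_bytes.
std::size_t cache_lines_touched(std::size_t first_index,
                                std::size_t stride,
                                std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < 8; ++lane) {
        const std::size_t byte_offset =
            (first_index + lane * stride) * sizeof(std::int32_t);
        lines.insert(byte_offset / line_bytes);
    }
    return lines.size();
}
```

For example, cache_lines_touched(0, 4) counts two lines, while strides 16, 32, 64 and 128 all count eight, so this count alone cannot distinguish the larger strides.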

My expectation was that strides 16, 64 and 128 would have similar runtimes and memory throughput. However, this is not the case. Here are the numbers:

sequential, 0.13 s, 16828.2607 MB/s
stride 4, 0.07 s, 17246.1914 MB/s
stride 16, 0.918406 s, 5205.1085 MB/s
stride 32, 1.650566 s, 4756.5279 MB/s
stride 64, 1.798604 s, 5440.2228 MB/s
stride 128, 2.186620 s, 4672.1329 MB/s

Since all of them access eight cache lines per instruction, where does the difference between strides 16, 32, 64 and 128 come from?

Original text

I am measuring memory throughput and runtimes using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:

for (int i = 0; i < len; i+=8) {
    const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
    __m256i index_reg = _mm256_loadu_si256(indexes_2);
    __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
    sum = _mm256_add_epi32(sum, values);
}

I use the index array (specified through indexes_ptr) to change the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so everything misses the caches. Here are possible values for indexes_ptr:

sequential - 0, 1, 2, 3, etc
stride 4 - 0, 4, 8, 12
stride 16 - 0, 16, 32, 48, etc
stride 32
stride 64
stride 128

So, the intrinsic _mm256_i32gather_epi32 will load 8 values. In my system, the size of a cache line is 64 bytes, so:

sequential touches one cache line
stride 4 touches two cache lines
stride 16 touches eight cache lines
stride 64 touches eight cache lines
stride 128 touches eight cache lines

My expectation is that strides 16, 64 and 128 will have similar runtimes and memory throughput. This is however not the case. Here are the numbers: