Memory throughput of memory accesses
I am measuring memory throughput and runtime using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:
for (int i = 0; i < len; i += 8) {
    // Load 8 consecutive 32-bit indexes (unaligned load).
    const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
    __m256i index_reg = _mm256_loadu_si256(indexes_2);
    // Gather 8 ints from data_ptr at those indexes (scale 4 = sizeof(int)).
    __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
    sum = _mm256_add_epi32(sum, values);
}
I use the index array (specified through indexes_ptr) to change the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so everything misses the caches. Here are the possible values for indexes_ptr:
- sequential - 0, 1, 2, 3, etc
- stride 4 - 0, 4, 8, 12
- stride 16 - 0, 16, 32, 48, etc
- stride 32
- stride 64
- stride 128
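For illustration, index arrays like the ones listed above could be generated as in the following sketch. The helper name make_strided_indexes and the wrap-around at the end of the data array are assumptions of mine, not something taken from the benchmark itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds `count` 32-bit indexes 0, stride, 2*stride, ...
// Wrapping around at `data_len` elements is an assumption made for this
// sketch so that every index stays inside the data array.
std::vector<std::int32_t> make_strided_indexes(std::size_t count,
                                               std::size_t stride,
                                               std::size_t data_len) {
    assert(stride > 0 && data_len > 0);
    std::vector<std::int32_t> indexes(count);
    for (std::size_t i = 0; i < count; ++i) {
        indexes[i] = static_cast<std::int32_t>((i * stride) % data_len);
    }
    return indexes;
}
```

With stride 1 this yields the sequential pattern, and with stride 4 it yields 0, 4, 8, 12 and so on.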
So, the intrinsic _mm256_i32gather_epi32 will load 8 values. On my system, the size of a cache line is 64 bytes, so:
- sequential touches one cache line
- stride 4 touches two cache lines
- stride 16 touches eight cache lines
- stride 64 touches eight cache lines
- stride 128 touches eight cache lines
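The cache line counts above can be checked with a small sketch like the one below. It assumes 4-byte elements, 64-byte cache lines, and a data array whose base address is aligned to a cache line. The function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size
constexpr std::size_t kLanes = 8;            // 8 x 32-bit lanes per gather

// Counts the distinct cache lines touched by one 8-lane gather, assuming
// the data array starts on a cache line boundary.
std::size_t cache_lines_touched(const std::array<std::int32_t, kLanes>& indexes) {
    std::set<std::size_t> lines;
    for (std::int32_t idx : indexes) {
        assert(idx >= 0);
        std::size_t byte_offset = static_cast<std::size_t>(idx) * sizeof(std::int32_t);
        lines.insert(byte_offset / kCacheLineBytes);
    }
    return lines.size();
}

// Indexes used by the first gather for a given stride: 0, stride, 2*stride, ...
std::array<std::int32_t, kLanes> first_gather_indexes(std::int32_t stride) {
    std::array<std::int32_t, kLanes> indexes{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        indexes[lane] = static_cast<std::int32_t>(lane) * stride;
    }
    return indexes;
}
```

Calling cache_lines_touched(first_gather_indexes(s)) gives 1 for s = 1, 2 for s = 4, and 8 for s = 16, 64 and 128, matching the list.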
My expectation is that strides 16, 64 and 128 will have similar runtimes and memory throughput. This is, however, not the case. Here are the numbers:
- sequential, 0.13 s, 16828.2607 MB/s
- stride 4, 0.07 s, 17246.1914 MB/s
- stride 16, 0.918406 s, 5205.1085 MB/s
- stride 32, 1.650566 s, 4756.5279 MB/s
- stride 64, 1.798604 s, 5440.2228 MB/s
- stride 128, 2.186620 s, 4672.1329 MB/s
Where does the difference between strides 16, 32, 64 and 128 come from, since they all access exactly 8 cache lines with each instruction?