Erroneous single-threaded memory bandwidth benchmark
In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.
Code (for the Intel compiler)
#include <omp.h>
#include <iostream> // std::cout
#include <limits>   // std::numeric_limits
#include <cstdlib>  // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random>   // std::mt19937

int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};

    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////

    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }

        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }

        // deallocate resources
        free(data);
    }

    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }

        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        const auto dur1 = omp_get_wtime() * 1E+6;

        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }

        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }

        // deallocate resources
        free(data);
    }

    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
              << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;

    return 0;
}
Notes on code
- Assumed to be a 'naive' approach; also Linux-only. It should still serve as a rough indicator of model performance
- using ICC with the compiler flags -O3 -ffast-math -march=coffeelake
- size (150 MiB) is much bigger than the lowest-level cache of the system (9 MiB on the i5-8400 Coffee Lake, https://www.intel.ca/content/www/ca/en/products/sku/126687/intel-core-i58400-processor-9m-cache-up-to-4-00-ghz/specifications.html), with 2x 16 GiB DIMM DDR4 3200 MT/s
- new allocations on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits); an explicit cache-flush alternative is sketched after this list
- minimum latency is recorded to counteract the effects of interrupts and OS scheduling: threads being taken off cores for a short while, etc.
- a warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature, which can alternatively be turned off by using the userspace governor).
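As a side note on the cache-invalidation point above: instead of paying for a fresh allocation on every iteration, the previous buffer could also be flushed explicitly. The sketch below assumes an x86 target with the usual intrinsics header; flush_buffer is a hypothetical helper and is not part of the benchmark above.

#include <immintrin.h> // _mm_clflush, _mm_mfence
#include <cstddef>

// Evict every cache line backing 'data' so a following timed pass cannot hit in cache.
void flush_buffer(const double* data, std::size_t size)
{
    constexpr std::size_t cache_line = 64; // cache-line size in bytes on Coffee Lake
    const auto* bytes = reinterpret_cast<const char*>(data);
    for (std::size_t offset = 0; offset < size * sizeof(double); offset += cache_line)
    {
        _mm_clflush(bytes + offset); // flush this line from all cache levels
    }
    _mm_mfence(); // ensure the flushes have completed before timing starts
}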
Results of code
On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question, Intel Advisor's bandwidth information, where a previous version of this code was getting page faults inside the timed region.)
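For reference, this is how the REPORT line turns the measured duration into the figure above; the minimum duration used here is back-computed from the reported 90 GB/s and is purely illustrative:

#include <cstdio>

int main()
{
    constexpr double traffic_bytes   = 150.0 * 1024 * 1024 * 2; // 1x load + 1x store assumed over the 150 MiB buffer
    constexpr double min_duration_us = 3495.0;                  // hypothetical minimum duration in microseconds
    // bytes per microsecond equals MB/s (decimal); the factor 1E-3 converts to GB/s (decimal, not GiB/s)
    constexpr double bandwidth_GBps  = traffic_bytes / min_duration_us * 1E-3;
    std::printf("%.1f GB/s\n", bandwidth_GBps); // prints roughly 90.0
    return 0;
}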
Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE
I am not able to understand how I am getting such an unreasonably high bandwidth figure. Any tips to help facilitate my understanding would be greatly appreciated.
1 Answer
Actually, the question seems to be, "why is the obtained bandwidth so high?", to which I have gotten quite a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.
I can still offer an auxiliary 'answer' on the topic of interest. By replacing the write operation (which, as I now understand, cannot be properly modeled in a benchmark without delving into the assembly) with a cheap operation, e.g. a bitwise one, we can prevent the compiler from doing its job a little too well.
Updated code
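A minimal sketch of the idea, under the assumption that the buffer is viewed as 64-bit integers (the function name and the constant are illustrative; this is not the exact updated listing):

#include <omp.h>
#include <cstdint>
#include <cstddef>

// Timed kernel with the plain store replaced by a cheap bitwise read-modify-write,
// forcing 1x load + 1x store per element instead of a pure-store pass.
double timed_pass_us(std::uint64_t* data, std::size_t count)
{
    const double t0 = omp_get_wtime();
    #pragma omp simd safelen(8)
    for (std::size_t index = 0; index < count; ++index)
    {
        data[index] ^= 0x5555555555555555ULL; // negligible ALU cost per element
    }
    const double t1 = omp_get_wtime();
    return (t1 - t0) * 1E+6; // duration in microseconds, as in the original code
}

For the multi-core numbers, the same loop can be run with the #pragma omp parallel for simd safelen(8) schedule(static) variant that is commented out in the question's code.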
The benchmark remains a 'naive' one and shall only serve as an indicator of the model's performance (as opposed to a program which can exactly calculate the memory bandwidth).
With the updated code, I get 24 GiB/s for a single thread and 37 GiB/s when all 6 cores get involved. Compared to Intel Advisor's measured values of 25.5 GiB/s and 37.5 GiB/s, I think this is acceptable.
@PeterCordes I have retained the warm-up loop so that one exactly identical run of the whole procedure is done beforehand, to counteract unknown effects (healthy programmer's paranoia).
Edit: In this case, the warm-up loop is indeed redundant, because the minimum duration is being clocked.