Measuring memory bandwidth on a ccNUMA system
I'm attempting to benchmark the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
- 24 cores @ 2.70 GHz,
- L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using Intel Advisor's Roofline plot, which depicts the bandwidth of each data path available to the CPU. According to this, the bandwidth is 230 GB/s.
In order to benchmark this, I'm using my own little benchmark helper tool, which performs timed experiments in a loop. The API offers an abstract class called experiment_functor which looks like this:
class experiment_functor
{
public:
    //+/////////////////
    // main functionality
    //+/////////////////
    virtual void init() = 0;
    virtual void* data(const std::size_t&) = 0;
    virtual void perform_experiment() = 0;
    virtual void finish() = 0;
};
The user (myself) can then define the data initialization, the work to be timed (i.e. the experiment), and the clean-up routine, so that freshly allocated data can be used for each experiment. An instance of the derived class can be provided to the API function:
perf_stats perform_experiments(experiment_functor& exp_fn, const std::size_t& data_size_in_byte, const std::size_t& exp_count)
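Roughly speaking, the driver's call protocol looks like the following sketch. This is simplified and not the actual implementation: perf_stats is shown here only as a placeholder, and the real tool's timing and statistics details are omitted.

#include <chrono>
#include <cstddef>
#include <vector>

// placeholder stand-in for the real perf_stats
struct perf_stats
{
    std::vector<double> durations_s; // one wall-clock duration (seconds) per experiment
};

perf_stats perform_experiments(experiment_functor& exp_fn,
                               const std::size_t& data_size_in_byte,
                               const std::size_t& exp_count)
{
    perf_stats stats;
    stats.durations_s.reserve(exp_count);
    for (auto exp = std::size_t{}; exp < exp_count; ++exp)
    {
        exp_fn.init(); // fresh allocation + first-touch for every experiment
        const auto start = std::chrono::steady_clock::now();
        exp_fn.perform_experiment(); // the timed kernel
        const auto stop = std::chrono::steady_clock::now();
        exp_fn.finish(); // release the data again
        stats.durations_s.push_back(std::chrono::duration<double>(stop - start).count());
    }
    // The occasional random scatter-writes via data(), mentioned further below, are
    // omitted here; data_size_in_byte is unused in this stripped-down sketch.
    (void)data_size_in_byte;
    return stats;
}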
Here's the implementation of the class for the Schönauer vector triad:
class exp_fn : public experiment_functor
{
    //+/////////////////
    // members
    //+/////////////////
    const std::size_t data_size_;
    double* vec_a_ = nullptr;
    double* vec_b_ = nullptr;
    double* vec_c_ = nullptr;
    double* vec_d_ = nullptr;

public:
    //+/////////////////
    // lifecycle
    //+/////////////////
    exp_fn(const std::size_t& data_size)
        : data_size_(data_size) {}

    //+/////////////////
    // main functionality
    //+/////////////////
    void init() final
    {
        // allocate
        const auto page_size = sysconf(_SC_PAGESIZE) / sizeof(double);
        posix_memalign(reinterpret_cast<void**>(&vec_a_), page_size, data_size_ * sizeof(double));
        posix_memalign(reinterpret_cast<void**>(&vec_b_), page_size, data_size_ * sizeof(double));
        posix_memalign(reinterpret_cast<void**>(&vec_c_), page_size, data_size_ * sizeof(double));
        posix_memalign(reinterpret_cast<void**>(&vec_d_), page_size, data_size_ * sizeof(double));
        if (vec_a_ == nullptr || vec_b_ == nullptr || vec_c_ == nullptr || vec_d_ == nullptr)
        {
            std::cerr << "Fatal error, failed to allocate memory." << std::endl;
            std::abort();
        }

        // apply first-touch
        #pragma omp parallel for schedule(static)
        for (auto index = std::size_t{}; index < data_size_; index += page_size)
        {
            vec_a_[index] = 0.0;
            vec_b_[index] = 0.0;
            vec_c_[index] = 0.0;
            vec_d_[index] = 0.0;
        }
    }

    void* data(const std::size_t&) final
    {
        return reinterpret_cast<void*>(vec_d_);
    }

    void perform_experiment() final
    {
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < data_size_; ++index)
        {
            vec_d_[index] = vec_a_[index] + vec_b_[index] * vec_c_[index]; // fp_count: 2, traffic: 4+1
        }
    }

    void finish() final
    {
        std::free(vec_a_);
        std::free(vec_b_);
        std::free(vec_c_);
        std::free(vec_d_);
    }
};
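A driving program then looks roughly like this. It is again a sketch: it reuses the placeholder perf_stats from above (the real tool computes its statistics internally), and it derives a bandwidth figure from the 4+1 doubles of traffic per iteration noted in the kernel comment.

#include <algorithm>
#include <iostream>

int main()
{
    const auto element_count = std::size_t{1} << 28; // ~2 GiB per vector, far larger than the caches
    exp_fn triad(element_count);
    const auto stats = perform_experiments(triad, element_count * sizeof(double), 20);

    // Schönauer triad traffic: 4 doubles read + 1 double written per iteration
    const auto bytes_per_experiment = 5.0 * sizeof(double) * static_cast<double>(element_count);
    const auto best_time_s = *std::min_element(stats.durations_s.begin(), stats.durations_s.end());
    std::cout << "best bandwidth: " << bytes_per_experiment / best_time_s * 1.0e-9 << " GB/s\n";
    return 0;
}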
Note: The function data serves a special purpose in that it tries to cancel out the effects of NUMA balancing. Every so often, in a random iteration, the function perform_experiments writes, in a random fashion and using all threads, to the data provided by this function.
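Conceptually, that occasional scatter-write pass amounts to something like the following. This is only a sketch: how often perform_experiments triggers it, and the exact access pattern it uses, are not shown here.

#include <random>

// Scatter random writes over the data returned by data(), using all threads, so that
// automatic NUMA balancing does not silently migrate the pages between experiments.
void scatter_touch(void* data, const std::size_t size_in_byte)
{
    auto* const values = static_cast<double*>(data);
    const auto element_count = size_in_byte / sizeof(double);
    #pragma omp parallel
    {
        std::mt19937_64 engine{std::random_device{}()};
        std::uniform_int_distribution<std::size_t> pick{0, element_count - 1};
        #pragma omp for schedule(static)
        for (auto touch = std::size_t{}; touch < element_count / 100; ++touch)
        {
            values[pick(engine)] = 0.0;
        }
    }
}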
Question: Using this, I consistently get a maximum bandwidth of 201 GB/s. Why am I unable to reach the stated 230 GB/s?
I am happy to provide any extra information if needed. Thanks very much in advance for your answers.
Update:
Following the suggestions made by @VictorEijkhout, I've now conducted a strong scaling experiment for the read-only bandwidth.
As you can see, the peak bandwidth is indeed 217 GB/s on average, with a maximum of 225 GB/s. It is still very puzzling that, at a certain point, adding CPUs actually reduces the effective bandwidth.
Answer:

Bandwidth performance depends on the type of operation you do. For a mix of reads and writes you will indeed not get the peak number; if you only do reads you will get closer.

I suggest you read the documentation for the STREAM benchmark and take a look at the posted numbers.

Further notes: I hope you tie your threads down with OMP_PROC_BIND? Also, your architecture runs out of bandwidth before it runs out of cores, so your optimal bandwidth performance may happen with fewer than the total number of cores.
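To make the read-only comparison concrete: such a measurement essentially boils down to a reduction kernel along the following lines, run with the threads pinned via the OpenMP environment, e.g. OMP_PROC_BIND=spread together with OMP_PLACES=cores. This is an illustrative sketch, not the exact kernel behind the numbers in the update above.

// Read-only bandwidth kernel: every element is loaded, nothing is written back.
// The reduction result must be consumed so the compiler cannot discard the loads.
double read_only_kernel(const double* vec, const std::size_t element_count)
{
    auto sum = 0.0;
    #pragma omp parallel for simd reduction(+ : sum) schedule(static)
    for (auto index = std::size_t{}; index < element_count; ++index)
    {
        sum += vec[index];
    }
    return sum;
}

With no store stream, all memory traffic consists of reads, which is the regime in which the quoted roofline value is approached most closely.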