Why does the speedup from OpenMP parallelization drop off beyond a certain workload size?
I'm trying to get into OpenMP and wrote up a small piece of code to get a feel for what to expect in terms of speedup:
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>
void SingleThreaded(std::vector<float> &weights, int size)
{
auto totalWeight = 0.0f;
for (int index = 0; index < size; index++)
{
totalWeight += weights[index];
}
for (int index = 0; index < size; index++)
{
weights[index] /= totalWeight;
}
}
void MultiThreaded(std::vector<float> &weights, int size)
{
auto totalWeight = 0.0f;
#pragma omp parallel shared(weights, size, totalWeight) default(none)
{
// clang-format off
#pragma omp for reduction(+ : totalWeight)
// clang-format on
for (int index = 0; index < size; index++)
{
totalWeight += weights[index];
}
#pragma omp for
for (int index = 0; index < size; index++)
{
weights[index] /= totalWeight;
}
}
}
float TimeIt(std::function<void(void)> function)
{
auto startTime = std::chrono::high_resolution_clock::now();
function();
auto endTime = std::chrono::high_resolution_clock::now();
std::chrono::duration<float> duration = endTime - startTime;
return duration.count();
}
int main(int argc, char *argv[])
{
std::vector<float> weights(1 << 24);
std::srand(std::random_device{}());
std::generate(weights.begin(), weights.end(), []()
{ return std::rand() / static_cast<float>(RAND_MAX); });
for (int size = 1; size <= weights.size(); size <<= 1)
{
auto singleThreadedDuration = TimeIt(std::bind(SingleThreaded, std::ref(weights), size));
auto multiThreadedDuration = TimeIt(std::bind(MultiThreaded, std::ref(weights), size));
std::cout << "Size: " << size << std::endl;
std::cout << "Speedup: " << singleThreadedDuration / multiThreadedDuration << std::endl;
}
}
I compiled and ran the above code with MinGW g++ on Win10 like so:
g++ -O3 -static -fopenmp OpenMP.cpp; ./a.exe
The output (see below) shows a maximum speedup of around 4.2 at a vector size of 524288, i.e. the multi-threaded code ran about 4.2 times faster than the single-threaded code at that size.
Size: 1
Speedup: 0.00614035
Size: 2
Speedup: 0.00138696
Size: 4
Speedup: 0.00264201
Size: 8
Speedup: 0.00324149
Size: 16
Speedup: 0.00316957
Size: 32
Speedup: 0.00315457
Size: 64
Speedup: 0.00297177
Size: 128
Speedup: 0.00569801
Size: 256
Speedup: 0.00596125
Size: 512
Speedup: 0.00979021
Size: 1024
Speedup: 0.019943
Size: 2048
Speedup: 0.0317662
Size: 4096
Speedup: 0.181818
Size: 8192
Speedup: 0.133713
Size: 16384
Speedup: 0.216568
Size: 32768
Speedup: 0.566396
Size: 65536
Speedup: 1.10169
Size: 131072
Speedup: 1.99395
Size: 262144
Speedup: 3.4772
Size: 524288
Speedup: 4.20111
Size: 1048576
Speedup: 2.82819
Size: 2097152
Speedup: 3.98878
Size: 4194304
Speedup: 4.00481
Size: 8388608
Speedup: 2.91028
Size: 16777216
Speedup: 3.85507
So my questions are:
- Why is the multi-threaded code slower for a smaller vector size? Is it purely because of the overhead of creating the threads and distributing the work or am I doing something wrong?
- Why does the speedup I get decrease after a certain size?
- What would be the best case speedup I could theoretically achieve on the CPU I used (i7 7700k)?
- Does the distinction between physical CPU cores and logical CPU cores matter in terms of speedup?
- Did I make any blatant mistakes in my code? Can I improve something?
Comments (1)
Benchmark the two functions you have against cblas_sasum and cblas_sscal from a good BLAS implementation. It's quite possible that you're leaving a lot of single-thread performance on the table at the moment.