Why does the speedup from OpenMP parallelization drop after a certain workload size?

I'm trying to get into OpenMP and wrote up a small piece of code to get a feel for what to expect in terms of speedup:

#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>
#include <cstdlib> // std::srand, std::rand

void SingleThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    for (int index = 0; index < size; index++)
    {
        totalWeight += weights[index];
    }

    for (int index = 0; index < size; index++)
    {
        weights[index] /= totalWeight;
    }
}

void MultiThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    // One parallel region: both worksharing loops below reuse the same team
    // of threads instead of spawning twice.
#pragma omp parallel shared(weights, size, totalWeight) default(none)
    {
        // clang-format off
#pragma omp for reduction(+ : totalWeight)
        // clang-format on
        for (int index = 0; index < size; index++)
        {
            totalWeight += weights[index];
        }

        // The implicit barrier at the end of the reduction loop guarantees
        // that totalWeight is final before the division loop starts.
#pragma omp for
        for (int index = 0; index < size; index++)
        {
            weights[index] /= totalWeight;
        }
    }
}

float TimeIt(std::function<void(void)> function)
{
    auto startTime = std::chrono::high_resolution_clock::now();
    function();
    auto endTime = std::chrono::high_resolution_clock::now();
    std::chrono::duration<float> duration = endTime - startTime;

    return duration.count();
}

int main(int argc, char *argv[])
{
    std::vector<float> weights(1 << 24);
    std::srand(std::random_device{}());
    std::generate(weights.begin(), weights.end(), []()
                  { return std::rand() / static_cast<float>(RAND_MAX); });

    for (int size = 1; size <= static_cast<int>(weights.size()); size <<= 1)
    {
        auto singleThreadedDuration = TimeIt(std::bind(SingleThreaded, std::ref(weights), size));
        auto multiThreadedDuration = TimeIt(std::bind(MultiThreaded, std::ref(weights), size));

        std::cout << "Size: " << size << std::endl;
        std::cout << "Speed up: " << singleThreadedDuration / multiThreadedDuration << std::endl;
    }
}

I compiled and ran the above code with MinGW g++ on Win10 like so:

g++ -O3 -static -fopenmp OpenMP.cpp; ./a.exe

The output (see below) shows a maximum speedup of around 4.2 at a vector size of 524288. That means that the multi-threaded code ran 4.2 times faster than the single-threaded code for a vector size of 524288.

Size: 1
Speedup: 0.00614035
Size: 2
Speedup: 0.00138696
Size: 4
Speedup: 0.00264201
Size: 8
Speedup: 0.00324149
Size: 16
Speedup: 0.00316957
Size: 32
Speedup: 0.00315457
Size: 64
Speedup: 0.00297177
Size: 128
Speedup: 0.00569801
Size: 256
Speedup: 0.00596125
Size: 512
Speedup: 0.00979021
Size: 1024
Speedup: 0.019943
Size: 2048
Speedup: 0.0317662
Size: 4096
Speedup: 0.181818
Size: 8192
Speedup: 0.133713
Size: 16384
Speedup: 0.216568
Size: 32768
Speedup: 0.566396
Size: 65536
Speedup: 1.10169
Size: 131072
Speedup: 1.99395
Size: 262144
Speedup: 3.4772
Size: 524288
Speedup: 4.20111
Size: 1048576
Speedup: 2.82819
Size: 2097152
Speedup: 3.98878
Size: 4194304
Speedup: 4.00481
Size: 8388608
Speedup: 2.91028
Size: 16777216
Speedup: 3.85507

So my questions are:

  1. Why is the multi-threaded code slower for a smaller vector size? Is it purely because of the overhead of creating the threads and distributing the work or am I doing something wrong?
  2. Why does the speedup I get decrease after a certain size?
  3. What would be the best-case speedup I could theoretically achieve on the CPU I used (an i7-7700K)?
  4. Does the distinction between physical CPU cores and logical CPU cores matter in terms of speedup?
  5. Did I make any blatant mistakes in my code? Can I improve something?

1 Answer

Anonymous (2025-02-05 14:22:53):

  1. I agree with your theory; it's likely the overhead of setting things up (the first sketch after this list shows one way to time around that overhead).
  2. While the CPU cores on your processor have their own L1 and L2 caches, they all share one 8 MB L3 cache, and once the vector becomes too big to fit into that L3 cache, the threads risk evicting each other's data from the cache. As a rough check: at 4 bytes per float, 2^21 = 2097152 elements already fill the entire 8 MB L3, which is roughly where your measured speedup stops climbing and starts fluctuating.
  3. I assume that by "logical core" you mean a hyperthread? Those cannot actually compute in parallel; they can merely "fill in" while the other thread is, e.g., blocked waiting for memory. In cache-effective, compute-bound code that can limit their potential for parallelism considerably (the second sketch below lets you compare thread counts directly).
  4. I don't know to what extent your compiler vectorizes the code it compiles; I would benchmark the two functions you have against a fully vectorized implementation, e.g. using cblas_sasum and cblas_sscal from a good BLAS implementation (the third sketch below). It's quite possible that you're leaving a lot of single-threaded performance on the table at the moment.
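
One way to measure around the startup cost mentioned in point 1: do a warm-up call (with most OpenMP runtimes the first parallel region pays for creating the thread team), then repeat the measurement and keep the minimum. This is only a minimal sketch, not code from the question; TimeItMin is a made-up name, and omp_get_wtime is the OpenMP runtime's own timer.

#include <omp.h>

#include <algorithm>
#include <functional>

// Time a callable: one untimed warm-up, then the best of several runs,
// which filters out thread-pool startup and OS scheduling noise.
float TimeItMin(const std::function<void(void)> &function, int repetitions = 5)
{
    function(); // warm-up: absorbs the one-time thread creation cost

    double best = 1e30;
    for (int i = 0; i < repetitions; i++)
    {
        double start = omp_get_wtime();
        function();
        best = std::min(best, omp_get_wtime() - start);
    }

    return static_cast<float>(best);
}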
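
For point 3, the physical-versus-logical question is easy to probe empirically: pin the thread count with omp_set_num_threads (standard OpenMP) and compare 4 threads, one per physical core on an i7-7700K, against the default 8. A self-contained sketch under that assumption; the loop body is arbitrary stand-in work, not the normalization from the question:

#include <omp.h>

#include <cstdio>

int main()
{
    const int threadCounts[] = {1, 2, 4, 8};

    for (int threads : threadCounts)
    {
        omp_set_num_threads(threads); // applies to subsequent parallel regions

        double start = omp_get_wtime();
        double sum = 0.0;

#pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < (1 << 24); i++)
        {
            sum += 1.0 / (i + 1); // arbitrary compute-bound stand-in work
        }

        std::printf("threads=%d time=%fs (checksum %f)\n",
                    threads, omp_get_wtime() - start, sum);
    }
}

If going from 4 to 8 threads buys little or nothing, the hyperthreads are not helping for this kind of workload.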
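
And for point 4, a sketch of the suggested BLAS comparison, assuming a CBLAS implementation such as OpenBLAS is installed and linked (e.g. with -lopenblas). cblas_sasum returns the sum of absolute values, which here equals the plain sum because the weights are non-negative, and cblas_sscal scales the vector in place:

#include <cblas.h>

#include <vector>

void NormalizeBlas(std::vector<float> &weights, int size)
{
    // Sum of |weights[i]|; identical to the plain sum since all weights are in [0, 1].
    float totalWeight = cblas_sasum(size, weights.data(), 1);

    // weights[i] *= 1.0f / totalWeight, in place.
    cblas_sscal(size, 1.0f / totalWeight, weights.data(), 1);
}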