Why does the speedup from OpenMP parallelization drop after a certain workload size?

I'm trying to get into OpenMP and wrote up a small piece of code to get a feel for what to expect in terms of speedup:

#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>
#include <cstdlib> // std::srand, std::rand

void SingleThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    for (int index = 0; index < size; index++)
    {
        totalWeight += weights[index];
    }

    for (int index = 0; index < size; index++)
    {
        weights[index] /= totalWeight;
    }
}

void MultiThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    // One parallel region: both worksharing loops below reuse the same team
    // of threads instead of spawning twice.
#pragma omp parallel shared(weights, size, totalWeight) default(none)
    {
        // clang-format off
#pragma omp for reduction(+ : totalWeight)
        // clang-format on
        for (int index = 0; index < size; index++)
        {
            totalWeight += weights[index];
        }

        // The implicit barrier at the end of the reduction loop guarantees
        // that totalWeight is final before the division loop starts.
#pragma omp for
        for (int index = 0; index < size; index++)
        {
            weights[index] /= totalWeight;
        }
    }
}

float TimeIt(std::function<void(void)> function)
{
    auto startTime = std::chrono::high_resolution_clock::now();
    function();
    auto endTime = std::chrono::high_resolution_clock::now();
    std::chrono::duration<float> duration = endTime - startTime;

    return duration.count();
}

int main(int argc, char *argv[])
{
    std::vector<float> weights(1 << 24);
    std::srand(std::random_device{}());
    std::generate(weights.begin(), weights.end(), []()
                  { return std::rand() / static_cast<float>(RAND_MAX); });

    for (int size = 1; size <= static_cast<int>(weights.size()); size <<= 1)
    {
        auto singleThreadedDuration = TimeIt(std::bind(SingleThreaded, std::ref(weights), size));
        auto multiThreadedDuration = TimeIt(std::bind(MultiThreaded, std::ref(weights), size));

        std::cout << "Size: " << size << std::endl;
        std::cout << "Speed up: " << singleThreadedDuration / multiThreadedDuration << std::endl;
    }
}

I compiled and ran the above code with MinGW g++ on Win10 like so:

g++ -O3 -static -fopenmp OpenMP.cpp; ./a.exe

The output (see below) shows a maximum speedup of around 4.2 at a vector size of 524288. That means that the multi-threaded code ran 4.2 times faster than the single-threaded code for a vector size of 524288.

Size: 1
Speedup: 0.00614035
Size: 2
Speedup: 0.00138696
Size: 4
Speedup: 0.00264201
Size: 8
Speedup: 0.00324149
Size: 16
Speedup: 0.00316957
Size: 32
Speedup: 0.00315457
Size: 64
Speedup: 0.00297177
Size: 128
Speedup: 0.00569801
Size: 256
Speedup: 0.00596125
Size: 512
Speedup: 0.00979021
Size: 1024
Speedup: 0.019943
Size: 2048
Speedup: 0.0317662
Size: 4096
Speedup: 0.181818
Size: 8192
Speedup: 0.133713
Size: 16384
Speedup: 0.216568
Size: 32768
Speedup: 0.566396
Size: 65536
Speedup: 1.10169
Size: 131072
Speedup: 1.99395
Size: 262144
Speedup: 3.4772
Size: 524288
Speedup: 4.20111
Size: 1048576
Speedup: 2.82819
Size: 2097152
Speedup: 3.98878
Size: 4194304
Speedup: 4.00481
Size: 8388608
Speedup: 2.91028
Size: 16777216
Speedup: 3.85507

So my questions are:

  1. Why is the multi-threaded code slower for a smaller vector size? Is it purely because of the overhead of creating the threads and distributing the work or am I doing something wrong?
  2. Why does the speedup I get decrease after a certain size?
  3. What would be the best-case speedup I could theoretically achieve on the CPU I used (an i7-7700K)?
  4. Does the distinction between physical CPU cores and logical CPU cores matter in terms of speedup?
  5. Did I make any blatant mistakes in my code? Can I improve something?

1 Answer

Anonymous (2025-02-05 14:22:53):

  1. I agree with your theory; it's likely the overhead of setting things up (the first sketch after this list shows one way to time around that overhead).
  2. While the CPU cores on your processor have their own L1 and L2 caches, they all share one 8 MB L3 cache, and once the vector becomes too big to fit into that L3 cache, the threads risk evicting each other's data from the cache. As a rough check: at 4 bytes per float, 2^21 = 2097152 elements already fill the entire 8 MB L3, which is roughly where your measured speedup stops climbing and starts fluctuating.
  3. I assume that by "logical core" you mean a hyperthread? Those cannot actually compute in parallel; they can merely "fill in" while the other thread is, e.g., blocked waiting for memory. In cache-effective, compute-bound code that can limit their potential for parallelism considerably (the second sketch below lets you compare thread counts directly).
  4. I don't know to what extent your compiler vectorizes the code it compiles; I would benchmark the two functions you have against a fully vectorized implementation, e.g. using cblas_sasum and cblas_sscal from a good BLAS implementation (the third sketch below). It's quite possible that you're leaving a lot of single-threaded performance on the table at the moment.
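
One way to measure around the startup cost mentioned in point 1: do a warm-up call (with most OpenMP runtimes the first parallel region pays for creating the thread team), then repeat the measurement and keep the minimum. This is only a minimal sketch, not code from the question; TimeItMin is a made-up name, and omp_get_wtime is the OpenMP runtime's own timer.

#include <omp.h>

#include <algorithm>
#include <functional>

// Time a callable: one untimed warm-up, then the best of several runs,
// which filters out thread-pool startup and OS scheduling noise.
float TimeItMin(const std::function<void(void)> &function, int repetitions = 5)
{
    function(); // warm-up: absorbs the one-time thread creation cost

    double best = 1e30;
    for (int i = 0; i < repetitions; i++)
    {
        double start = omp_get_wtime();
        function();
        best = std::min(best, omp_get_wtime() - start);
    }

    return static_cast<float>(best);
}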
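
For point 3, the physical-versus-logical question is easy to probe empirically: pin the thread count with omp_set_num_threads (standard OpenMP) and compare 4 threads, one per physical core on an i7-7700K, against the default 8. A self-contained sketch under that assumption; the loop body is arbitrary stand-in work, not the normalization from the question:

#include <omp.h>

#include <cstdio>

int main()
{
    const int threadCounts[] = {1, 2, 4, 8};

    for (int threads : threadCounts)
    {
        omp_set_num_threads(threads); // applies to subsequent parallel regions

        double start = omp_get_wtime();
        double sum = 0.0;

#pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < (1 << 24); i++)
        {
            sum += 1.0 / (i + 1); // arbitrary compute-bound stand-in work
        }

        std::printf("threads=%d time=%fs (checksum %f)\n",
                    threads, omp_get_wtime() - start, sum);
    }
}

If going from 4 to 8 threads buys little or nothing, the hyperthreads are not helping for this kind of workload.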
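
And for point 4, a sketch of the suggested BLAS comparison, assuming a CBLAS implementation such as OpenBLAS is installed and linked (e.g. with -lopenblas). cblas_sasum returns the sum of absolute values, which here equals the plain sum because the weights are non-negative, and cblas_sscal scales the vector in place:

#include <cblas.h>

#include <vector>

void NormalizeBlas(std::vector<float> &weights, int size)
{
    // Sum of |weights[i]|; identical to the plain sum since all weights are in [0, 1].
    float totalWeight = cblas_sasum(size, weights.data(), 1);

    // weights[i] *= 1.0f / totalWeight, in place.
    cblas_sscal(size, 1.0f / totalWeight, weights.data(), 1);
}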