Measuring OpenMP Fork/Join Latency
Since MPI-3 comes with functionality for shared-memory parallelism, and it seems to be a perfect match for my application, I am seriously considering rewriting my hybrid OpenMP-MPI code into a pure MPI implementation.
In order to drive the last nail into the coffin, I decided to run a small program to test the latency of the OpenMP fork/join mechanism. Here is the code (written for the Intel compiler):
#include <cmath>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>
#include <omp.h>
// timer::spot_timer is the author's own timing utility (definition not shown)

void action1(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
}
}
void action2(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
}
}
void action3(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = t2.data()[index] * t2.data()[index];
}
}
void action4(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::sqrt(t2.data()[index]);
}
}
void action5(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = t2.data()[index] * 2.0;
}
}
void all_actions(std::vector<double>& t1, std::vector<double>& t2)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
t1.data()[index] = t2.data()[index] * t2.data()[index];
t1.data()[index] = std::sqrt(t2.data()[index]);
t1.data()[index] = t2.data()[index] * 2.0;
}
}
int main()
{
// decide the process parameters
const auto n = std::size_t{8000000};
const auto test_count = std::size_t{500};
// garbage data...
auto t1 = std::vector<double>(n);
auto t2 = std::vector<double>(n);
//+/////////////////
// perform actions one after the other
//+/////////////////
const auto sp = timer::spot_timer();
const auto dur1 = sp.duration_in_us();
for (auto index = std::size_t{}; index < test_count; ++index)
{
#pragma noinline
action1(t1, t2);
#pragma noinline
action2(t1, t2);
#pragma noinline
action3(t1, t2);
#pragma noinline
action4(t1, t2);
#pragma noinline
action5(t1, t2);
}
const auto dur2 = sp.duration_in_us();
//+/////////////////
// perform all actions at once
//+/////////////////
const auto dur3 = sp.duration_in_us();
for (auto index = std::size_t{}; index < test_count; ++index)
{
#pragma noinline
all_actions(t1, t2);
}
const auto dur4 = sp.duration_in_us();
const auto a = dur2 - dur1;
const auto b = dur4 - dur3;
if (a < b)
{
throw std::logic_error("negative_latency_error");
}
const auto fork_join_latency = (a - b) / (test_count * 4);
// report
std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: " << fork_join_latency << " us" << std::endl;
return 0;
}
As you can see, the idea is to perform a set of actions separately (each within its own OpenMP loop) and compute the average duration, and then to perform all of these actions together (within a single OpenMP loop) and compute that average duration. This gives a linear system of equations in two variables, one of which is the latency of the fork/join mechanism, and solving the system yields its value.
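Spelled out, the system looks like this (assuming the merged loop performs the same total arithmetic work per iteration as the five separate loops combined; W is that per-iteration work, L the fork/join latency):

```text
T_A = test_count * (W + 5*L)    // part A: 5 parallel loops per iteration
T_B = test_count * (W + 1*L)    // part B: 1 merged parallel loop per iteration
T_A - T_B = 4 * test_count * L
L = (T_A - T_B) / (4 * test_count)   // i.e. (a - b) / (test_count * 4) in the code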
Questions:
- Am I overlooking something?
- Currently, I am using "-O0" to prevent the smarty-pants compiler from doing its funny business. Which compiler optimizations should I use, and would they also affect the measured latency itself?
- On my Coffee Lake processor with 6 cores, I measured a latency of ~850 us. Does this sound about right?
Edit 3
1) I've included a warm-up computation at the beginning, upon @paleonix's suggestion,
2) I've reduced the number of actions for simplicity, and
3) I've switched to 'omp_get_wtime' to make the code universally understandable.
I am now running the following code with flag -O3:
#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>
#include <omp.h>

void action1(std::vector<double>& t1)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::sin(t1.data()[index]);
}
}
void action2(std::vector<double>& t1)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::cos(t1.data()[index]);
}
}
void action3(std::vector<double>& t1)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
t1.data()[index] = std::atan(t1.data()[index]);
}
}
void all_actions(std::vector<double>& t1, std::vector<double>& t2, std::vector<double>& t3)
{
#pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
for (auto index = std::size_t{}; index < t1.size(); ++index)
{
#pragma optimize("", off)
t1.data()[index] = std::sin(t1.data()[index]);
t2.data()[index] = std::cos(t2.data()[index]);
t3.data()[index] = std::atan(t3.data()[index]);
#pragma optimize("", on)
}
}
int main()
{
// decide the process parameters
const auto n = std::size_t{1500000}; // 12 MB (way too big for any cache)
const auto experiment_count = std::size_t{1000};
// garbage data...
auto t1 = std::vector<double>(n);
auto t2 = std::vector<double>(n);
auto t3 = std::vector<double>(n);
auto t4 = std::vector<double>(n);
auto t5 = std::vector<double>(n);
auto t6 = std::vector<double>(n);
auto t7 = std::vector<double>(n);
auto t8 = std::vector<double>(n);
auto t9 = std::vector<double>(n);
//+/////////////////
// warm-up, initialization of threads etc.
//+/////////////////
for (auto index = std::size_t{}; index < experiment_count / 10; ++index)
{
all_actions(t1, t2, t3);
}
//+/////////////////
// perform actions (part A)
//+/////////////////
const auto dur1 = omp_get_wtime();
for (auto index = std::size_t{}; index < experiment_count; ++index)
{
action1(t4);
action2(t5);
action3(t6);
}
const auto dur2 = omp_get_wtime();
//+/////////////////
// perform all actions at once (part B)
//+/////////////////
const auto dur3 = omp_get_wtime();
#pragma nofusion
for (auto index = std::size_t{}; index < experiment_count; ++index)
{
all_actions(t7, t8, t9);
}
const auto dur4 = omp_get_wtime();
const auto a = dur2 - dur1;
const auto b = dur4 - dur3;
const auto fork_join_latency = (a - b) / (experiment_count * 2);
// report
std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: "
<< fork_join_latency * 1E+6 << " us" << std::endl;
return 0;
}
With this, the measured latency is now 115 us. What puzzles me is that this value changes when the actions are changed. By my logic, since I am performing the same actions in both part A and part B, nothing should actually change. Why is this happening?
Here is my attempt at measuring fork-join overhead:
You can call it with several different numbers of threads, which should not be higher than the number of cores on your machine in order to give reasonable results. On my machine with 6 cores, compiling with gcc 11.2, I get
In each line the first number is the average (over 100'000 iterations) with threads and the second number is the average without threads. The last number is the difference between the first two and should be an upper bound on the fork-join overhead.
Make sure that the numbers in the middle column (no threads) are approximately the same in every row, as they should be independent of the number of threads. If they aren't, make sure there is nothing else running on the computer and/or increase the number of measurements and/or warmup runs.
In regard to exchanging OpenMP for MPI, keep in mind that MPI is still multiprocessing and not multithreading. You might pay a lot of memory overhead because processes tend to be much bigger than threads.
EDIT:
Revised the benchmark to spin on a volatile flag instead of sleeping (thanks @Jérôme Richard). As Jérôme Richard mentioned in his answer, the measured overhead grows with 'n_spins'. Setting 'n_spins' below 1000 did not significantly change my measurements, so that is where I measured. As one can see above, the measured overhead is way lower than what the earlier version of the benchmark measured. The inaccuracy of sleeping is a problem, especially because one always measures the thread that sleeps the longest and therefore gets a bias toward longer times, even if the sleep times themselves were distributed symmetrically around the input time.
TL;DR: cores do not operate at exactly the same speed, due to dynamic frequency scaling, and there is a lot of noise that can impact the execution, resulting in expensive synchronizations. Your benchmark mostly measures this synchronization overhead. Using a single parallel section should solve this problem.
The benchmark is rather flawed. This code does not actually measure the "latency" of OpenMP fork/join sections. It measures a combination of many overheads, including:
Load balancing and synchronization: the split loops perform more frequent synchronizations (5 times more) than the big merged one. Synchronizations are expensive, not because of communication overhead, but because of the actual synchronization between cores that are inherently not synchronized. Indeed, a slight work imbalance between the threads results in the other threads waiting for the slowest one to finish. You might think this should not happen because of static scheduling, but context switches and dynamic frequency scaling cause some threads to be slower than others. Context switches matter especially if threads are not bound to cores, or if other programs are scheduled by the OS during the computation. Dynamic frequency scaling (e.g. Intel Turbo Boost) causes some cores (or groups of threads) to run faster depending on the workload, the temperature of each core and of the overall package, the number of active cores, the estimated power consumption, etc. The higher the number of cores, the higher the synchronization overhead. Note that this overhead depends on the time taken by the loops. For more information, please read the analysis below.
Performance of loop splitting: merging the 5 loops into a single one affects the generated assembly code (since fewer instructions are needed) and also affects loads/stores in the cache (since the memory access pattern is a bit different). Not to mention it could theoretically affect vectorization, although ICC does not vectorize this specific code. That being said, this does not appear to be the main issue in practice on my machine, since I cannot reproduce the problem with Clang when running the program sequentially, while I can with many threads.
To solve this problem you can use a single parallel section. The 'omp for' loops must use the 'nowait' clause so as not to introduce synchronizations. Alternatively, a task-based construct like 'taskloop' with the 'nogroup' clause can help achieve the same thing. In both cases you should be careful about dependencies, since multiple for-loops/task-loops can then run in parallel. This is fine in your current code.

Analysis
Analyzing the effect of short synchronizations caused by execution noise (context switches, frequency scaling, cache effects, OS interrupts, etc.) is pretty hard, since in your case it is likely never the same thread that is slowest during a synchronization (the work is quite balanced between threads, but their speed is not perfectly equal).
That being said, if this hypothesis is true, 'fork_join_latency' should depend on 'n'. Thus, increasing 'n' should also increase 'fork_join_latency'. Here is what I get with Clang 13 + IOMP (using '-fopenmp -O3') on my 6-core i5-9600KF processor:

Note that the 'fork_join_latency' timings are not very stable in practice, but the behavior is quite visible: the measured overhead depends on 'n'.

A better solution is to measure the synchronization time directly: time the loop on each thread and accumulate the difference between the minimum and maximum time. Here is an example of code:
You can then divide 'totalSyncTime' the same way you did for 'fork_join_latency' and print the result. I get 0.000284 with fork_join_latency=0.000398 (with n=8'000'000), which almost proves that a major part of the overhead is due to synchronization, and more specifically to slightly different thread execution speeds. Note that this overhead does not include the implicit barrier at the end of the OpenMP parallel section.
See my answer to a related question: https://stackoverflow.com/a/71812329/2044454
TL;DR: I split 10k parallel loops into x outside the parallel region and 10k/x inside. The conclusion is that the cost of starting a parallel region is basically zip.