Running time varies with the number of threads when running a function received from Python inside an OpenMP parallel block
Here are the files used for the test.
# CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
project(CALLBACK_TEST)
set(CMAKE_CXX_STANDARD 17)
add_compile_options(-O3 -fopenmp -fPIC)
add_link_options(-fopenmp)
add_subdirectory(pybind11)
pybind11_add_module(callback callback.cpp)
add_custom_command(TARGET callback POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E create_symlink $<TARGET_FILE:callback> ${CMAKE_CURRENT_SOURCE_DIR}/callback.so
)
// callback.cpp
#include <cmath>
#include <functional>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/functional.h>

namespace py = pybind11;

class C
{
public:
    C(std::function<float(float)> f, size_t s) : v_(s, 1), f_(f) {}

    // Applies the stored callback element-wise.
    void apply()
    {
        #pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = f_(v_[i]);
    }

    // Same computation, but hard-coded so the compiler can inline it.
    void apply_direct()
    {
        #pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = std::log(1 + v_[i]);
    }

private:
    std::vector<float> v_;
    std::function<float(float)> f_;
};

PYBIND11_MODULE(callback, m)
{
    py::class_<C>(m, "C")
        .def(py::init<std::function<float(float)>, size_t>())
        // Release the GIL for the duration of apply().
        .def("apply", &C::apply, py::call_guard<py::gil_scoped_release>())
        .def("apply_direct", &C::apply_direct);
    m.def("log1p", [](float x) -> float
          { return std::log(1 + x); });
}
# callback.py
import math
import time

from callback import C, log1p

def run(n, func):
    start = time.time()
    if func:
        for _ in range(n):
            c = C(func, 1000)
            c.apply()
    else:
        for _ in range(n):
            c = C(func, 1000)
            c.apply_direct()
    end = time.time()
    print(end - start)

if __name__ == "__main__":
    n = 1000
    print("Python")
    run(n, lambda x: math.log(x + 1))
    print("C++")
    run(n, log1p)
    print("Direct")
    run(n, None)
I ran the Python script on a server with 48 CPU cores; the running times are below. They show that (1) the running time increases as OMP_NUM_THREADS increases, especially when a Python/C++ callback received from Python is involved, and (2) keeping everything inside C++ is much faster, which seems to contradict the "no overhead" claim in the documentation.
$ python callback.py
Python
19.612852573394775
C++
19.268250226974487
Direct
0.04382634162902832
$ OMP_NUM_THREADS=4 python callback.py
Python
6.042902708053589
C++
5.48648738861084
Direct
0.03322458267211914
$ OMP_NUM_THREADS=1 python callback.py
Python
0.5964927673339844
C++
0.38849639892578125
Direct
0.020793914794921875
And when OpenMP is turned off:
$ python callback.py
Python
0.8492450714111328
C++
0.26660943031311035
Direct
0.010872125625610352
So, what is going wrong here?
There are several issues in your code.
First of all, the OpenMP parallel region should have a significant overhead here, since it needs to share the work between 48 threads. Depending on the scheduling policy, this work-sharing can be quite expensive on some platforms, so you need to use schedule(static) to minimize this overhead. In the worst case, a runtime could create 48 threads and join them every time, which is expensive: creating/joining 48*1000 threads would be very expensive (it should take at least several seconds), and the higher the number of threads, the slower the program. That being said, most runtimes try to keep an active pool of threads; still, this is not always possible (it is an optimization, not something the specification requires). Note that most OpenMP runtimes detect the case where OMP_NUM_THREADS is set to 1, so the overhead is very low in that case. The general rule of thumb is to avoid multithreading for very short operations, i.e. anything taking less than about 1 ms.

Moreover, the parallel for loop is subject to false sharing. Indeed, the vector of 1000 float items takes 4000 bytes in memory and is spread over 63 cache lines of 64 bytes on mainstream platforms. With 48 threads, almost all of those cache lines have to move between cores, which is expensive compared to the computation being done. When two threads working on adjacent cache lines execute in an interleaved fashion, a cache line can bounce between cores many times within just a few iterations. On NUMA architectures this is even more expensive, since cache lines must also move between NUMA nodes. Doing this 1000 times is very expensive.
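As an illustration, here is a drop-in variant of C::apply_direct with an explicit static schedule (a sketch; the chunk-size remark in the comment assumes the usual 64-byte cache lines):

// Variant of C::apply_direct: an explicit static schedule gives each
// thread one contiguous block of roughly size/num_threads elements,
// minimizing scheduling overhead and confining false sharing to block
// boundaries. A chunk size, e.g. schedule(static, 16), could further
// align chunks with 64-byte cache lines, but only if the buffer itself
// is 64-byte aligned (std::vector does not guarantee that).
void apply_direct()
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < v_.size(); i++)
        v_[i] = std::log(1 + v_[i]);
}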
Additionally, AFAIK, calling a Python function from a parallel context is either not safe or gains no speed-up because of the global interpreter lock (GIL). By "not safe" I mean that the CPython interpreter's data structures can be corrupted, causing non-deterministic crashes; this is why the GIL exists. The GIL prevents any code from scaling across multiple threads as long as it is not released, and releasing the GIL for too short a period also causes cache-line bouncing, which is detrimental to performance (more so than just running the code sequentially).
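To make the serialization visible, here is an illustrative rewrite of C::apply with the locking made explicit; note that pybind11's functional.h wrapper, as far as I know, reacquires the GIL on every call anyway, so the behavior is roughly the same even without the explicit guard:

// Illustration: what each iteration effectively costs when f_ wraps a
// Python callable. Only one thread can hold the GIL at a time, so the
// 48 threads take turns; the parallel loop serializes and merely adds
// locking and cache-line traffic on top.
void apply()
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < v_.size(); i++)
    {
        py::gil_scoped_acquire gil; // blocks until no other thread holds the GIL
        v_[i] = f_(v_[i]);
    }
}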
Finally, the "C++" and Python have a much bigger overhead than the "direct" method because they are calling dynamically-defined functions that cannot be inlined or vectorized by the compiler. Python functions are especially slow because of the CPython interpreter. If you want to make a fair benchmark you need to compare the PyBind solution with one that use
std::function
(be careful about clever compiler optimizations though).
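For instance, a minimal pure-C++ baseline along the following lines (a sketch: the file name, timing harness, and volatile sink are my own; the sink keeps the compiler from optimizing the work away) would isolate the std::function overhead from the Python overhead:

// bench.cpp -- standalone sketch: std::function callback vs. direct call.
// Build (assumed): g++ -O3 -fopenmp bench.cpp -o bench
#include <chrono>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

static void apply(std::vector<float> &v, const std::function<float(float)> &f)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < v.size(); i++)
        v[i] = f(v[i]);
}

static void apply_direct(std::vector<float> &v)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < v.size(); i++)
        v[i] = std::log(1 + v[i]);
}

int main()
{
    const int n = 1000;
    volatile float sink = 0; // defeats dead-code elimination

    auto bench = [&](auto &&body, const char *name)
    {
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < n; r++)
        {
            std::vector<float> v(1000, 1.0f);
            body(v);
            sink = sink + v[0]; // keep the result observable
        }
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        std::printf("%s: %f s\n", name, dt.count());
    };

    std::function<float(float)> f = [](float x) { return std::log(1 + x); };
    bench([&](std::vector<float> &v) { apply(v, f); }, "std::function");
    bench([&](std::vector<float> &v) { apply_direct(v); }, "direct");
    return 0;
}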