Controlling threads inside oneapi/mkl/blas cblas_dgemm (and cblas_daxpy)

I'm measuring the time performance of multiple multi-threaded schemes with nested BLAS function calls. More specifically, the following calls:

cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
            phr, phr, LDA, alpha, A, LDA, B, LDB, beta, C, LDC);
cblas_daxpy(N, alpha, X, incX, Y, incY);

The problem consists of computing the local contribution of each element and then assembling those contributions into a global matrix. Hence, each dgemm call consists of a small number of operations. Whenever the dgemm or daxpy calls are parallelized, the simulation takes longer to execute, and therefore these functions should be executed serially. Note that for very small operations BLAS does not parallelize dgemm/daxpy calls; the matrices here are big enough to be parallelized by default by the BLAS calls, but not big enough to justify the usage of additional threads.

A multi-threaded procedure is used to compute each element contribution (which calls those BLAS functions) and assemble the local matrices into a global one. Three schemes are evaluated for the best time performance, each of which is described next.

OMP scheme

The OMP scheme follows. The function ComputingCalcStiffAndAssembling is responsible for computing the local contribution (where BLAS is called) and assembling it into the global matrix. The usage of either a coloring strategy or atomic_add functions ensures the operation remains thread-safe (a minimal sketch of an atomic accumulation is given after the loop). A static schedule does not fit this application, hence a dynamic one is used, but the chunk size of the dynamic schedule was not evaluated and may not be optimal.

omp_set_num_threads(nthread);
#pragma omp parallel for schedule(dynamic,1)
for (int64_t iel = 0; iel < nelem; iel++) {
    {
        TPZCompEl *el = fMesh->Element(iel);
        if (!el) continue;

        ComputingCalcstiffAndAssembling(stiffness, rhs, el);
    }
}
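
For reference, a minimal sketch of what the atomic_add path could look like when assembling into a dense, row-major global matrix is shown below. The names AssembleLocal, dest and ldGlobal are hypothetical stand-ins; the actual code relies on either the coloring strategy or the library's own atomic_add helper.

#include <cstdint>
#include <vector>

// Hypothetical helper: accumulate a dense n-by-n local matrix into a flat,
// row-major global matrix with leading dimension ldGlobal. The atomic update
// keeps the accumulation safe when two elements share degrees of freedom.
// Compile with -fopenmp.
void AssembleLocal(const std::vector<double> &local, int64_t n,
                   const std::vector<int64_t> &dest,
                   std::vector<double> &global, int64_t ldGlobal)
{
    for (int64_t i = 0; i < n; i++) {
        for (int64_t j = 0; j < n; j++) {
            #pragma omp atomic
            global[dest[i] * ldGlobal + dest[j]] += local[i * n + j];
        }
    }
}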

TBB scheme

The TBB call follows. The body of the loop is similar to the one described in the OMP scheme. The library has support for an atomicAdd function in the TBB paradigm, hence thread safety is ensured either by coloring or via such calls (one possible shape for such a helper is sketched after the loop).

tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, nthread);
tbb::parallel_for(tbb::blocked_range<int64_t>(0, nelem),
                  [&](tbb::blocked_range<int64_t> r) {
    for (int64_t iel = r.begin(); iel < r.end(); iel++)
    {
        TPZCompEl *el = fMesh->Element(iel);
        if (!el) continue;

        ComputingCalcstiffAndAssembling(stiffness, rhs, el);
    }
});
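
Since #pragma omp atomic is not available inside TBB tasks, the atomicAdd mentioned above can be pictured as a compare-exchange loop on std::atomic<double>. This is only a sketch of one possible implementation; the library's actual routine may differ.

#include <atomic>

// Sketch of an atomicAdd usable from the TBB loop body: retry a
// compare-exchange until the accumulated value is stored successfully.
inline void AtomicAdd(std::atomic<double> &target, double value)
{
    double expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // On failure, expected is refreshed with the current value; retry.
    }
}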

std::thread scheme

This scheme is based on a producer-consumer algorithm. While ThreadData::ThreadWork computes the contributions of the local elements using multiple threads, an additional thread is reserved for ThreadData::ThreadAssembly, which assembles each contribution into the global matrix (in the code below, the calling thread plays that role). The usage of mutexes and semaphores ensures the operation remains thread-safe (a minimal sketch of the hand-off is given after the code).

std::vector<std::thread> allthreads;
int itr;
for (itr = 0; itr < numthreads; itr++) {
  allthreads.push_back(std::thread(ThreadData::ThreadWork, &threaddata));
}

ThreadData::ThreadAssembly(&threaddata);

for (itr = 0; itr < numthreads; itr++) {
  allthreads[itr].join();
}
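
A minimal sketch of the producer-consumer hand-off is shown below, assuming a mutex and a condition variable in place of the semaphores. The Contribution type, Submit and Consume are hypothetical stand-ins for whatever ThreadData actually exchanges between ThreadWork and ThreadAssembly.

#include <condition_variable>
#include <deque>
#include <mutex>

struct Contribution { /* local matrix + destination indices */ };

struct ThreadDataSketch {
    std::mutex mtx;
    std::condition_variable ready;
    std::deque<Contribution> queue;   // filled by ThreadWork, drained by ThreadAssembly
    bool done = false;                // set (under mtx) + notify_all when all workers finish

    // Producer side: called by each worker after the BLAS-heavy element computation.
    void Submit(Contribution c) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.push_back(std::move(c));
        }
        ready.notify_one();
    }

    // Consumer side: run by the assembly thread; returns false once all work is done.
    bool Consume(Contribution &c) {
        std::unique_lock<std::mutex> lock(mtx);
        ready.wait(lock, [&] { return !queue.empty() || done; });
        if (queue.empty()) return false;
        c = std::move(queue.front());
        queue.pop_front();
        return true;
    }
};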

Controlling MKL #Threads

In order to prevent the BLAS functions from executing with multiple threads, the following calls were tested to control the MKL number of threads:

mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);

This function is supposed to limit the number of threads of BLAS calls. Also,

mkl_set_num_threads_local(1);

is called before the parallel schemes. This function is supposed to limit the number of threads of all MKL executions, and it is supposed to take precedence over the mkl_domain_set_num_threads call, but that does not always happen. The function mkl_set_num_threads had lower precedence than mkl_set_num_threads_local in the tests, so it is not taken into account here.
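
For context, a minimal sketch of how the two controls were issued together (assuming mkl.h and the MKL link line are available) could look as follows; only the two calls themselves come from the tests above, the wrapper name is hypothetical.

#include <mkl.h>

// Hypothetical helper wrapping the two controls tested above. Note that
// mkl_set_num_threads_local applies to the thread that calls it, while
// mkl_domain_set_num_threads(..., MKL_DOMAIN_BLAS) targets the BLAS domain only.
void LimitMklBlasToOneThread()
{
    mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);
    mkl_set_num_threads_local(1);
}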

MKL_THREAD_MODEL = OMP

MKL has support for both the OMP and the TBB threading models. So far, executing the BLAS functions on a single thread was possible with MKL_THREAD_MODEL = OMP for all parallel schemes. Then a new test was proposed: controlling the number of threads of BLAS and of the parallel scheme at the same time (a self-contained toy of this combination is sketched below). It was possible to control the BLAS #threads for the OMP scheme but not for the TBB scheme.
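
A self-contained toy of that combined control could look like the following; nthread, nelem and the matrix size are arbitrary illustration values, and it assumes MKL and OpenMP are available at link time.

#include <mkl.h>
#include <omp.h>
#include <vector>

int main()
{
    const int nthread = 4, nelem = 64, n = 128;
    omp_set_num_threads(nthread);                    // threads of the outer parallel scheme
    mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);  // nested BLAS calls kept serial

    #pragma omp parallel for schedule(dynamic, 1)
    for (int iel = 0; iel < nelem; iel++) {
        // Stand-in for the element computation: one small dgemm per element.
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    }
    return 0;
}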

Is there a way of controlling the #threads for cblas calls nested inside a TBB multi-threaded loop?

MKL_THREAD_MODEL = TBB

So far, we could not restrict cblas nested calls to serial execution when MKL is configured with TBB threads.

Is there a way to restrict cblas nested calls from using multiple threads when MKL_THREAD_MODEL = TBB?

If so, towards having more control over the #threads of the cblas functions:

Is there a way to control cblas nested calls #threads when MKL_THREAD_MODEL = TBB?

Evaluating processor usage for each set-up

The CPU usage and the simulation time are measured for each set-up and are displayed in the following table for MKL_THREAD_MODEL=OMP.

MKL_THREAD_MODEL=OMP

Assemble paradigm | MKL_control | #Threads | %CPU | Duration (s) | Comments
OMP               | Local       | 2        | 200  | 35.7         | Expected
OMP               | Domain      | 2        | 200  | 35.5         | Expected
TBB               | Local       | 2        | 1311 | 62.9         | Not expected
TBB               | Domain      | 2        | 202  | 35.6         | Expected
Serial            | Local       | 1        | 100  | 70.0         | Expected
Serial            | Domain      | 1        | 100  | 69.1         | Expected
std::thread       | Local       | 2        | 2415 | 80.0         | Not expected
std::thread       | Domain      | 2        | 208  | 39.1         | Expected

These simulations ran as expected, with the exception of two. The TBB and std::thread schemes are unable to restrain the cblas functions to serial execution by setting the number of threads via mkl_set_num_threads_local(1). This finding goes against Intel's suggestion to give preference to this call over mkl_domain_set_num_threads, stated in

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/techniques-to-set-the-number-of-threads.html.

Why does mkl_set_num_threads_local(1) not take precedence over mkl_domain_set_num_threads?

Moreover, in the TBB scheme the %CPU stays under 1600% (16 cores at 100% each; see the technical data in the next section), indicating that the execution fits within the machine's physical cores, while for the std::thread scheme the %CPU exceeds 1600%, indicating that more than 16 logical processors are working concurrently.

The hyper-threading option is enabled in the BIOS, but we cannot make sure it is actually in use during execution. Is there a way to check whether hyper-threading is being employed during a particular execution?

The same measurements are made for MKL_THREAD_MODEL=TBB, and the results are shown in the following table.

MKL_THREAD_MODEL=TBB

Assemble paradigm | MKL_control | #Threads | %CPU | Duration (s) | Comments
OMP               | Local       | 2        | 2550 | 101.5        | Not expected
OMP               | Domain      | 2        | 2865 | 124.6        | Not expected
TBB               | Local       | 2        | 202  | 39.4         | Not expected
TBB               | Domain      | 2        | 201  | 47.3         | Not expected
Serial            | Local       | 1        | 100  | 69.6         | Expected
Serial            | Domain      | 1        | 2526 | 247.9        | Not expected
std::thread       | Local       | 2        | 2995 | 124.9        | Not expected
std::thread       | Domain      | 2        | 2946 | 124.1        | Not expected

It was not possible to limit the CBLAS execution to a single thread for most of the schemes. Even for the TBB scheme, where we managed to limit the number of parallel threads, the simulation time is not optimal and it varies from one execution to the next. It seems that TBB is employing the right number of threads, but whether they are employed on the parallel scheme or on the CBLAS execution is not clear.

Technical information

The experiments are run on a 32-processor machine under the Ubuntu 18.04.3 LTS operating system. Each processor has the following technical data, obtained via the command cat /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x2000064
cpu MHz         : 1000.431
cache size      : 22528 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 

The hyper-threading feature is active, as can be seen in the "Thread(s) per core" entry of the lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1000.709
CPU max MHz:         3700,0000
CPU min MHz:         1000,0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            22528K

The %CPU detailed in the previous section is the maximum %CPU observed via the command top -d 1.

Is there a more appropriate tool than top to check the %CPU, more specifically, one capable of telling whether hyper-threading is in use or whether multiple processors are working at the same time?
