Controlling threads inside oneapi/mkl/blas cblas_dgemm (and cblas_daxpy)

I'm measuring the time performance of multiple multi-threaded schemes with nested BLAS function calls. More specifically, the following calls:

cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
            phr, phr, LDA, alpha, A, LDA, B, LDB, beta, C, LDC);
cblas_daxpy(N, alpha, X, incX, Y, incY);

The problem consists of computing the local contribution of each element and then assembling those contributions into a global matrix. Hence, each dgemm call consists of a small number of operations. Whenever the dgemm or daxpy calls are parallelized, the simulation takes longer to execute, and therefore these functions should be executed serially. Note that for very small operations BLAS does not parallelize dgemm/daxpy calls; the matrices here are big enough to be parallelized by default by the BLAS calls, but not big enough to justify the usage of additional threads.

A multi-threaded procedure is used to compute each element contribution (which calls those BLAS functions) and assemble the local matrices into a global one. Three schemes are evaluated for the best time performance, each of which is described next.

OMP scheme

The OMP scheme follows. The function ComputingCalcStiffAndAssembling is responsible for computing the local contribution (where BLAS is called) and assembling it into the global matrix. The usage of either a coloring strategy or atomic_add functions ensures the operation remains thread-safe (a minimal sketch of an atomic accumulation is given after the loop). A static schedule does not fit this application, hence a dynamic one is used, but the chunk size of the dynamic schedule was not evaluated and may not be optimal.

omp_set_num_threads(nthread);
#pragma omp parallel for schedule(dynamic,1)
for (int64_t iel = 0; iel < nelem; iel++) {
    {
        TPZCompEl *el = fMesh->Element(iel);
        if (!el) continue;

        ComputingCalcstiffAndAssembling(stiffness, rhs, el);
    }
}
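
For reference, a minimal sketch of what the atomic_add path could look like when assembling into a dense, row-major global matrix is shown below. The names AssembleLocal, dest and ldGlobal are hypothetical stand-ins; the actual code relies on either the coloring strategy or the library's own atomic_add helper.

#include <cstdint>
#include <vector>

// Hypothetical helper: accumulate a dense n-by-n local matrix into a flat,
// row-major global matrix with leading dimension ldGlobal. The atomic update
// keeps the accumulation safe when two elements share degrees of freedom.
// Compile with -fopenmp.
void AssembleLocal(const std::vector<double> &local, int64_t n,
                   const std::vector<int64_t> &dest,
                   std::vector<double> &global, int64_t ldGlobal)
{
    for (int64_t i = 0; i < n; i++) {
        for (int64_t j = 0; j < n; j++) {
            #pragma omp atomic
            global[dest[i] * ldGlobal + dest[j]] += local[i * n + j];
        }
    }
}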

TBB scheme

The TBB call follows. The body of the loop is similar to the one described in the OMP scheme. The library has support for an atomicAdd function in the TBB paradigm, hence thread safety is ensured either by coloring or via such calls (one possible shape for such a helper is sketched after the loop).

tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, nthread);
tbb::parallel_for(tbb::blocked_range<int64_t>(0, nelem),
                  [&](tbb::blocked_range<int64_t> r) {
    for (int64_t iel = r.begin(); iel < r.end(); iel++)
    {
        TPZCompEl *el = fMesh->Element(iel);
        if (!el) continue;

        ComputingCalcstiffAndAssembling(stiffness, rhs, el);
    }
});
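
Since #pragma omp atomic is not available inside TBB tasks, the atomicAdd mentioned above can be pictured as a compare-exchange loop on std::atomic<double>. This is only a sketch of one possible implementation; the library's actual routine may differ.

#include <atomic>

// Sketch of an atomicAdd usable from the TBB loop body: retry a
// compare-exchange until the accumulated value is stored successfully.
inline void AtomicAdd(std::atomic<double> &target, double value)
{
    double expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // On failure, expected is refreshed with the current value; retry.
    }
}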

std::thread scheme

This scheme is based on a producer-consumer algorithm. While ThreadData::ThreadWork computes the contributions of the local elements using multiple threads, an additional thread is reserved for ThreadData::ThreadAssembly, which assembles each contribution into the global matrix (in the code below, the calling thread plays that role). The usage of mutexes and semaphores ensures the operation remains thread-safe (a minimal sketch of the hand-off is given after the code).

std::vector<std::thread> allthreads;
int itr;
for (itr = 0; itr < numthreads; itr++) {
  allthreads.push_back(std::thread(ThreadData::ThreadWork, &threaddata));
}

ThreadData::ThreadAssembly(&threaddata);

for (itr = 0; itr < numthreads; itr++) {
  allthreads[itr].join();
}
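
A minimal sketch of the producer-consumer hand-off is shown below, assuming a mutex and a condition variable in place of the semaphores. The Contribution type, Submit and Consume are hypothetical stand-ins for whatever ThreadData actually exchanges between ThreadWork and ThreadAssembly.

#include <condition_variable>
#include <deque>
#include <mutex>

struct Contribution { /* local matrix + destination indices */ };

struct ThreadDataSketch {
    std::mutex mtx;
    std::condition_variable ready;
    std::deque<Contribution> queue;   // filled by ThreadWork, drained by ThreadAssembly
    bool done = false;                // set (under mtx) + notify_all when all workers finish

    // Producer side: called by each worker after the BLAS-heavy element computation.
    void Submit(Contribution c) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.push_back(std::move(c));
        }
        ready.notify_one();
    }

    // Consumer side: run by the assembly thread; returns false once all work is done.
    bool Consume(Contribution &c) {
        std::unique_lock<std::mutex> lock(mtx);
        ready.wait(lock, [&] { return !queue.empty() || done; });
        if (queue.empty()) return false;
        c = std::move(queue.front());
        queue.pop_front();
        return true;
    }
};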

Controlling MKL #Threads

In order to prevent the BLAS functions from executing with multiple threads, the following calls were tested to control the MKL number of threads:

mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);

This function is supposed to limit the number of threads of BLAS calls. Also,

mkl_set_num_threads_local(1);

is called before the parallel schemes. This function is supposed to limit the number of threads of all MKL executions, and it is supposed to take precedence over the mkl_domain_set_num_threads call, but that does not always happen. The function mkl_set_num_threads had lower precedence than mkl_set_num_threads_local in the tests, so it is not taken into account here.
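
For context, a minimal sketch of how the two controls were issued together (assuming mkl.h and the MKL link line are available) could look as follows; only the two calls themselves come from the tests above, the wrapper name is hypothetical.

#include <mkl.h>

// Hypothetical helper wrapping the two controls tested above. Note that
// mkl_set_num_threads_local applies to the thread that calls it, while
// mkl_domain_set_num_threads(..., MKL_DOMAIN_BLAS) targets the BLAS domain only.
void LimitMklBlasToOneThread()
{
    mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);
    mkl_set_num_threads_local(1);
}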

MKL_THREAD_MODEL = OMP

MKL has support for both the OMP and the TBB threading models. So far, executing the BLAS functions on a single thread was possible with MKL_THREAD_MODEL = OMP for all parallel schemes. Then a new test was proposed: controlling the number of threads of BLAS and of the parallel scheme at the same time (a self-contained toy of this combination is sketched below). It was possible to control the BLAS #threads for the OMP scheme but not for the TBB scheme.
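
A self-contained toy of that combined control could look like the following; nthread, nelem and the matrix size are arbitrary illustration values, and it assumes MKL and OpenMP are available at link time.

#include <mkl.h>
#include <omp.h>
#include <vector>

int main()
{
    const int nthread = 4, nelem = 64, n = 128;
    omp_set_num_threads(nthread);                    // threads of the outer parallel scheme
    mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);  // nested BLAS calls kept serial

    #pragma omp parallel for schedule(dynamic, 1)
    for (int iel = 0; iel < nelem; iel++) {
        // Stand-in for the element computation: one small dgemm per element.
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    }
    return 0;
}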

Is there a way of controlling the #threads for cblas calls nested inside a TBB multi-threaded loop?

MKL_THREAD_MODEL = TBB

So far, we could not restrict cblas nested calls to serial execution when MKL is configured with TBB threads.

Is there a way to restrict cblas nested calls from using multiple threads when MKL_THREAD_MODEL = TBB?

If so, towards having more control over the #threads of the cblas functions:

Is there a way to control cblas nested calls #threads when MKL_THREAD_MODEL = TBB?

Evaluating processor usage for each set-up

The CPU usage and the simulation time are measured for each set-up and are displayed in the following table for MKL_THREAD_MODEL=OMP.

MKL_THREAD_MODEL=OMP

Assemble paradigm | MKL_control | #Threads | %CPU | Duration (s) | Comments
OMP               | Local       | 2        | 200  | 35.7         | Expected
OMP               | Domain      | 2        | 200  | 35.5         | Expected
TBB               | Local       | 2        | 1311 | 62.9         | Not expected
TBB               | Domain      | 2        | 202  | 35.6         | Expected
Serial            | Local       | 1        | 100  | 70.0         | Expected
Serial            | Domain      | 1        | 100  | 69.1         | Expected
std::thread       | Local       | 2        | 2415 | 80.0         | Not expected
std::thread       | Domain      | 2        | 208  | 39.1         | Expected

These simulations ran as expected, with the exception of two. The TBB and std::thread schemes are unable to restrain the cblas functions to serial execution by setting the number of threads via mkl_set_num_threads_local(1). This finding goes against Intel's suggestion to give preference to this call over mkl_domain_set_num_threads, stated in

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/techniques-to-set-the-number-of-threads.html.

Why does mkl_set_num_threads_local(1) not take precedence over mkl_domain_set_num_threads?

Moreover, in the TBB scheme the %CPU stays under 1600% (16 cores at 100% each; see the technical data in the next section), indicating that the execution fits within the machine's physical cores, while for the std::thread scheme the %CPU exceeds 1600%, indicating that more than 16 logical processors are working concurrently.

The hyper-threading option is enabled in the BIOS, but we cannot make sure it is actually in use during execution. Is there a way to check whether hyper-threading is being employed during a particular execution?

The same measurements are made for MKL_THREAD_MODEL=TBB, and the results are shown in the following table.

MKL_THREAD_MODEL=TBB

Assemble paradigm | MKL_control | #Threads | %CPU | Duration (s) | Comments
OMP               | Local       | 2        | 2550 | 101.5        | Not expected
OMP               | Domain      | 2        | 2865 | 124.6        | Not expected
TBB               | Local       | 2        | 202  | 39.4         | Not expected
TBB               | Domain      | 2        | 201  | 47.3         | Not expected
Serial            | Local       | 1        | 100  | 69.6         | Expected
Serial            | Domain      | 1        | 2526 | 247.9        | Not expected
std::thread       | Local       | 2        | 2995 | 124.9        | Not expected
std::thread       | Domain      | 2        | 2946 | 124.1        | Not expected

It was not possible to limit the CBLAS execution to a single thread for most of the schemes. Even for the TBB scheme, where we managed to limit the number of parallel threads, the simulation time is not optimal and it varies from one execution to the next. It seems that TBB is employing the right number of threads, but whether they are employed on the parallel scheme or on the CBLAS execution is not clear.

Technical information

The experiments are run on a 32-processor machine under the Ubuntu 18.04.3 LTS operating system. Each processor has the following technical data, obtained via the command cat /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x2000064
cpu MHz         : 1000.431
cache size      : 22528 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 

The hyper-threading feature is active, as can be seen in the "Thread(s) per core" entry of the lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1000.709
CPU max MHz:         3700,0000
CPU min MHz:         1000,0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            22528K

The %CPU detailed in the previous section is the maximum %CPU observed via the command top -d 1.

Is there a more appropriate tool than top to check the %CPU, more specifically, one capable of telling whether hyper-threading is in use or whether multiple processors are working at the same time?
