Effect of OpenMP array initialization on performance
I am processing an array in parallel with OpenMP (the working part). If I first initialize the array in parallel, the working part takes 18 ms. If I initialize the array serially, without OpenMP, the working part takes 58 ms. What causes the worse performance?
The system:
- Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores / 56 threads, 2 Sockets)
Example code:
unsigned long sum = 0;
unsigned long array_length = 160000000;
long* array = (long*)malloc(sizeof(long) * array_length);

// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++) {
    array[i] = i % 10;
}

// Time start
// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}
// Time End
Answers (2)
There are two aspects at work here:
NUMA allocation
In a NUMA system, memory pages can be local to a CPU or remote. By default, Linux allocates memory with a first-touch policy, meaning the first write access to a memory page determines on which node the page is physically allocated.
If your malloc is large enough that new memory is requested from the OS (instead of reusing existing heap memory), this first touch will happen during the initialization. Because you use static scheduling for OpenMP, the memory will later be accessed by the same thread that initialized it. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, that memory will be local.
If you don't parallelize the initialization, all of the memory ends up local to the main thread's node, which is worse for threads running on a different socket.
Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
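For reference, here is a minimal sketch of that pattern, reusing the question's array, array_length and sum; the explicit schedule(static) on the work loop is my addition so that both loops are guaranteed to hand the same chunks to the same thread numbers:

// First touch: with schedule(static), thread t writes one fixed contiguous
// chunk, so those pages are physically allocated on thread t's NUMA node.
#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = i % 10;

// Work: the identical static schedule gives thread t the same chunk again,
// so its reads hit memory that is local to its own socket.
#pragma omp parallel for num_threads(56) schedule(static) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
    if (array[i] < 4)
        sum += array[i];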
Caching
The same as above also applies to caches. The initialization will put array elements into the cache of the CPU doing it. If the same CPU accesses the memory during the second phase, it will be cache-hot and ready to use.
First of all, the explanation by @Homer512 is completely correct.
Now I note that you marked this question "C++", but you're using malloc for your array. That is bad style in C++: you should use std::vector for your simple containers, std::array for small enough ones. And then you have a big problem, because std::vector uses "value initialization": the whole array is automatically filled with zeroes, and there is no way you can let this be done in parallel with OpenMP. Here is a big trick: