OpenMP array initialization impact


I am processing an array in parallel with OpenMP (the "Work" part below). If I initialize the array in parallel beforehand, the work part takes 18 ms. If I initialize the array serially without OpenMP, the work part takes 58 ms. What causes the worse performance?

The system:

  • Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores / 56 threads, 2 Sockets)

Example code:

unsigned long sum = 0;
const unsigned long array_length = 160000000;
long* array = (long*)malloc(sizeof(long) * array_length);

// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for(unsigned int i = 0; i < array_length; i++){
    array[i]= i%10;
}


// Time start

// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}

// Time End
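
The question does not show how the two timing markers are implemented; below is a minimal sketch of one way to fill them in, assuming OpenMP's omp_get_wtime() (needs #include <omp.h> and <cstdio>; array, array_length and sum are the variables from the snippet above):

double t_start = omp_get_wtime();   // Time start

#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}

double t_end = omp_get_wtime();     // Time End
printf("work: %.2f ms, sum = %lu\n", (t_end - t_start) * 1000.0, sum);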


倒数 2025-01-31 20:34:33


There are two aspects at work here:

NUMA allocation

In a NUMA system, memory pages can be local to a CPU or remote. By default, Linux allocates memory with a first-touch policy, meaning the first write access to a memory page determines on which node the page is physically allocated.

If your malloc is large enough that new memory is requested from the OS (instead of reusing existing heap memory), this first touch happens during the initialization. Because you use static scheduling for OpenMP, each thread later works on the same memory that it initialized. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, the memory will be local to it.

If you don't parallelize the initialization, the memory ends up local to the main thread, which is worse for the threads running on the other socket.

Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
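
One way to see the first-touch placement directly is to ask the kernel which NUMA node actually holds the pages. This is a minimal sketch, assuming Linux and libnuma's move_pages(2) (compile with -fopenmp, link with -lnuma); with the parallel initialization the sampled pages should be spread over both nodes, while with a serial initialization they should all report the node the main thread ran on:

#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <unistd.h>
#include <numaif.h>   // move_pages(2), link with -lnuma

int main() {
    const unsigned long array_length = 160000000;
    long* array = (long*)malloc(sizeof(long) * array_length);

    // Parallel first touch: each thread writes (and therefore places) its chunk.
    #pragma omp parallel for schedule(static)
    for (unsigned long i = 0; i < array_length; i++)
        array[i] = i % 10;

    // Query the NUMA node of a few evenly spaced, page-aligned sample addresses.
    const int samples = 8;
    const uintptr_t page_mask = ~(uintptr_t)(sysconf(_SC_PAGESIZE) - 1);
    void* pages[samples];
    int status[samples];
    for (int s = 0; s < samples; s++)
        pages[s] = (void*)((uintptr_t)&array[s * (array_length / samples)] & page_mask);

    if (move_pages(0, samples, pages, nullptr, status, 0) == 0)
        for (int s = 0; s < samples; s++)
            printf("sample %d is on NUMA node %d\n", s, status[s]);

    free(array);
    return 0;
}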

Caching

The same as above also applies to caches. The initialization will put array elements into the cache of the CPU doing it. If the same CPU accesses the memory during the second phase, it will be cache-hot and ready to use.
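
Both the page locality and the cache locality rely on each thread staying on the same core across the two loops. That can be made explicit with thread binding; here is a minimal sketch using the standard OpenMP proc_bind clause (setting OMP_PROC_BIND=close and OMP_PLACES=cores in the environment achieves the same without code changes):

// Bind threads to cores so the thread that first touched (and cached) a chunk
// is the one that processes it in the work phase; both loops use the same
// static schedule, so the chunk-to-thread mapping is identical.
#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = i % 10;

#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
    if (array[i] < 4)
        sum += array[i];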

怎言笑 2025-01-31 20:34:33


First of all, the explanation by @Homer512 is completely correct.

Now I note that you marked this question "C++", but you're using malloc for your array. That is bad style in C++: you should use std::vector for your simple containers, std::array for small enough ones.

And then you have a big problem, because std::vector uses "value initialization": the whole array is automatically filled with zeroes, and there is no way to have that done in parallel with OpenMP.

Here is a big trick:

#include <vector>
using std::vector;

// Wrapper whose default constructor deliberately does nothing, so the
// vector's value-initialization becomes a no-op and the real (first-touch)
// initialization can be done in parallel with OpenMP.
template<typename T>
struct uninitialized {
  uninitialized() {}
  T val;
  constexpr operator T() const { return val; }
  T operator=(const T& v) { val = v; return val; }
};

const int N = 100000000;            // illustrative size; any length works
vector<uninitialized<double>> x(N), y(N);

#pragma omp parallel for
for (int i = 0; i < N; i++)
  y[i] = x[i] = 0.;
x[0] = 0.; x[N-1] = 1.;
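
Applied to the question's array (a sketch reusing the array_length from the question's snippet), the wrapper lets the buffer live in a std::vector while the first write still happens inside the parallel loop, so the first-touch NUMA placement described in the other answer is preserved:

std::vector<uninitialized<long>> array(array_length);

#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = (long)(i % 10);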