OpenMP array initialization impact


I am processing an array in parallel with OpenMP (the "Work" part below). If I initialize the array in parallel beforehand, the work part takes 18 ms. If I initialize the array serially without OpenMP, the work part takes 58 ms. What causes the worse performance?

The system:

  • Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores / 56 threads, 2 Sockets)

Example code:

unsigned long sum = 0;
const unsigned long array_length = 160000000;
long* array = (long*)malloc(sizeof(long) * array_length);

// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for(unsigned int i = 0; i < array_length; i++){
    array[i]= i%10;
}


// Time start

// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}

// Time End
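
The question does not show how the two timing markers are implemented; below is a minimal sketch of one way to fill them in, assuming OpenMP's omp_get_wtime() (needs #include <omp.h> and <cstdio>; array, array_length and sum are the variables from the snippet above):

double t_start = omp_get_wtime();   // Time start

#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}

double t_end = omp_get_wtime();     // Time End
printf("work: %.2f ms, sum = %lu\n", (t_end - t_start) * 1000.0, sum);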


倒数 2025-01-31 20:34:33


There are two aspects at work here:

NUMA allocation

In a NUMA system, memory pages can be local to a CPU or remote. By default, Linux allocates memory with a first-touch policy, meaning the first write access to a memory page determines on which node the page is physically allocated.

If your malloc is large enough that new memory is requested from the OS (instead of reusing existing heap memory), this first touch happens during the initialization. Because you use static scheduling for OpenMP, each thread later works on the same memory that it initialized. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, the memory will be local to it.

If you don't parallelize the initialization, the memory ends up local to the main thread, which is worse for the threads running on the other socket.

Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
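
One way to see the first-touch placement directly is to ask the kernel which NUMA node actually holds the pages. This is a minimal sketch, assuming Linux and libnuma's move_pages(2) (compile with -fopenmp, link with -lnuma); with the parallel initialization the sampled pages should be spread over both nodes, while with a serial initialization they should all report the node the main thread ran on:

#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <unistd.h>
#include <numaif.h>   // move_pages(2), link with -lnuma

int main() {
    const unsigned long array_length = 160000000;
    long* array = (long*)malloc(sizeof(long) * array_length);

    // Parallel first touch: each thread writes (and therefore places) its chunk.
    #pragma omp parallel for schedule(static)
    for (unsigned long i = 0; i < array_length; i++)
        array[i] = i % 10;

    // Query the NUMA node of a few evenly spaced, page-aligned sample addresses.
    const int samples = 8;
    const uintptr_t page_mask = ~(uintptr_t)(sysconf(_SC_PAGESIZE) - 1);
    void* pages[samples];
    int status[samples];
    for (int s = 0; s < samples; s++)
        pages[s] = (void*)((uintptr_t)&array[s * (array_length / samples)] & page_mask);

    if (move_pages(0, samples, pages, nullptr, status, 0) == 0)
        for (int s = 0; s < samples; s++)
            printf("sample %d is on NUMA node %d\n", s, status[s]);

    free(array);
    return 0;
}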

Caching

The same as above also applies to caches. The initialization will put array elements into the cache of the CPU doing it. If the same CPU accesses the memory during the second phase, it will be cache-hot and ready to use.
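
Both the page locality and the cache locality rely on each thread staying on the same core across the two loops. That can be made explicit with thread binding; here is a minimal sketch using the standard OpenMP proc_bind clause (setting OMP_PROC_BIND=close and OMP_PLACES=cores in the environment achieves the same without code changes):

// Bind threads to cores so the thread that first touched (and cached) a chunk
// is the one that processes it in the work phase; both loops use the same
// static schedule, so the chunk-to-thread mapping is identical.
#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = i % 10;

#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
    if (array[i] < 4)
        sum += array[i];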

怎言笑 2025-01-31 20:34:33


First of all, the explanation by @Homer512 is completely correct.

Now I note that you marked this question "C++", but you're using malloc for your array. That is bad style in C++: you should use std::vector for your simple containers, std::array for small enough ones.

And then you have a big problem, because std::vector uses "value initialization": the whole array is automatically filled with zeroes, and there is no way to have that done in parallel with OpenMP.

Here is a big trick:

#include <vector>
using std::vector;

// Wrapper whose default constructor deliberately does nothing, so the
// vector's value-initialization becomes a no-op and the real (first-touch)
// initialization can be done in parallel with OpenMP.
template<typename T>
struct uninitialized {
  uninitialized() {}
  T val;
  constexpr operator T() const { return val; }
  T operator=(const T& v) { val = v; return val; }
};

const int N = 100000000;            // illustrative size; any length works
vector<uninitialized<double>> x(N), y(N);

#pragma omp parallel for
for (int i = 0; i < N; i++)
  y[i] = x[i] = 0.;
x[0] = 0.; x[N-1] = 1.;
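
Applied to the question's array (a sketch reusing the array_length from the question's snippet), the wrapper lets the buffer live in a std::vector while the first write still happens inside the parallel loop, so the first-touch NUMA placement described in the other answer is preserved:

std::vector<uninitialized<long>> array(array_length);

#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = (long)(i % 10);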