How to efficiently parallelize a heavy task inside a while loop?

Posted on 2025-01-20 13:16:19

I have developed a serial version of a particle simulation code, and now I want to speed up only the heaviest task in the time-stepping loop. Three different tasks (A, B, C) are performed during one time step:

A: 1) update particles contained in a sub-domain (cell)
   2) then update particle's neighbors (particles)

B: 1) update potential contact pairs between particles (close enough)
   2) loop surface points (10-20k per particle) of each contact pair: find contact point

C: Integration: update each particle's position, velocity, etc.

The heaviest task is B.2, which normally takes 50-70% of the CPU time.

So my first idea is to parallelize B.2 and leave the rest as serial computation.

...
int N_every_neighbors = 1000;
int N_every_nodes = 100;

while (time())
{
  // update neighbors
  if (curr_steps % N_every_neighbors == 0)
  {
    A.update_cell_sub_rigids();  // light task
    A.update_neighbor_list();    // light task

    B.update_contact_pairs();    // moderate task
    B.update_node_neighbors(check_all);    // heaviest task!
  }

  if (curr_steps % N_every_nodes == 0) 
  {
    B.update_node_neighbors(not_check_all); // second heaviest
  }

  // update particle position, contact forces
  C.integration.initial_integrate();     // light task
  C.integration.update_contact_forces(); // moderate task
  C.integration.final_integrate();       // light task
}
...

The problem is that tasks A, B and C have to be executed sequentially to get a correct result, i.e. they are NOT independent tasks.

A.1 ---> A.2 ===> B.1 ---> B.2 ===> C.1 ---> C.2 ---> C.3

So what I want to do first is to make the heavy task B.2, i.e. B.update_node_neighbors(), run in parallel, since this function contains nested loops.

As I am quite new to OpenMP, I only tried a simple optimization.

int N_threads = 8;
omp_set_num_threads(N_threads);

#pragma omp parallel 
#pragma omp single

while (time())
{
  // do tasks A ---> B ---> C;
}


void B::update_node_neighbors (bool check_all)
{
  int All_contact_pairs = this->contact_pairs.size();

  #pragma omp for
  for (int i=0; i<All_contact_pairs; i++)
  {
    auto& particle_i_contacts = this->contact_pairs[i];
    int N_contacts_i = particle_i_contacts.size();
    
    // loop over all contacts for particle i
    for (int j=0; j<N_contacts_i; j++)
    {
      auto& pair_ij = particle_i_contacts[j];
      
      // really heavy computation here
      ...
    }
  }
}

With this change I found no significant performance increase. I would like to ask those experienced in parallel computing: is there a better way to make the function B.2 run in parallel at each time step while the remaining tasks run serially?
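One thing that may be worth noting about the listing above: the whole time loop runs inside `#pragma omp single`, so only one thread ever reaches the `#pragma omp for` in `B::update_node_neighbors`, and a worksharing loop nested inside a `single` region is not a conforming OpenMP arrangement. A minimal sketch of one alternative follows, assuming the outer iterations are independent (each writes only its own output). The names `heavy_work`, `contacts_per_particle` and `run_steps` are hypothetical stand-ins, not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the "really heavy computation" on one contact pair.
inline double heavy_work(int i, int j)
{
  double acc = 0.0;
  for (int pt = 0; pt < 10000; ++pt)  // mimics a loop over surface points
    acc += std::sin(0.001 * pt + i) * std::cos(0.001 * pt + j);
  return acc;
}

// Each outer iteration writes only result[i], so iterations are independent.
std::vector<double> update_node_neighbors(const std::vector<int>& contacts_per_particle)
{
  const int n = static_cast<int>(contacts_per_particle.size());
  std::vector<double> result(n, 0.0);

  // Combined construct: the thread team is created here, inside the function.
  // schedule(dynamic) hands out iterations one at a time, which may help
  // when particles have very different numbers of contacts.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n; ++i)
  {
    assert(contacts_per_particle[i] >= 0);
    double sum = 0.0;
    for (int j = 0; j < contacts_per_particle[i]; ++j)
      sum += heavy_work(i, j);
    result[i] = sum;
  }
  return result;
}

// The time loop itself stays serial; only the heavy call forks threads.
double run_steps(int n_steps, const std::vector<int>& contacts_per_particle)
{
  double checksum = 0.0;
  for (int step = 0; step < n_steps; ++step)
  {
    // tasks A and B.1 would run here, serially
    const std::vector<double> r = update_node_neighbors(contacts_per_particle);
    // task C would run here, serially
    for (double v : r) checksum += v;
  }
  return checksum;
}
```

With this layout there is no `parallel` or `single` around the time loop, so the A ---> B ---> C ordering is preserved by ordinary sequential execution. Whether `dynamic` beats `static` depends on how uneven the per-particle work is, so it is worth measuring both.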

Update 1:

I ran some simple tests on the heavy task B.2 only:

while (time())
{
  if (condition_0)
  {
    A.1; 
    A.2;

    B.1;
    B.2(true); // heavy task!
  }

  if (condition_1) 
  {
    B.2(false);  // second heaviest
  }

  C.1;
  C.2;
  C.3;
}

The actual content of B.2 looks like this:

void B::update_node_neighbors(bool check_all)
{
  ...
  
  int N_threads = 6;
  omp_set_num_threads(N_threads);
  
  #pragma omp parallel for schedule(static)
  for (int i=0; i<N_contacts; i++)
  {
    ...

    // particle-particle contacts
    for (int j=0; j<N_contacts_pp; j++)
    {
      for(int pt_id ...)
      {
        // check all particle_i's surface points to particle_j
        // do_the_actual_work
      }
    }
    
    // particle-wall contacts
    for (int k=0; k<N_contacts_pw; k++)
    {
      for(int pt_id ...)
      {
        // check all particle_i's surface points to wall_k
        // do_the_actual_work
      }
    }
  }
}

I tried N_threads = 1, 2, 4, 6, 8, 10, 12; for a fixed number of time steps, the CPU time is more or less the same. Why does the OpenMP parallelization of the outermost loop in B.2 not work? I could not figure it out :(
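The post does not say how the "CPU time" was measured, so the following is only an assumption worth ruling out. On typical POSIX systems `std::clock()` reports processor time summed over all threads of the process, which stays roughly constant (or grows) as threads are added even when the elapsed wall-clock time drops. A minimal sketch that records both side by side; `Timing` and `time_parallel_work` are hypothetical names, and the loop body is a placeholder workload:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <ctime>

struct Timing
{
  double wall_seconds;  // elapsed real time
  double cpu_seconds;   // processor time as reported by std::clock()
  double checksum;      // keeps the compiler from discarding the work
};

Timing time_parallel_work(int n_outer, int n_inner)
{
  assert(n_outer >= 0 && n_inner >= 0);
  const auto wall_start = std::chrono::steady_clock::now();
  const std::clock_t cpu_start = std::clock();

  double checksum = 0.0;
  // Placeholder workload; the reduction gives each thread a private partial sum.
  #pragma omp parallel for schedule(static) reduction(+:checksum)
  for (int i = 0; i < n_outer; ++i)
    for (int j = 0; j < n_inner; ++j)
      checksum += std::sin(1e-3 * i) * std::cos(1e-3 * j);

  const std::clock_t cpu_end = std::clock();
  const auto wall_end = std::chrono::steady_clock::now();

  Timing t;
  t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
  t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
  t.checksum = checksum;
  return t;
}
```

If `wall_seconds` shrinks with more threads while `cpu_seconds` does not, the parallel loop is working and only the metric was misleading. If neither changes, it is worth checking that the code is actually compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), since the pragma is otherwise ignored.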