Race condition when making a 3D histogram of particles and building an index with OpenMP atomic capture
I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
int ic=GetIndex(particles[i]);
#pragma omp atomic update
Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
if(particles[i].ip==3)
{
int ic=GetIndex(particles[i]);
if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
int cnt=0;
#pragma omp atomic capture
cnt=Counter[ic]++;
MIndex[Begin[ic]+cnt]=i;
}
}
If I remove
#pragma omp parallel for
the code works properly and the output results are always the same.
But with this pragma there is some undefined behaviour/race condition in the code, because each time it gives different output results.
How to fix this issue?
Update: The task is the following. I have lots of particles with some random coordinates. I need to output to the array MIndex the indices (in the array particles) of the particles that are in each cell (a Cartesian cube of, for example, 1×1×1 cm) of the coordinate system. So at the beginning of MIndex there should be the indices of the particles in the 1st cell of the coordinate system, then those in the 2nd cell, then the 3rd, and so on. The order of the indices within a given cell of MIndex is not important and may be arbitrary. If possible, I need to do this in parallel, maybe using atomic operations.
There is a straightforward way: traverse all the coordinate cells in parallel and, in each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems slow. Is there a faster approach? Is it possible to traverse the particles array only once, in parallel, and fill the MIndex array using atomic operations, something like what is written in the code above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] will go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something. But as Laci points out, perhaps you want that check when calculating Length in the first place; then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].

So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And it will loop over them again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)

Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.

This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 16 KiB apart from each other (cub3 is 16*16*16 ints). (So 4k aliasing is a possible problem, as are cache conflict misses.)

Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for the cache access pattern (and locality, since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.) Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.

Don't overdo it with the number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with the number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 16 KiB (too small to be worth parallelizing the zeroing of, BTW), it's really not that much.

(Previous answer, written before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
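To make the per-thread Length / Begin scheme described above concrete, here is a minimal sketch (untested, my own illustration rather than the answer's code). It reuses the declarations from the question (particles, GetIndex, MIndex, cub3); the function name build_mindex, the Len/Beg vectors and the fixed chunking of the i range are assumptions. The prefix sum is kept serial inside a single region for clarity; as noted above, each thread could instead prefix-sum its own slice.

#include <omp.h>
#include <vector>

void build_mindex(int N, int nthreads)
{
    // Per-thread histograms and per-thread starting positions in MIndex.
    std::vector<std::vector<int>> Len(nthreads, std::vector<int>(cub3, 0));
    std::vector<std::vector<int>> Beg(nthreads, std::vector<int>(cub3, 0));

    #pragma omp parallel num_threads(nthreads)
    {
        const int t  = omp_get_thread_num();
        // Fixed chunking: both passes must split the i range the same way.
        const int lo = (int)((long long)N * t / nthreads);
        const int hi = (int)((long long)N * (t + 1) / nthreads);

        // Pass 1: per-thread histogram, with the same ip==3 filter as pass 2.
        for (int i = lo; i < hi; ++i)
            if (particles[i].ip == 3)
                ++Len[t][GetIndex(particles[i])];

        #pragma omp barrier
        #pragma omp single
        {
            // Prefix sum in (cell, thread) order: within each cell, thread 0's
            // particles go first in MIndex, then thread 1's, and so on.
            int pos = 0;
            for (int ic = 0; ic < cub3; ++ic)
                for (int tt = 0; tt < nthreads; ++tt) {
                    Beg[tt][ic] = pos;
                    pos += Len[tt][ic];
                }
        }   // implicit barrier at the end of single

        // Pass 2: same chunk, same filter. No atomics: each thread only
        // touches its own Len[t][] / Beg[t][] entries, and the MIndex slots
        // it computes are disjoint from every other thread's by construction.
        for (int i = lo; i < hi; ++i)
            if (particles[i].ip == 3) {
                const int ic = GetIndex(particles[i]);
                MIndex[Beg[t][ic] + --Len[t][ic]] = i;   // Len[t][ic] ends at 0
            }
    }
}

Within a cell the indices come out in an arbitrary order, which the update to the question says is acceptable.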
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread, which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for MIndex[] stores.)

ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
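For illustration, a relaxed-atomic version of the capture could look like the sketch below, using std::atomic directly instead of the OpenMP pragma. AtomicCounter is a hypothetical replacement for the question's Counter[]; whether this actually beats #pragma omp atomic capture is ISA- and compiler-dependent, and on x86 the RMW is still a full barrier.

#include <atomic>

// Hypothetical std::atomic replacement for Counter[]; static storage
// duration means it starts out zeroed.
std::atomic<int> AtomicCounter[cub3];

void fill_mindex_relaxed(int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        if (particles[i].ip == 3) {
            const int ic = GetIndex(particles[i]);
            // fetch_add returns the previous value, so cnt is still unique
            // within each cell, but the increment is only required to be
            // atomic, not a full memory barrier.
            const int cnt = AtomicCounter[ic].fetch_add(1, std::memory_order_relaxed);
            MIndex[Begin[ic] + cnt] = i;
        }
}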
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that the cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem.)

Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.

It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.

Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.

Filtering by p.kz ranges maybe makes the most sense, since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].

But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work. Maybe a couple of fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
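A sketch of that filtering idea, assuming a roughly uniform distribution over kz. It reuses the question's arrays; the function name fill_mindex_by_kz and the even slicing of the kz range are my own, and Length[]/Begin[] are assumed to have been built already with the same ip==3 filter.

#include <omp.h>

void fill_mindex_by_kz(int N)
{
    #pragma omp parallel
    {
        const int t  = omp_get_thread_num();
        const int nt = omp_get_num_threads();
        // Thread t owns cells whose kz lies in [kz_lo, kz_hi). Since kz has
        // the largest multiplier in GetIndex, that is a contiguous block of
        // Counter[] / Begin[] entries that no other thread touches.
        const int kz_lo = -cuba + (2 * cuba) * t / nt;
        const int kz_hi = -cuba + (2 * cuba) * (t + 1) / nt;

        for (int i = 0; i < N; ++i) {                  // every thread scans all points
            if (particles[i].ip != 3) continue;
            const int kz = particles[i].kz;
            if (kz < kz_lo || kz >= kz_hi) continue;   // not this thread's slice
            const int ic = GetIndex(particles[i]);
            const int cnt = Counter[ic]++;             // non-atomic: single owner
            MIndex[Begin[ic] + cnt] = i;
        }
    }
}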
Re: your edit
You already loop over the whole array of points doing Length[ic]++? It seems redundant to do the same histogramming work again with Counter[ic]++, but it's not obvious how to avoid it. The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. That should scale perfectly with threads since the count array is small enough for L1d cache.

BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 16*16*16 * sizeof(int) = 16 KiB, just a few pages, so it's just two small memsets. (Of course, if each thread has its own separate Length array, each thread needs to zero its own.) That's small enough to even consider unrolling, with maybe 2 count arrays per thread, to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native, since it's integer work.

For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down with --Length[i], and the other counting up from 0 with Counter[i]++. With different threads looking at different points, this could give you a factor-of-2 speedup. (Modulo contention for MIndex stores.) To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top.
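The count-up / count-down idea from the last paragraph, as a sketch. It assumes Length[] and Begin[] were built with the same ip==3 filter as this loop, that Counter[] has been zeroed, and that the question's arrays are in scope; the function name is illustrative.

void fill_mindex_two_ways(int N)
{
    #pragma omp parallel sections
    {
        #pragma omp section            // first half: hand out slots from the bottom up
        for (int i = 0; i < N / 2; ++i)
            if (particles[i].ip == 3) {
                const int ic = GetIndex(particles[i]);
                MIndex[Begin[ic] + Counter[ic]++] = i;
            }

        #pragma omp section            // second half: hand out slots from the top down
        for (int i = N / 2; i < N; ++i)
            if (particles[i].ip == 3) {
                const int ic = GetIndex(particles[i]);
                MIndex[Begin[ic] + --Length[ic]] = i;
            }
    }
    // Per cell, the two sections together hand out exactly Length[ic]
    // distinct slots, so their MIndex stores never collide.
}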
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have a mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.

EDIT: the second line, however, is not of the right form for an atomic operation, so you have to make it critical. That is going to make performance even worse.

Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.
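Taken literally, the fix described in this answer would look something like the sketch below (with the caveat already given above that the critical section will probably make it slower than running serially):

#pragma omp parallel for
for (int i = 0; i < N; ++i)
    if (particles[i].ip == 3) {
        int ic = GetIndex(particles[i]);
        int cnt;
        #pragma omp atomic capture
        cnt = Counter[ic]++;
        #pragma omp critical      // serializes the MIndex writes
        MIndex[Begin[ic] + cnt] = i;
    }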