C++: padding `std::vector` with `.reserve()` as a way to prevent cache invalidation and false sharing across threads
I have a program with the general structure shown below. Basically, I have a vector of objects. Each object has member vectors, and one of those is a vector of structs that contain more vectors. By multithreading, the objects are operated on in parallel, doing computation that involves much accessing and modifying of member vector elements. One object is accessed by only one thread at a time, and is copied to that thread's stack for processing.

The problem is that the program fails to scale up to 16 cores. I suspect, and have been advised, that the issue may be false sharing and/or cache invalidation. If this is true, it seems that the cause must be vectors allocating memory too close to each other, as it is my understanding that both problems are (in simple terms) caused by proximal memory addresses being accessed simultaneously by different processors. Does this reasoning make sense? Is it likely that this could happen? If so, it seems that I could solve this problem by padding the member vectors using `.reserve()` to add extra capacity, leaving large spaces of empty memory between vector arrays. So, does all this make any sense? Am I totally out to lunch here?
struct str {
    vector<float> a;
    vector<int>   b;
    vector<bool>  c;
};

class object {
public:
    vector<str>   a;
    vector<int>   b;
    vector<float> c;
    // more vectors, etc ...
    void DoWork(); // heavy use of vectors
};

int main() {
    vector<object> objs;
    vector<object>* p_objs = &objs;
    // ...make `thread_list` and `attr`
    for (int q = 0; q < NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, p_objs);
    // ...
}

void* Consumer(void* argument) {
    vector<object>* p_objs = (vector<object>*) argument;
    while (1) {
        int index = queued++;          // imagine `queued` is a thread-safe global counter
        object obj = (*p_objs)[index]; // copy the object to this thread's stack
        obj.DoWork();
        (*p_objs)[index] = obj;        // copy the result back
    }
    return NULL;
}
Well, the last vector copied in thread 0 is `objs[0].c`. The first vector copied in thread 1 is `objs[1].a[0].a`. So if their two blocks of allocated data happen to both occupy the same cache line (64 bytes, or whatever it actually is for that CPU), you'd have false sharing.

And of course the same is true of any two vectors involved, but just for the sake of a concrete example I have pretended that thread 0 runs first and does its allocation before thread 1 starts allocating, and that the allocator tends to make consecutive allocations adjacent.
`reserve()` might prevent the parts of that block that you're actually acting on from occupying the same cache line. Another option would be per-thread memory allocation -- if those vectors' blocks are allocated from different pools, then they can't possibly occupy the same line unless the pools do.

If you don't have per-thread allocators, the problem could be contention on the memory allocator, if `DoWork` reallocates the vectors a lot. Or it could be contention on any other shared resource used by `DoWork`. Basically, imagine that each thread spends 1/K of its time doing something that requires global exclusive access. Then it might appear to parallelize reasonably well up to a certain number J <= K, at which point acquiring the exclusive access significantly eats into the speed-up, because cores spend a significant proportion of their time idle. Beyond K cores there's approximately no improvement at all with extra cores, because the shared resource cannot work any faster.

At the absurd end of this, imagine some work that spends 1/K of its time holding a global lock, and (K-1)/K of its time waiting on I/O. Then the problem appears to be embarrassingly parallel almost up to K threads (irrespective of the number of cores), at which point it stops dead.
So, don't focus on false sharing until you've ruled out true sharing ;-)