Memory management for Gaussian elimination
The A matrix is created on processor 0 and scattered to the other processors. A is a symmetric dense matrix; that's why it is initialized on processor 0.
The matrix is created in this way:
A = malloc(sizeof(double) * N * N);
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        A(i,j) = rand() % 10; // The code will be changed.
A(i,j) is defined as:
#define A(i,j) A[i*N+j]
and N has to be 100,000 to test the algorithm.
The problem here is: if N = 100,000, then the memory needed is approximately 76 GB (100,000² doubles at 8 bytes each). How would you suggest storing the A matrix?
PS: The algorithm works very well when N < 20,000 and the cluster is a distributed-memory system (2 GB RAM per processor).
Answers (3)
If you are doing this, as stated in the comments, to do a scaling test, then Oli Charlesworth is completely right; anything you do is going to make this an apples-to-oranges comparison, because your node doesn't have 76 GB to use. Which is fine; one of the big reasons to use MPI is to tackle problems that couldn't fit on one node. But by trying to shoehorn 76 GB of data onto one processor, the comparison you're doing isn't going to make any sense. As mentioned by both Oli Charlesworth and caf, there are various methods that let you use disk instead of RAM, but then your one-processor answer is not going to be directly comparable to the fits-in-RAM numbers you get from larger numbers of nodes, so you'd be doing a lot of work to get a number which won't actually mean anything.
If you want scaling results on this sort of problem, you either start with the smallest number of nodes that the problem does fit on and take data at increasing numbers of processors, or you do weak scaling rather than strong scaling tests -- you keep the work per processor constant while scaling up the number of processors, rather than keeping the total work constant.
Incidentally, however you do the measurements, you'll end up with better results if, as Oli Charlesworth suggests, you have each processor generate its own data, rather than creating a serial bottleneck by having rank 0 generate the matrix and then having all the processors receive their parts.
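Here is a minimal sketch of that approach, assuming a block-row distribution in which the number of ranks divides N evenly; local_A and the entry() helper are illustrative names, and entry() is a simple deterministic hash used in place of rand() so that each rank can fill its own rows independently while A stays symmetric:

/* Sketch: each rank generates only its own block of rows, so no node ever
 * holds the full 76 GB and there is no serial generation on rank 0.
 * The block-row layout and the entry() hash are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define N 100000UL

/* Deterministic value depending only on the unordered pair (i, j),
 * so A(i,j) == A(j,i) without any communication. */
static double entry(size_t i, size_t j)
{
    size_t a = i < j ? i : j, b = i < j ? j : i;
    return (double)((a * 2654435761UL + b * 40503UL) % 10);
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    size_t rows = N / (size_t)nprocs;       /* assume nprocs divides N */
    size_t first = (size_t)rank * rows;     /* first global row on this rank */

    double *local_A = malloc(sizeof(double) * rows * N);
    if (local_A == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Each rank fills only its own rows: no 76 GB buffer on rank 0
     * and no serial generation bottleneck. */
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < N; j++)
            local_A[i * N + j] = entry(first + i, j);

    /* ... distributed Gaussian elimination on local_A ... */

    free(local_A);
    MPI_Finalize();
    return 0;
}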
If you are programming on a POSIX system with sufficient virtual address space (which in practice will mean a 64-bit system), you can use mmap(). Either create an anonymous mapping of the required size (this will be swap-backed, which will mean you'll need at least 76 GB of swap), or create a real file of the required size and map that.
The file-backed solution has the advantage that if your cluster has a shared file system, you don't need to explicitly transfer the matrix to each processor - you can simply msync() it after creating it, and then map the right region on each processor.
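A rough sketch of the file-backed variant, assuming a 64-bit POSIX system and a shared file system; the path /shared/A.bin and the fill step are illustrative, not part of the original code:

/* Sketch: store A in a file on a shared file system and mmap() it.
 * /shared/A.bin is an illustrative path, not a real requirement. */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define N 100000UL

int main(void)
{
    size_t bytes = sizeof(double) * N * N;   /* about 75 GiB */

    /* Create the backing file and extend it to the full matrix size. */
    int fd = open("/shared/A.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)bytes) != 0) {
        perror("backing file");
        return 1;
    }

    /* Map the whole matrix; pages are read and written on demand. */
    double *A = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (A == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Fill A here, e.g. A[(size_t)i * N + j] = ...; then flush dirty pages
     * to the file so the other nodes can map the regions they need. */
    if (msync(A, bytes, MS_SYNC) != 0)
        perror("msync");

    munmap(A, bytes);
    close(fd);
    return 0;
}

With an anonymous mapping instead (MAP_ANONYMOUS and fd = -1), the data is swap-backed rather than file-backed, which is where the "at least 76 GB of swap" above comes in.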
If you can switch to C++, you might look into STXXL, which is an STL implementation specifically designed for huge datasets, with transparent disk-backed support, etc.