OpenMP becomes significantly slower once the array reaches a certain size
I am trying to solve a homework example with parallel computing. Here is the given code snippet:
for (int i=0; i < n-1; i++) {
    x[i] = (y[i] + x[i+1]) / 7;
}
We have to analyze the snippet regarding its dependencies, try to parallelize it, and benchmark it with a proper sample size. My first attempt was simply to copy x to remove the anti-dependency:
double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to prevent i + 1 in loop we start x[1] and copy n - 1 elements
#pragma omp parallel for
for (int i = 0; i < n-1; i++) {
    x[i] = (y[i] + x_old[i]) / 7;
}
This code works; I checked the serial and parallel checksums of the array and they are identical. Without counting the memcpy(), the speedup is about 1.6-2 for 100_000_000-400_000_000 array elements. However, if I increase the array size further (at around 470_000_000 elements), the speedup drops drastically; at 500_000_000 elements it is only 0.375.
It does not matter how many threads I use.
I benchmarked it (speedup-wise) on 3 different devices:
Lenovo IdeaPad 5
desktop PC
and on one of our cluster nodes
[benchmark screenshot: 3ub5u.png]
I have put together a minimal example here:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include "omp.h"

void init(double *arr, int size);
double sum(double *arr, int size);

int main(int argc, char **argv){
    if(argc != 3){
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);
    omp_set_num_threads(threads);

    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    init(x, n);
    init(y, n);

    double start = omp_get_wtime();
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x[i+1]) / 7;
    }
    double end = omp_get_wtime();
    double elapsed_s = end - start;
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);
    printf("single checksum: %f\n", checksum_s);

    init(x, n); //reinit
    init(y, n); //reinit

    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to prevent i + 1 in loop we start x[1] and copy n - 1 elements

    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x_old[i]) / 7;
    }
    end = omp_get_wtime();
    double elapsed_p = end - start;
    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);
    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);

    printf("%s\n", fabs((checksum_p - checksum_s)) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
}

void init(double *arr, int size){
    for(int i = 0; i < size; ++i){
        arr[i] = 1.0;
    }
}

double sum(double *arr, int size){
    double sum = 0.0;
    for(int i = 0; i < size; ++i){
        sum += arr[i];
    }
    return sum;
}
And here is an example benchmark:
[benchmark screenshot]
2 Answers
The problem certainly comes from swap memory. Indeed, the code allocates 3 arrays of 500_000_000 items, resulting in 4 GiB of RAM required for each array and 12 GiB overall. The thing is, two of the benchmarking machines have only 16 GiB of RAM available, and the OS as well as the running applications generally already require 1-4 GiB of RAM. When the amount of available physical memory becomes small, the OS starts to move memory pages to a space called swap memory, which is typically mapped onto your storage device and is far slower than regular RAM. Note that some operating systems like Windows already use ZRAM so as to compress data when the amount of available memory becomes small. Mainstream OSs (Windows, Linux and macOS) also need some additional space for buffers (e.g. storage device buffers, networking, etc.). The allocation of memory pages can also become slower when the amount of free memory becomes small. Also note that, on mainstream systems, pages are allocated in RAM only on the first touch. For more information about this, please read this article.
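As a quick back-of-the-envelope check of these numbers (my own illustration, not part of the answer, assuming 8-byte double and the three arrays x, y and x_old from the minimal example):

#include <stdio.h>

int main(void) {
    size_t n = 500000000;                    /* 500_000_000 elements          */
    size_t per_array = n * sizeof(double);   /* 4e9 bytes, roughly 3.7 GiB    */
    size_t total = 3 * per_array;            /* x, y and x_old together       */
    printf("per array: %.1f GiB\n", per_array / (1024.0 * 1024 * 1024));
    printf("total    : %.1f GiB\n", total / (1024.0 * 1024 * 1024));
    return 0;
}

Roughly 11 GiB of arrays on a 16 GiB machine leaves only a few GiB for the OS, buffers and other applications, which matches the point above.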
Allocating a huge x_old array is not efficient (in both time and space). Indeed, the memcpy will take a lot of time to copy the data in memory, and this operation is limited by the throughput of the RAM, which is generally low. You can speed up the operation by operating on chunks. The idea is to copy a small data block (of, let's say, 16 KiB) in each thread so the operation can be done in the L1 cache of each core. This means that the cost of the memcpy will nearly vanish, since CPU caches are much faster than the RAM (in both latency and throughput). In fact, the L1 cache throughput of the notebook/desktop CPU should be >100 GiB/s per core, resulting in a cumulative throughput >600 GiB/s, while the throughput of the RAM should not exceed 40 GiB/s in practice (15x slower).
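The answer gives no code for this, but here is one way the chunked approach could look. This is my own sketch: the 16 KiB block size comes from the answer, while the contiguous per-thread ranges and the edge array used to resolve the boundary between neighbouring ranges are assumptions of mine.

#include <stdlib.h>
#include <string.h>
#include <omp.h>

void compute_chunked(double *x, const double *y, int n)
{
    enum { BLOCK = 2048 };                    /* 2048 doubles = 16 KiB per block */
    int m = n - 1;                            /* number of elements to update    */
    double *edge = malloc((size_t)omp_get_max_threads() * sizeof *edge);

    #pragma omp parallel
    {
        int nthr = omp_get_num_threads();
        int tid  = omp_get_thread_num();
        int per  = (m + nthr - 1) / nthr;     /* contiguous range per thread     */
        int lo   = tid * per;
        int hi   = (lo + per < m) ? lo + per : m;

        /* x[hi] is the one value this thread reads from the next thread's
         * range, so save it before any thread starts writing. */
        if (lo < hi)
            edge[tid] = x[hi];
        #pragma omp barrier

        double buf[BLOCK];                    /* private scratch block, fits in L1 */
        for (int start = lo; start < hi; start += BLOCK) {
            int len = (start + BLOCK < hi) ? BLOCK : hi - start;

            /* Copy the original x[start+1 .. start+len] into the small buffer. */
            memcpy(buf, &x[start + 1], (size_t)(len - 1) * sizeof *buf);
            buf[len - 1] = (start + len == hi) ? edge[tid] : x[start + len];

            for (int i = 0; i < len; i++)
                x[start + i] = (y[start + i] + buf[i]) / 7;
        }
    }
    free(edge);
}

Each memcpy now moves only 16 KiB, which stays in the L1 cache of the copying core, so only x and y have to be streamed through the RAM.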
For the cluster node, it is a bit more complex since it is subject to NUMA effects. NUMA architectures are pretty complex. For more information, I advise you to read this article first, then read Explanation for why effective DRAM bandwidth reduces upon adding CPUs (including the related links) and Problem of sorting OpenMP threads into NUMA nodes by experiment. Put shortly, you need to care about where the pages are allocated (the allocation policy). The key is to control the first touch of the pages (on most platforms).
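As an illustration of the first-touch point (my own sketch, assuming the default first-touch allocation policy of e.g. Linux): initialize the arrays in parallel with the same schedule(static) partition as the compute loop, so each page is first written, and therefore physically placed, on the NUMA node of the thread that will later work on it.

#include <omp.h>

/* Parallel first-touch initialization: uses the same schedule(static)
 * partition as the compute loop, so every page of x and y is first written
 * (and therefore physically allocated) by the thread that will later work
 * on that part of the arrays. */
void numa_friendly_init(double *x, double *y, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        x[i] = 1.0;   /* the first touch decides which NUMA node the page lands on */
        y[i] = 1.0;
    }
}

With the serial init() of the minimal example, every page is first touched by the main thread, so on a NUMA machine all pages tend to land on a single node, which every thread then has to reach over the interconnect.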
A few issues ...

Using x_old for the parallel version (e.g. x, y, x_old) uses 3 arrays instead of 2 (e.g. x, y), which puts an additional strain on the memory controller and cache, so the comparison is not valid. The parallel version is at a distinct disadvantage.

There, the speaker benchmarks similar things. His conclusion was that anything more than 4 threads will swamp the memory bus (i.e. any more actually reduces performance). The additional threads just add extra overhead due to the cost of switching threads, cache misses, and reduced locality of reference.
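As a sketch of what a more even comparison could look like (my own illustration, not from the answer), the serial baseline can be made to stream the same three arrays as the parallel loop:

/* Serial baseline that reads the same x_old copy as the parallel version,
 * so both loops move three arrays (x, y, x_old) through the memory system
 * and the timings become comparable. */
void serial_baseline(double *x, const double *y, const double *x_old, int n)
{
    for (int i = 0; i < n - 1; i++)
        x[i] = (y[i] + x_old[i]) / 7;
}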
For example, I just ran your program with 1000000 8 and then with 1000000 4; when I ran it another time with 1000000 8, the time differed noticeably from the first 8-thread run.
So, 4 is faster than 8. And, two runs of 8 produced [wildly] differing results.
UPDATE:
I refactored the program to run multiple tests as mentioned above and ran it with 1000000 16.
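The refactored program and its output are not reproduced above; purely as an illustration of the idea (my own sketch, not the answerer's code), a harness that times the serial loop once and then the parallel loop for increasing thread counts could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

static void init(double *arr, int size)
{
    for (int i = 0; i < size; ++i)
        arr[i] = 1.0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        printf("Wrong use of program! (1) Size of array, (2) max num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int max_threads = atoi(argv[2]);

    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    double *x_old = malloc(sizeof(*x_old) * n);

    /* Serial reference, timed once. */
    init(x, n);
    init(y, n);
    double start = omp_get_wtime();
    for (int i = 0; i < n - 1; i++)
        x[i] = (y[i] + x[i + 1]) / 7;
    double elapsed_s = omp_get_wtime() - start;
    printf("serial: %fs\n", elapsed_s);

    /* Parallel version, timed for 1, 2, 4, ... up to max_threads threads. */
    for (int t = 1; t <= max_threads; t *= 2) {
        init(x, n);
        init(y, n);
        memcpy(x_old, &x[1], (n - 1) * sizeof(*x));

        start = omp_get_wtime();
        #pragma omp parallel for num_threads(t)
        for (int i = 0; i < n - 1; i++)
            x[i] = (y[i] + x_old[i]) / 7;
        double elapsed_p = omp_get_wtime() - start;

        printf("threads %2d: %fs  speedup %.3f\n", t, elapsed_p, elapsed_s / elapsed_p);
    }

    free(x);
    free(y);
    free(x_old);
    return EXIT_SUCCESS;
}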