向量的并行求和

发布于 2024-11-14 04:24:54 字数 765 浏览 1 评论 0原文

有人可以提供一些关于如何通过多线程减少以下 for 循环的运行时间的建议吗?假设我还有两个向量,分别称为“a”和“b”。

for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}

这个 for 循环在我的程序中执行了很多次。上面for循环中的两个操作已经优化了,但是它们只运行在一个核心上。不过,我有 16 个可用核心,并且想利用它们。

我尝试按如下方式修改循环。我没有向量“a”,而是有 16 个向量,并假设第 i 个向量称为 a[i]。我的 for 循环现在看起来就像

for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }

}

我在每个内部循环之前添加“#pragma omp parallel for”,在每个内部循环上使用 OpenMp。我的所有处理器都在使用,但我的运行时间只会显着增加。有人对如何减少该循环的运行时间有任何建议吗?先感谢您。

Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.

for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}

This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.

I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like

for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }

}

I use the OpenMp on each of the for loops inside by adding '#pragma omp parallel for' before each of the inner loops. All of my processors are in use but my runtime only increases significantly. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank You in Advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

回首观望 2024-11-21 04:24:54

无论您在何处插入 pragma 标记,omp 都会为您的程序创建线程,因此它会为内部标记创建线程,但问题是创建了 16 个线程,每个线程执行 1 个操作,然后使用您的方法将所有线程销毁。创建和销毁线程需要花费大量时间,因此您使用的方法会增加进程的总体时间,尽管它使用了全部 16 个核心。您不必创建内部 fors,只需在 8000 循环之前放置 #pragma omp parallel for 标记,由 omp 来分隔踏板之间的值,因此您创建第二个循环所做的工作是 omp 的工作。这样,omp 仅创建一次线程,然后使用每个线程处理 500 个数字,然后结束所有这些数字(少使用 499 个线程创建和销毁)

omp creates threads for your program whereever you insert pragma tag, so it's createing threads for inner tags but the problem is 16 threads are created, each one does 1 operation and then all of them are destroyed using your method. creating and destroying threads take a lot of time so the method you used increases the overal time of your process although it uses all 16 cores. you didn't have to create inner fors just put #pragma omp parallel for tag before your 8000 loop it's up to omp to seperate values between treads so what you did to create the second loop, is omp's job. that way omp create threads only once and then process 500 numbers useing that each thread and end all of them after that (using 499 less thread creation and destruction)

春花秋月 2024-11-21 04:24:54

事实上,我将把这些评论放在一个答案中。

为琐碎的操作分叉线程只会增加开销。

首先,确保您的编译器使用向量指令来实现循环。 (如果它不知道如何执行此操作,您可能必须自己使用向量指令进行编码;尝试搜索“SSE instrinsics”。但是对于这种简单的向量相加,自动向量化应该是可能的。)

假设您的编译器是一个相当现代的 GCC,通过以下方式调用它:

gcc -O3 -march=native ...

添加 -ftree-vectorizer-verbose=2 以了解它是否自动矢量化您的循环以及原因。

如果您已经在使用向量指令,那么您的内存带宽可能会饱和。现代 CPU 内核速度相当快...如果是这样,您需要在更高级别上进行重组,以便在循环的每次迭代中获得更多操作,找到对适合 L1 缓存内部的块执行大量操作的方法。

Actually, I am going to put these comments in an answer.

Forking threads for trivial operations just adds overhead.

First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE instrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)

Assuming your compiler is a reasonably modern GCC, invoke it with:

gcc -O3 -march=native ...

Add -ftree-vectorizer-verbose=2 to find out whether or not it auto-vectorized your loop and why.

If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.

如果没有 2024-11-21 04:24:54

有人对如何减少此循环的运行时间有任何建议吗?

for (int j = 0; j < 500; j++){  // outer loop
  for (int i = 0; i < 16; i++){  // inner loop

始终尝试使外循环迭代少于内循环。这将使您免于多次内循环初始化。在上面的代码中,内部循环 i = 0; 被初始化 500 次。现在,

for (int i = 0; j < 16; i++){  // outer loop
  for (int j = 0; j < 500; j++){  // inner loop

现在,内部循环 j = 0; 仅初始化了 16 次!
如果有任何影响,请尝试相应地修改您的代码。

Does anyone have any suggestions on how I can decrease the runtime of this loop?

for (int j = 0; j < 500; j++){  // outer loop
  for (int i = 0; i < 16; i++){  // inner loop

Always try to make outer loop iterations lesser than inner loop. This will save you from inner loop initializations that many times. In above code inner loop i = 0; is initialized 500 times. Now,

for (int i = 0; j < 16; i++){  // outer loop
  for (int j = 0; j < 500; j++){  // inner loop

Now, inner loop j = 0; is initialized only 16 times !
Give a try by modifying your code accordingly, if it makes any impact.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文