向量的并行求和
有人可以提供一些关于如何通过多线程减少以下 for 循环的运行时间的建议吗?假设我还有两个向量,分别称为“a”和“b”。
for (int j = 0; j < 8000; j++){
// Perform an operation and store in the vector 'a'
// Add 'a' to 'b' coefficient wise
}
这个 for 循环在我的程序中执行了很多次。上面for循环中的两个操作已经优化了,但是它们只运行在一个核心上。不过,我有 16 个可用核心,并且想利用它们。
我尝试按如下方式修改循环。我没有向量“a”,而是有 16 个向量,并假设第 i 个向量称为 a[i]。我的 for 循环现在看起来就像
for (int j = 0; j < 500; j++){
for (int i = 0; i < 16; i++){
// Perform an operation and store in the vector 'a[i]'
}
for (int i = 0; i < 16; i++){
// Add 'a[i]' to 'b' coefficient wise
}
}
我在每个内部循环之前添加“#pragma omp parallel for”,在每个内部循环上使用 OpenMp。我的所有处理器都在使用,但我的运行时间只会显着增加。有人对如何减少该循环的运行时间有任何建议吗?先感谢您。
Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.
for (int j = 0; j < 8000; j++){
// Perform an operation and store in the vector 'a'
// Add 'a' to 'b' coefficient wise
}
This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.
I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like
for (int j = 0; j < 500; j++){
for (int i = 0; i < 16; i++){
// Perform an operation and store in the vector 'a[i]'
}
for (int i = 0; i < 16; i++){
// Add 'a[i]' to 'b' coefficient wise
}
}
I use the OpenMp on each of the for loops inside by adding '#pragma omp parallel for' before each of the inner loops. All of my processors are in use but my runtime only increases significantly. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank You in Advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
无论您在何处插入 pragma 标记,omp 都会为您的程序创建线程,因此它会为内部标记创建线程,但问题是创建了 16 个线程,每个线程执行 1 个操作,然后使用您的方法将所有线程销毁。创建和销毁线程需要花费大量时间,因此您使用的方法会增加进程的总体时间,尽管它使用了全部 16 个核心。您不必创建内部 fors,只需在 8000 循环之前放置
#pragma omp parallel for
标记,由 omp 来分隔踏板之间的值,因此您创建第二个循环所做的工作是 omp 的工作。这样,omp 仅创建一次线程,然后使用每个线程处理 500 个数字,然后结束所有这些数字(少使用 499 个线程创建和销毁)omp creates threads for your program whereever you insert pragma tag, so it's createing threads for inner tags but the problem is 16 threads are created, each one does 1 operation and then all of them are destroyed using your method. creating and destroying threads take a lot of time so the method you used increases the overal time of your process although it uses all 16 cores. you didn't have to create inner fors just put
#pragma omp parallel for
tag before your 8000 loop it's up to omp to seperate values between treads so what you did to create the second loop, is omp's job. that way omp create threads only once and then process 500 numbers useing that each thread and end all of them after that (using 499 less thread creation and destruction)事实上,我将把这些评论放在一个答案中。
为琐碎的操作分叉线程只会增加开销。
首先,确保您的编译器使用向量指令来实现循环。 (如果它不知道如何执行此操作,您可能必须自己使用向量指令进行编码;尝试搜索“SSE instrinsics”。但是对于这种简单的向量相加,自动向量化应该是可能的。)
假设您的编译器是一个相当现代的 GCC,通过以下方式调用它:
添加
-ftree-vectorizer-verbose=2
以了解它是否自动矢量化您的循环以及原因。如果您已经在使用向量指令,那么您的内存带宽可能会饱和。现代 CPU 内核速度相当快...如果是这样,您需要在更高级别上进行重组,以便在循环的每次迭代中获得更多操作,找到对适合 L1 缓存内部的块执行大量操作的方法。
Actually, I am going to put these comments in an answer.
Forking threads for trivial operations just adds overhead.
First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE instrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)
Assuming your compiler is a reasonably modern GCC, invoke it with:
Add
-ftree-vectorizer-verbose=2
to find out whether or not it auto-vectorized your loop and why.If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.
有人对如何减少此循环的运行时间有任何建议吗?
始终尝试使外循环迭代少于内循环。这将使您免于多次内循环初始化。在上面的代码中,内部循环
i = 0;
被初始化500
次。现在,现在,内部循环
j = 0;
仅初始化了 16 次!如果有任何影响,请尝试相应地修改您的代码。
Does anyone have any suggestions on how I can decrease the runtime of this loop?
Always try to make outer loop iterations lesser than inner loop. This will save you from inner loop initializations that many times. In above code inner loop
i = 0;
is initialized500
times. Now,Now, inner loop
j = 0;
is initialized only 16 times !Give a try by modifying your code accordingly, if it makes any impact.