Multithreading efficiency
Suppose I have code like this:
    for(i = 0; i < i_max; i++)
        for(j = 0; j < j_max; j++)
            // do something
and I want to run it on several threads (assume the // do something tasks are independent of each other; think of Monte Carlo simulations, for instance). My question is this: is it necessarily better to create a thread for each value of i than to create a thread for each value of j? Something like this:
    for(i = 0; i < i_max; i++)
        create_thread(j_max);
Additionally, what would be a suitable number of threads? Should I just create i_max threads or, perhaps, use a semaphore so that only k < i_max threads run concurrently at any given time?
Thank you,
Comments (5)
The best way to apportion the work depends on the workload.
Broadly: for a parallelizable workload, use OpenMP; for a heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.
Monte Carlo simulation should be a good candidate for truly parallel code rather than a thread pool.
By the way, if you are on Visual C++, Visual C++ v10 has an interesting new Concurrency Runtime for precisely this type of problem. It is somewhat analogous to the Task Parallel Library that was added to .NET Framework 4 to ease the implementation of multicore/multi-CPU code.
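To make the OpenMP suggestion concrete, here is a minimal sketch that parallelizes the outer loop of a Monte Carlo estimate of pi. The function name estimate_pi, the seed parameter and the choice of pi as the quantity being estimated are assumptions made for illustration; they do not come from the question.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Estimates pi by sampling i_max * j_max random points in the unit square.
// The outer loop is shared among OpenMP threads; each thread runs the inner
// loop sequentially. Build with -fopenmp (GCC/Clang) or /openmp (MSVC);
// without that flag the pragma is ignored and the loop simply runs serially.
double estimate_pi(int i_max, int j_max, std::uint64_t seed = 12345) {
    assert(i_max > 0 && j_max > 0);
    long long hits = 0;
    #pragma omp parallel for reduction(+ : hits)
    for (int i = 0; i < i_max; i++) {
        // One generator per outer iteration (an illustrative choice), so no
        // generator state is shared between threads and the result does not
        // depend on how many threads are used.
        std::mt19937_64 rng(seed + static_cast<std::uint64_t>(i));
        std::uniform_real_distribution<double> unit(0.0, 1.0);
        for (int j = 0; j < j_max; j++) {
            double x = unit(rng);
            double y = unit(rng);
            if (x * x + y * y <= 1.0) hits++;
        }
    }
    return 4.0 * static_cast<double>(hits) / (static_cast<double>(i_max) * j_max);
}
```

Calling estimate_pi(200, 1000) draws 200,000 samples. The reduction clause gives every thread a private copy of hits and sums the copies at the end, so the loop body needs no explicit locking.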
Avoid creating threads unless you can keep them busy!
If your scenario is compute-bound, then you should limit the number of threads you spawn to the number of cores you expect your code to run on. If you create more threads than you have cores, the OS has to waste time and resources scheduling the threads onto the available cores.
If your scenario is IO-bound, then you should consider queuing async IO operations and checking their response codes once the async results are returned. Again, in this case, spawning a thread per IO operation is hugely wasteful, as you'll force the OS to waste time scheduling threads that are stalled.
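For the compute-bound case, one way to cap the thread count without a semaphore is to start k threads once and let each claim the next value of i from a shared counter. The following is only a sketch of that idea: do_something, pick_thread_count and run_on_k_threads are hypothetical names, and the body of do_something is a placeholder for the real work.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the "// do something" body in the question.
inline double do_something(int i, int j) { return static_cast<double>(i) * j; }

// Picks a thread count for compute-bound work: one per hardware thread,
// never more than there are tasks, and always at least one.
unsigned pick_thread_count(unsigned tasks) {
    unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (hw == 0) hw = 1;
    return std::max(1u, std::min(hw, tasks));
}

// Runs all (i, j) iterations on k threads. Each thread repeatedly claims the
// next unprocessed value of i from a shared atomic counter, so at most k
// iterations of the outer loop are in flight at any time.
double run_on_k_threads(int i_max, int j_max, unsigned k) {
    assert(k > 0);
    std::atomic<int> next_i{0};
    std::vector<double> partial(k, 0.0);  // one result slot per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < k; t++) {
        workers.emplace_back([&, t] {
            double local = 0.0;  // accumulate locally, write the slot once
            for (int i = next_i.fetch_add(1); i < i_max; i = next_i.fetch_add(1))
                for (int j = 0; j < j_max; j++)
                    local += do_something(i, j);
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

A call such as run_on_k_threads(i_max, j_max, pick_thread_count(i_max)) keeps one thread per core busy. Because threads pull work as they finish, uneven iteration times are balanced automatically.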
Everyone here is basically right, but here's a quick-and-dirty way to split up the work and keep all of the processors busy. This works best when 1) creating threads is expensive compared to the work done in an iteration, and 2) most iterations take about the same amount of time to complete.
First, create 1 thread per processor/core. These are your worker threads. They sit idle until they're told to do something.
Now, split up your work so that data that is needed at the same time is close together. What I mean by that is that if you were processing a ten-element array on a two-processor machine, you'd split it up so that one group is elements 1,2,3,4,5 and the other is 6,7,8,9,10. You may be tempted to split it up 1,3,5,7,9 and 2,4,6,8,10, but then you're going to cause more false sharing (http://en.wikipedia.org/wiki/False_sharing) in your cache.
So now that you have a thread per processor and a group of data for each thread, you just tell each thread to work on an independent group of that data.
So in your case I'd do something like this.
Of course I left out things like dealing with your data not being an integer multiple of the number of processors, but those are easily fixed.
Also, if you're not averse to 3rd party libraries, Intel's TBB (Threading Building Blocks) does a great job of abstracting this from you and letting you get to the real work you want to do.
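The contiguous split described in this answer could be sketched roughly as follows. This is not the answerer's own code: split_range and square_all are hypothetical names, squaring is a placeholder for the real per-element work, and for brevity the threads are created per call rather than kept idle as persistent workers.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Splits [0, n) into `parts` contiguous half-open ranges whose sizes differ
// by at most one, so n need not be a multiple of `parts`.
std::vector<std::pair<std::size_t, std::size_t>> split_range(std::size_t n,
                                                             std::size_t parts) {
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    std::size_t base = n / parts, extra = n % parts, begin = 0;
    for (std::size_t p = 0; p < parts; p++) {
        std::size_t len = base + (p < extra ? 1 : 0);  // spread the remainder
        blocks.emplace_back(begin, begin + len);
        begin += len;
    }
    return blocks;
}

// Squares every element in place, one contiguous block per worker thread.
// Neighbouring elements stay on the same thread, which limits false sharing
// to the cache lines at block boundaries.
void square_all(std::vector<double>& data, std::size_t workers) {
    std::vector<std::thread> pool;
    for (const auto& block : split_range(data.size(), workers)) {
        std::size_t begin = block.first, end = block.second;
        pool.emplace_back([&data, begin, end] {
            for (std::size_t idx = begin; idx < end; idx++) data[idx] *= data[idx];
        });
    }
    for (auto& t : pool) t.join();
}
```

With ten elements and two workers, split_range(10, 2) yields the zero-based ranges [0, 5) and [5, 10), which correspond to the 1..5 and 6..10 grouping in the answer.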
Everything around creating and calling threads is relatively expensive, so you want to do it as little as possible.
If you parallelize your inner loop instead of the outer loop, then j_max threads are created for each iteration of the outer loop, so thread creation is repeated i_max times instead of happening once, as it would if you parallelized the outer loop.
That said, the best parallelization depends on your actual problem. Depending on that, it can actually make sense to parallelize the inner loop instead.
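The difference between the two placements can be seen by writing both variants literally, with one thread per loop value as in the question. This is a sketch for comparison rather than a recommended design; do_work, work_done and the two function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Counts calls so both variants can be checked to do the same amount of work.
// do_work is a hypothetical stand-in for the "// do something" body.
std::atomic<long> work_done{0};
inline void do_work(int, int) { work_done.fetch_add(1); }

// Variant 1: one thread per value of i; each thread loops over j itself.
// Returns the number of threads created (i_max).
long run_outer_parallel(int i_max, int j_max) {
    assert(i_max >= 0 && j_max >= 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < i_max; i++)
        threads.emplace_back([i, j_max] {
            for (int j = 0; j < j_max; j++) do_work(i, j);
        });
    for (auto& t : threads) t.join();
    return static_cast<long>(threads.size());
}

// Variant 2: for each i, one thread per value of j, all joined before the
// next i starts. Returns the number of threads created (i_max * j_max).
long run_inner_parallel(int i_max, int j_max) {
    assert(i_max >= 0 && j_max >= 0);
    long created = 0;
    for (int i = 0; i < i_max; i++) {
        std::vector<std::thread> threads;
        for (int j = 0; j < j_max; j++)
            threads.emplace_back([i, j] { do_work(i, j); });
        for (auto& t : threads) t.join();
        created += static_cast<long>(threads.size());
    }
    return created;
}
```

For i_max = 4 and j_max = 8, the first variant creates 4 threads and the second creates 32, with a join barrier after every outer iteration, while both perform the same 32 calls to do_work.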
It depends on the tasks and on the platform you're going to run the simulation on. For example, on CUDA's architecture you can split the tasks up so that each i,j,1 is done individually.
You still have to take into account the time needed to load data onto the card.
Using for loops and something like OpenMP/MPI/your own threading mechanism, you can basically choose. In one scenario, parallel threads are broken out at the outer loop and j is looped sequentially on each thread. In the other, the outer loop is processed sequentially and the inner loop is broken out into threads on each iteration.
Parallelisation (breaking out threads) is costly. Remember that you have the cost of setting up n threads, then synchronising n threads. This represents a cost c over and above the runtime of the routines, which in and of itself can make the total time greater for parallel processing than in single-threaded mode. It depends on the problem in question; often, there's a critical size beyond which parallel is faster.
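The critical size can be illustrated with a deliberately simple model. Assume (this is an idealisation, not something stated in the answer) that the single-threaded runtime is T_1, that the work divides evenly over n threads, and that setup plus synchronisation adds a fixed cost c:

```latex
T_n = c + \frac{T_1}{n},
\qquad
T_n < T_1 \iff T_1 > \frac{c\,n}{n-1} \quad (n > 1).
```

Under these assumptions, with n = 4 threads and c = 3 ms, the parallel version only wins once the serial work exceeds 4 ms. If the inner loop is the one being parallelised, c is paid on every outer iteration, so the threshold applies to each inner loop separately.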
I would suggest that breaking out into the parallel zone at the first for loop would be faster. If you do so on the inner loop, you must fork/join each time the loop runs, adding a large overhead to the runtime of the code. Ideally, you want to create threads only once.