How to divide the load across different processors

Posted 2024-10-31 14:55:53

I am running some parallel code on a machine that has 4 Intel processors with 8 cores each. I am using TBB. Suppose a given loop (that I parallelize) has X iterations; how should I choose my grain size to ensure the load is evenly divided?

Comments (2)

情感失落者 2024-11-07 14:55:53

Assume you have N equally powerful CPUs.

If there are no loop-carried dependencies (i.e., nothing computed in iteration i is used by later iterations), then you can simply run loop iterations 0..X/N on CPU 1, iterations (X/N)+1..(2*X/N) on CPU 2, and so on. This assumes that each iteration takes the same amount of time, or at least that the average time per iteration doesn't vary wildly.
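The static split described above can be sketched with plain `std::thread`, independent of TBB. The names `static_chunks` and `static_parallel_for` are made up for this illustration, and the sketch assumes n > 0 and that `body(i)` is safe to call concurrently for different i. It uses half-open ranges and hands the remainder of X/N to the first few chunks, so chunk sizes differ by at most one.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Split [0, x) into n contiguous half-open chunks whose sizes differ by at most one.
// Assumes n > 0. (Hypothetical helper, not part of any library.)
std::vector<std::pair<std::size_t, std::size_t>> static_chunks(std::size_t x, std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    const std::size_t base = x / n;
    const std::size_t extra = x % n;  // the first `extra` chunks get one more iteration
    std::size_t begin = 0;
    for (std::size_t w = 0; w < n; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}

// Run body(i) for every i in [0, x), using one thread per chunk.
void static_parallel_for(std::size_t x, std::size_t n,
                         const std::function<void(std::size_t)>& body) {
    std::vector<std::thread> workers;
    for (const auto& c : static_chunks(x, n)) {
        const std::size_t begin = c.first;
        const std::size_t end = c.second;
        workers.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : workers) t.join();
}
```

This only balances well when every iteration costs about the same; a chunk that happens to contain the slow iterations will finish last while the other threads sit idle.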

If there are loop-carried dependencies, you may have a problem if iteration i depends on all previous iterations. If it depends only on the previous k iterations, you can have CPU 1 do iterations 0..X/N and CPU 2 do iterations X/N-k..(2*X/N), and so on for all processors. This wastes some work, but it lets each CPU recompute the results it needs instead of waiting for its neighbour.
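One way to picture the overlap trick is a loop where iteration i needs an intermediate value from each of the k preceding iterations. In the sketch below, `step`, `windowed_chunk`, and `windowed_parallel` are hypothetical names, and `step` stands in for whatever expensive per-iteration work the real loop does. Each worker redoes the k iterations just before its own chunk as a warm-up and writes output only for its own range.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-iteration work; a placeholder for something expensive.
long step(long v) { return v * v; }

// Computes out[i] = step(in[i]) + step(in[i-1]) + ... + step(in[i-k]) for i in [begin, end),
// clamping the window at index 0. The worker recomputes the k step() values that precede
// its chunk rather than reading them from another thread.
void windowed_chunk(const std::vector<long>& in, std::vector<long>& out,
                    std::size_t begin, std::size_t end, std::size_t k) {
    const std::size_t warm = begin >= k ? begin - k : 0;  // start of the warm-up region
    std::vector<long> t;                                  // step() values for [warm, end)
    t.reserve(end - warm);
    for (std::size_t i = warm; i < end; ++i) {
        t.push_back(step(in[i]));
        if (i < begin) continue;                          // warm-up: redundant work, no output
        const std::size_t lo = i >= k ? i - k : 0;
        long sum = 0;
        for (std::size_t j = lo; j <= i; ++j) sum += t[j - warm];
        out[i] = sum;
    }
}

// Split the input into n contiguous chunks and process each on its own thread. Assumes n > 0.
std::vector<long> windowed_parallel(const std::vector<long>& in, std::size_t n, std::size_t k) {
    std::vector<long> out(in.size());
    std::vector<std::thread> workers;
    const std::size_t x = in.size();
    for (std::size_t w = 0; w < n; ++w) {
        const std::size_t begin = w * x / n;
        const std::size_t end = (w + 1) * x / n;
        workers.emplace_back(windowed_chunk, std::cref(in), std::ref(out), begin, end, k);
    }
    for (auto& t : workers) t.join();
    return out;
}
```

The redundant work is k iterations per chunk boundary, so this pays off only when k is small compared with X/N.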

If iterations take wildly varying amounts of time, you're better off setting up a worklist containing the iterations and having the CPUs grab iterations from the worklist as they complete previous ones. This way the work is divided up as demand appears. You have to be sure that the time per unit of work grabbed is much larger than the effort to get the work, or you'll get no parallel advantage; one way to do this is to grab a small range of iterations from the worklist at a time, such that the total work in the range significantly exceeds the scheduling overhead.
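A minimal version of such a worklist is a shared atomic counter from which each thread claims the next range of iterations. The sketch below is an illustration only: `dynamic_parallel_for` and its `chunk` parameter are invented names, and it assumes chunk > 0 and that `body(i)` is safe to call concurrently for different i.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run body(i) for every i in [0, x). Threads repeatedly claim `chunk` consecutive
// iterations from a shared counter, so faster threads end up doing more ranges.
void dynamic_parallel_for(std::size_t x, std::size_t n_threads, std::size_t chunk,
                          const std::function<void(std::size_t)>& body) {
    std::atomic<std::size_t> next{0};  // index of the first unclaimed iteration
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk);  // claim [begin, begin + chunk)
            if (begin >= x) return;                           // worklist is empty
            const std::size_t end = std::min(begin + chunk, x);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < n_threads; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
}
```

Here `chunk` plays the role of the grain size: a larger value means fewer atomic operations but coarser balancing, and a smaller value means the opposite.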

岁月染过的梦 2024-11-07 14:55:53

With TBB, you don't have to select a grain size for parallel_for. In most cases, TBB will dynamically load-balance the work pretty well by default. Ira Baxter's answer correctly describes how you should partition the work across a pool of threads, but TBB already has similar mechanisms in place that do this for you.

ADDED: Of course, manual work partitioning might give better results in complex cases. In that case, though, one would likely need to use TBB tasks, as parallel_for might not provide enough control; for example, in general it is not possible to specify the exact size of a per-thread chunk.
