"Work Stealing" vs. "Work Shrugging"?

Posted 2024-08-27 09:33:24

Why is it that I can find lots of information on "work stealing" and nothing on "work shrugging" as a dynamic load-balancing strategy?

By "work-shrugging" I mean pushing surplus work away from busy processors onto less loaded neighbours, rather than have idle processors pulling work from busy neighbours ("work-stealing").

I think the general scalability should be the same for both strategies. However, I believe that it is much more efficient, in terms of latency and power consumption, to wake an idle processor when there is definitely work for it to do than to have all idle processors periodically polling all neighbours for possible work.

Anyway, a quick Google search didn't turn up anything under the heading of "Work Shrugging" or similar, so any pointers to prior art and the jargon for this strategy would be welcome.

Clarification

I actually envisage the work-submitting processor (which may or may not be the target processor) being responsible for looking around the immediate locality of the preferred target processor (based on data/code locality) to decide whether a near neighbour should be given the new work instead, because it doesn't have as much work to do.

I don't think the decision logic would require much more than an atomic read of the immediate (typically 2 to 4) neighbours' estimated queue lengths. I do not think this implies any more coupling than the thieves polling and stealing from their neighbours does. (I am assuming "lock-free, wait-free" queues in both strategies.)
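To make that concrete, here is a minimal sketch of the kind of decision logic I have in mind, assuming each worker keeps a cheap atomic estimate of its queue length. All names here (Task, WorkerQueue, submit_with_shrug) are hypothetical, and the lock-free enqueue itself is elided; this is an illustration, not an implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct Task { /* payload of a work item */ };

struct WorkerQueue {
    std::atomic<std::size_t> estimated_length{0};  // cheap, possibly stale estimate

    void push(Task t) {
        estimated_length.fetch_add(1, std::memory_order_relaxed);
        // ... enqueue t onto this worker's lock-free queue and wake it if idle ...
        (void)t;
    }
};

// Place 'task' on the preferred worker unless an immediate neighbour
// (chosen for data/code locality) has a clearly shorter queue.
void submit_with_shrug(Task task,
                       WorkerQueue& preferred,
                       const std::vector<WorkerQueue*>& neighbours) {
    WorkerQueue* target = &preferred;
    std::size_t best = preferred.estimated_length.load(std::memory_order_relaxed);
    for (WorkerQueue* n : neighbours) {            // typically 2 to 4 neighbours
        std::size_t len = n->estimated_length.load(std::memory_order_relaxed);
        if (len + 1 < best) {                      // demand a clear win to avoid ping-pong
            best = len;
            target = n;
        }
    }
    target->push(std::move(task));
}
```

On the busy submit path the only cross-processor traffic is a handful of relaxed atomic loads, which is the coupling I am comparing above to thieves polling their victims.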

Resolution

It seems that what I meant (but only partially described!) as a "Work Shrugging" strategy is in the domain of "normal" upfront scheduling strategies that happen to be smart about processor, cache & memory loyalty, and scalable.

I find plenty of references searching on these terms and several of them look pretty solid. I will post a reference when I identify one that best matches (or demolishes!) the logic I had in mind with my definition of "Work Shrugging".

Comments (6)

傲娇萝莉攻 2024-09-03 09:33:24

Load balancing is not free; it carries the cost of a context switch (into the kernel), of finding an idle processor, and of choosing work to reassign. Especially in a machine where tasks switch all the time, dozens of times per second, this cost adds up.

So what's the difference? Work-shrugging means you further burden over-provisioned resources (busy processors) with the overhead of load-balancing. Why interrupt a busy processor with administrivia when there's a processor next door with nothing to do? Work stealing, on the other hand, lets the idle processors run the load balancer while busy processors get on with their work. Work-stealing saves time.

Example

Consider: Processor A has two tasks assigned to it. They take time a1 and a2, respectively. Processor B, nearby (the distance of a cache bounce, perhaps), is idle. The processors are identical in all respects. We assume the code for each task and the kernel is in the i-cache of both processors (no added page fault on load balancing).

A context switch of any kind (including load-balancing) takes time c.

No Load Balancing

The time to complete the tasks will be a1 + a2 + c. Processor A will do all the work, and incur one context switch between the two tasks.

Work-Stealing

Assume B steals a2, incurring the context switch time itself. The work will be done in max(a1, a2 + c) time. Suppose processor A begins working on a1; while it does that, processor B will steal a2 and avoid any interruption in the processing of a1. All the overhead on B is free cycles.

If a2 is the shorter task, you have effectively hidden the cost of the context switch in this scenario; the total time is just a1.

Work-Shrugging

Assume B completes a2, as above, but A incurs the cost of moving it ("shrugging" the work). The work in this case will be done in max(a1, a2) + c time; the context switch is now always added to the total time instead of being hidden. Processor B's idle cycles have been wasted here; instead, the busy processor A has burned time shrugging work over to B.
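To put some hypothetical numbers on this: if a1 = 5 ms, a2 = 3 ms and c = 1 ms, then no load balancing takes 5 + 3 + 1 = 9 ms, work-stealing takes max(5, 3 + 1) = 5 ms, and work-shrugging takes max(5, 3) + 1 = 6 ms.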

旧伤还要旧人安 2024-09-03 09:33:24

I think the problem with this idea is that it makes the threads with actual work to do waste their time constantly looking for idle processors. Of course there are ways to make that faster, like having a queue of idle processors, but then that queue becomes a concurrency bottleneck. So it's just better to have the threads with nothing better to do sit around and look for jobs.

谈情不如逗狗 2024-09-03 09:33:24

The basic advantage of 'work stealing' algorithms is that the overhead of moving work around drops to 0 when everyone is busy. So there's only overhead when some processor would otherwise have been idle, and that overhead cost is mostly paid by the idle processor with only a very small bus-synchronization related cost to the busy processor.
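A minimal sketch of why that is, assuming a simplified worker loop: the stealing path is reached only after the worker's own queue runs dry, so a busy worker never executes it. A mutex-guarded deque stands in here for the lock-free deques assumed in the question, and the names (Worker, try_pop, try_steal, run_once) are illustrative only.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> queue;           // stand-in for a lock-free work-stealing deque
    std::mutex m;
    std::vector<Worker*> neighbours;

    std::optional<Task> try_pop() {   // owner takes from the front of its own queue
        std::lock_guard<std::mutex> lock(m);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.front());
        queue.pop_front();
        return t;
    }

    std::optional<Task> try_steal() { // thieves take from the back of a victim's queue
        std::lock_guard<std::mutex> lock(m);
        if (queue.empty()) return std::nullopt;
        Task t = std::move(queue.back());
        queue.pop_back();
        return t;
    }

    void run_once() {
        if (auto t = try_pop()) {            // busy path: no balancing work at all
            (*t)();
            return;
        }
        for (Worker* victim : neighbours) {  // idle path: only now pay the
            if (auto t = victim->try_steal()) {  // synchronization cost of stealing
                (*t)();
                return;
            }
        }
        // otherwise back off / sleep until new work is submitted
    }
};
```

When every queue is non-empty, run_once never leaves the busy branch, which is where the "overhead drops to 0 when everyone is busy" claim comes from.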

梦里寻她 2024-09-03 09:33:24

Work stealing, as I understand it, is designed for highly-parallel systems, to avoid having a single location (single thread, or single memory region) responsible for sharing out the work. In order to avoid this bottleneck, I think it does introduce inefficiencies in simple cases.

If your application is not so parallel that a single point of work distribution causes scalability problems, then I would expect you could get better performance by managing it explicitly as you suggest.

No idea what you might google for though, I'm afraid.

尴尬癌患者 2024-09-03 09:33:24

Some issues... if a busy thread is busy, wouldn't you want it spending its time processing real work instead of speculatively looking for idle threads to offload onto?

How does your thread decide when it has so much work that it should stop doing that work to look for a friend that will help?

How do you know that the other threads don't have just as much work and you won't be able to find a suitable thread to offload onto?

Work stealing seems more elegant, because it solves the same problem (contention) in a way that guarantees that the threads doing the load balancing are only doing the load balancing when they would otherwise have been idle.

It's my gut feeling that what you've described will not only be much less efficient in the long run, but will also require a lot of per-system tweaking to get acceptable results.

Though in your edit you suggest that you want the submitting processor to handle this, not the worker threads, as you suggested earlier and in some of the comments here. If the submitting processor is searching for the lowest queue length, you're potentially adding latency to the submit, which isn't really a desirable thing.

But more importantly, it's a technique complementary to work-stealing, not mutually exclusive with it. You've potentially alleviated some of the contention that work-stealing was invented to control, but you still have a number of things to tweak before you'll get good results, those tweaks won't be the same for every system, and you still risk running into situations where work-stealing would have helped you.

I think your edited suggestion, with the submission thread doing "smart" work distribution is potentially a premature optimization against work-stealing. Are your idle threads slamming the bus so hard that your non-idle threads can't get any work done? Then comes the time to optimize work-stealing.

猫烠⑼条掵仅有一顆心 2024-09-03 09:33:24

So, by contrast to "Work Stealing", what is really meant here by "Work Shrugging", is a normal upfront work scheduling strategy that is smart about processor, cache & memory loyalty, and scalable.

Searching on combinations of the terms / jargon above yields many substantial references to follow up. Some address the added complication of machine virtualisation, which wasn't in fact a concern of the questioner, but the general strategies are still relevant.
