Why doesn't increasing pipeline depth always mean increased throughput?

Posted on 2024-08-27 17:22:29

This is perhaps more of a discussion question, but I thought Stack Overflow could be the right place to ask it. I am studying the concept of instruction pipelining. I have been taught that a pipeline's instruction throughput increases as the number of pipeline stages increases, but in some cases throughput might not change. Under what conditions does this happen? I am thinking stalling and branching could be the answer, but I wonder if I am missing something crucial.
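
A rough way to quantify the trade-off (a textbook-style idealization; the symbols below are illustrative assumptions, not part of the original question): throughput depends on both the clock period and the average cycles per instruction, and deepening the pipeline improves one while tending to hurt the other.

throughput ≈ 1 / (T_clk × CPI)
T_clk      ≈ T_unpipelined / D + T_latch        (D = number of pipeline stages)
CPI        ≈ 1 + stall and flush cycles per instruction

As D grows, T_clk eventually stops shrinking (latch/register overhead dominates) while the stall and flush term tends to grow, so throughput can level off or even fall.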


Comments (5)

墨小沫ゞ 2024-09-03 17:22:29

Throughput can be stalled by other instructions when waiting for a result, or on cache misses. Pipelining doesn't itself guarantee that the operations are totally independent.
Here is a great presentation about the intricacies of the x86 Intel/AMD architecture: http://www.infoq.com/presentations/click-crash-course-modern-hardware

It explains things like this in great detail, and covers some of the ways to further improve throughput and hide latency. JustJeff mentioned out-of-order execution for one; you also have shadow registers that aren't exposed in the programmer-visible model (more than 8 registers on x86), and you have branch prediction.
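
As a concrete illustration of that point, here is a minimal C sketch (the names sum_list, sum_array and struct node are made up for this example; neither the talk nor the answer contains this code): the first loop is a chain of dependent loads, so every cache miss stalls the pipeline, while the second loop's loads are independent and can be overlapped by the hardware.

#include <stddef.h>

struct node { struct node *next; long value; };

/* Pointer chasing: each load's address comes from the previous load,
 * so the pipeline (and the out-of-order window) cannot start fetching
 * node i+1 until node i has arrived; every miss is exposed latency. */
long sum_list(const struct node *n) {
    long sum = 0;
    while (n != NULL) {
        sum += n->value;
        n = n->next;   /* serial dependency on the previous load */
    }
    return sum;
}

/* Independent loads: only the running sum is a dependency, and it can
 * be forwarded cheaply, so misses overlap and the pipeline stays full. */
long sum_array(const long *a, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

Both functions do "the same work" per element, but on a real machine the linked-list version is typically several times slower once the data no longer fits in cache, because the dependent chain keeps the pipeline waiting.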

花开半夏魅人心 2024-09-03 17:22:29

Agreed. The biggest problems are stalls (waiting for results from previous instructions), and incorrect branch prediction. If your pipeline is 20 stages deep, and you stall waiting for the results of a condition or operation, you're going to wait longer than if your pipeline was only 5 stages. If you predict the wrong branch, you have to flush 20 instructions out of the pipeline, as opposed to 5.

I suppose you could also have a deep pipeline where multiple stages attempt to access the same hardware (an ALU, etc.), which would cause a performance hit, though hopefully you add enough additional units to support each stage.
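
To put rough numbers on that (the figures are illustrative assumptions, not from the answer): suppose one instruction in five is a branch, 10% of branches are mispredicted, and a misprediction costs roughly the depth of the pipeline in flushed cycles.

mispredictions per instruction ≈ 0.2 × 0.10 = 0.02
5-stage pipeline:  CPI ≈ 1 + 0.02 × 4  ≈ 1.08
20-stage pipeline: CPI ≈ 1 + 0.02 × 15 ≈ 1.30

The deeper pipeline has to recover that extra ~20% in cycles per instruction through a faster clock just to break even on this workload.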

乄_柒ぐ汐 2024-09-03 17:22:29

Instruction level parallelism has diminishing returns. In particular, data dependencies between instructions determine the possible parallelism.

Consider the case of Read after Write (known as RAW in textbooks).

Using a syntax where the first operand receives the result, consider this example.

10: add r1, r2, r3
20: add r1, r1, r1

The result of line 10 must be known by the time the computation of line 20 begins. Data forwarding mitigates this problem, but only to the point where the data is actually known.
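
A cycle-by-cycle sketch of that pair (a classic 5-stage pipeline is assumed here purely for illustration; the stage names are textbook conventions, not from the answer):

cycle:             1    2    3    4    5    6
10: add r1,r2,r3   IF   ID   EX   MEM  WB
20: add r1,r1,r1        IF   ID   EX   MEM  WB

With forwarding, the r1 produced in instruction 10's EX stage (cycle 3) is fed straight into instruction 20's EX stage (cycle 4), so no bubble is needed. Without forwarding, 20 would have to wait for 10's write-back, inserting stall cycles; and if the producer were a load that missed in the cache, nothing could be forwarded until the data actually arrived, which is the limit described above.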

爱殇璃 2024-09-03 17:22:29

I would also think that pipelining beyond the amount of time the longest instruction in a series takes to execute would not cause an increase in performance. I do think that stalling and branching are the fundamental issues, though.

拿命拼未来 2024-09-03 17:22:29

Definitely stalls/bubbles in long pipelines cause a huge loss in throughput. And of course, the longer the pipeline the more clock cycles are wasted.

I tried for a long time to think of other scenarios where longer pipelines could cause a loss in performance, but it all comes back to stalls. (And number of execution units and issue schemes, but those don't have much to do with pipeline length.)
