GPU serialization breakdown
According to this, http://www.nvidia.co.uk/content/PDF/isc-2011/Ziegler.pdf, I understand that replays in the GPU literature mean serializations. But what are the factors that contribute to the number of serializations?
To investigate, I ran some experiments: I profiled a few kernels and computed the number of replays (= issued instructions - executed instructions). Sometimes the number of shared memory bank conflicts is equal to the number of replays; other times it is smaller. This suggests that bank conflicts are always one factor, but not the only one. What are the others?
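To make the experiment concrete, here is a minimal sketch (my own example, not from the slides) of a kernel that should provoke shared memory bank conflicts and therefore replays. It assumes 32 banks of 4-byte words, which is the layout on Fermi and later; the exact replay count may differ on other hardware.

```cuda
#include <cuda_runtime.h>

__global__ void bankConflictKernel(float *out)
{
    __shared__ float tile[32 * 32];

    // Conflict-free initialization: in every iteration consecutive threads
    // touch consecutive banks.
    for (int j = 0; j < 32; ++j)
        tile[j * 32 + threadIdx.x] = (float)(j + threadIdx.x);
    __syncthreads();

    // Conflicting read: thread i loads tile[i * 32], so all 32 threads of a
    // warp address the same bank and the load should be replayed up to 32 ways.
    float v = tile[threadIdx.x * 32];

    // Conflict-free alternative for comparison:
    // float v = tile[threadIdx.x];

    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 8 * 32 * sizeof(float));
    bankConflictKernel<<<8, 32>>>(d_out);   // one warp per block keeps the counters easy to read
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Profiling this with counters along the lines of inst_issued and inst_executed (the exact counter names vary by architecture and profiler version) and comparing the two variants of the load should show how much of the issued-minus-executed gap is attributable to bank conflicts alone.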
According to the slides above (slide 35), there are some others:
- Instruction cache misses
- Constant memory bank conflicts (see the sketch below)
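For the constant memory case, a sketch like the following (again my own example, based on the documented behavior that constant memory reads are only broadcast when all threads of a warp read the same address) should serialize the constant loads:

```cuda
#include <cuda_runtime.h>

__constant__ float cdata[256];

__global__ void constantReads(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Non-uniform access: every thread in the warp reads a different constant
    // address, so the constant cache has to serve the addresses one by one.
    out[i] = cdata[threadIdx.x];

    // Uniform (broadcast) access for comparison, served in a single pass:
    // out[i] = cdata[0];
}

int main()
{
    float h[256];
    for (int k = 0; k < 256; ++k) h[k] = (float)k;
    cudaMemcpyToSymbol(cdata, h, sizeof(h));

    float *d_out;
    cudaMalloc(&d_out, 4 * 256 * sizeof(float));
    constantReads<<<4, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```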
To my understanding, there can be two others:
- The number of branch divergences. Since both paths of a divergent branch are executed, there are replays. But I'm not sure whether divergence affects the number of issued instructions (see the sketch after this list).
- The number of cache misses. I have heard that long-latency memory requests are sometimes replayed. But in my experiments, the number of L1 cache misses is often higher than the number of replays.
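Regarding the divergence question, this is the kind of kernel I would profile to check it (my own sketch, not from the post): the odd/even split guarantees that both branches are taken inside every warp, so issued vs. executed instructions can be compared against a non-divergent version.

```cuda
#include <cuda_runtime.h>

__global__ void divergentKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Odd/even split inside every warp forces both paths to be issued.
    if (threadIdx.x & 1)
        out[i] = i * 3;
    else
        out[i] = i + 7;

    // For a non-divergent comparison, change the condition to
    // (threadIdx.x < 128) so that whole warps take a single path.
}

int main()
{
    const int n = 1 << 20;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    divergentKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```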
Can anyone confirm whether these are the factors that contribute to serializations? Which of them is incorrect, and am I missing anything else?
Thanks
1 Answer
As far as I know, branch divergence contributes to instruction replay.
I am not sure about the number of cache misses. Those should be handled transparently by the memory controller without affecting the instruction stream. The worst case I can think of is that the pipeline stalls until the data has been fetched from memory.
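If it helps, one way to test that empirically is a kernel whose loads miss in L1 almost every time, for example a large strided read (a sketch of my own, with arbitrary sizes). Comparing the profiler's L1 miss counters against issued-minus-executed on your hardware should show whether the misses themselves are being counted as replays or only as stalls.

```cuda
#include <cuda_runtime.h>

__global__ void stridedRead(const float *in, float *out, int nIn, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // A stride of 33 floats puts each thread of a warp on a different cache
    // line, so the loads are uncoalesced and the L1 hit rate collapses.
    int j = (int)(((long long)i * stride) % nIn);
    out[i] = in[j];
}

int main()
{
    const int nIn = 1 << 24;        // 64 MiB of floats, far larger than L1
    const int nThreads = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, nIn * sizeof(float));
    cudaMalloc(&d_out, nThreads * sizeof(float));
    cudaMemset(d_in, 0, nIn * sizeof(float));

    stridedRead<<<nThreads / 256, 256>>>(d_in, d_out, nIn, 33);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```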