Using perf stat with a multithreaded application
I have a serial implementation and an OpenMP implementation. For the same problem size (3200x3200), perf stat -a -e instructions,cycles shows:
Serial
265,755,992,060 instructions # 0.71 insn per cycle
375,319,584,656 cycles
85.923380841 seconds time elapsed
Parallel (16 threads)
264,524,937,733 instructions # 0.30 insn per cycle
883,342,095,910 cycles
13.381343295 seconds time elapsed
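In the parallel run, I expected the cycle count to be roughly the same as in the serial run. But it is not.
Any ideas about the difference?
Update:
I reran the experiment with 8 and 16 threads, since the processor supports at most 16 threads.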
Using 8 threads
Max nthread = 16
Total execution Time in seconds: 13.4407235400
MM execution Time in seconds: 13.3349801241
Performance counter stats for 'system wide':
906.51 Joules power/energy-pkg/
264,995,810,457 instructions # 0.59 insn per cycle
449,772,039,792 cycles
13.469242993 seconds time elapsed
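Using 16 threads
Max nthread = 16
Total execution Time in seconds: 13.2618084711
MM execution Time in seconds: 13.1565077840
Performance counter stats for 'system wide':
1,000.39 Joules power/energy-pkg/
264,309,881,365 instructions # 0.30 insn per cycle
882,881,366,456 cycles
13.289234564 seconds time elapsed
As you can see, the wall-clock time is roughly the same, but the cycle count with 16 threads is about twice that with 8 threads. In other words, with more threads the wall-clock time stays the same while the cycle count goes up and the IPC goes down. According to perf list, the event is cpu-cycles OR cycles [Hardware event]. What I would like to know is whether this is the average cycle count for one core or the aggregate over N cores. Any comments on that?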
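One way to probe the "average or aggregate" question is to divide each cycle count by the elapsed time and by the number of threads. The sketch below uses only the numbers posted above. It assumes that logical CPUs without a thread to run add almost nothing to a system-wide (-a) count, and the clock frequency of the machine is not given in the question, so treat the result as a plausibility check and not as proof.

```python
# Cycle counts and elapsed times copied from the perf stat output above.
# Each entry: (cycles, elapsed seconds, number of threads).
runs = {
    "serial, 1 thread": (375_319_584_656, 85.923380841, 1),
    "OpenMP, 8 threads": (449_772_039_792, 13.469242993, 8),
    "OpenMP, 16 threads": (882_881_366_456, 13.289234564, 16),
}


def cycles_per_second_per_thread(cycles, seconds, threads):
    """Reported cycle count divided by wall time and by the number of busy threads."""
    return cycles / seconds / threads


# Rate in units of 1e9 cycles per second, i.e. comparable to a clock in GHz.
rates_ghz = {
    name: cycles_per_second_per_thread(*run) / 1e9 for name, run in runs.items()
}

for name, ghz in rates_ghz.items():
    print(f"{name:20s} {ghz:.2f}e9 cycles/s per busy thread")
```

All three runs come out between 4 and 4.5 billion cycles per second per busy thread, which looks like a core clock frequency. That is what one would expect if the reported number is the sum over every logical CPU that was running a thread, and not a per-core average.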
Comments (3)
Assuming your process scales perfectly, the instructions are divided among the cores and the total number of instructions stays the same. The same applies to the number of cycles. However, when your process does not scale, the number of instructions should stay the same but the number of cycles increases. This is generally due to contention for a shared resource, which causes stall cycles in the pipeline.
In your parallel implementation, the memory hierarchy is not used efficiently: the implementation causes a lot of cache misses, which can saturate the L3 or the RAM when many threads are used, hence the memory stalls. If you use simultaneous multithreading (aka Hyper-Threading), this can also cause the problem, because two threads on the same core often do not run truly in parallel (some parts of the core are shared between threads).
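The pattern described here (same instructions, more cycles) can be read directly off the first pair of runs in the question. A small sketch using only those posted counts; the variable and function names are made up for the illustration:

```python
# Counts copied from the serial run and the 16-thread run in the question.
serial = {"instructions": 265_755_992_060, "cycles": 375_319_584_656}
parallel16 = {"instructions": 264_524_937_733, "cycles": 883_342_095_910}


def ipc(run):
    """Instructions per cycle, the same ratio perf prints as 'insn per cycle'."""
    return run["instructions"] / run["cycles"]


# How much each counter changed when going from 1 thread to 16 threads.
instruction_ratio = parallel16["instructions"] / serial["instructions"]
cycle_ratio = parallel16["cycles"] / serial["cycles"]

print(f"instructions: x{instruction_ratio:.3f}")
print(f"cycles:       x{cycle_ratio:.3f}")
print(f"IPC: {ipc(serial):.2f} (serial) vs {ipc(parallel16):.2f} (16 threads)")
```

The instruction count changes by less than one percent while the cycle count more than doubles, so the extra cycles are not spent on extra work. That fits the stall explanation above, although the counters alone do not say which shared resource is the bottleneck.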
A constant instruction count makes sense; there's the same amount of total work to do, whether those instructions all run on the same core or not.
Are you using SMT such as hyperthreading? IPC goes down when the same physical core divides its time between two logical cores. For some programs, SMT increases overall throughput; for cache-bound programs (like a naive matmul), having 2 threads compete for the same L1/L2 cache hurts.
Otherwise, if this is 16 physical cores, you might be seeing the same effect from L3 contention, when a single thread doesn't have all the L3 to itself.
Anyway, with proper cache-blocking, SMT usually hurts overall throughput for matmul / dense linear algebra. A physical core can be saturated with ALU work by one thread with well-tuned code, so contention for per-core caches just hurts. (But multi-threading definitely helps overall time, like it did in your case.) A cache-blocked matmul is usually 5 nested loops, as in the example near the end of "What Every Programmer Should Know About Memory?".
Agner Fog (https://agner.org/optimize/) also mentions hyperthreading hurting for some workloads in his microarch PDF.
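To answer the SMT question on a Linux machine, one option is to look at the thread_siblings_list file that the kernel exposes in sysfs for each logical CPU. The sketch below is an illustration, not part of the original answer: the helper names parse_cpu_list and smt_siblings are made up, and the code returns None on systems where that file is not available.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,8" or "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu=0):
    """Return the logical CPUs that share a physical core with `cpu`, or None if unknown."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return parse_cpu_list(path.read_text())
    except (OSError, ValueError):
        return None


siblings = smt_siblings(0)
if siblings is None:
    print("CPU topology not available on this system")
elif len(siblings) > 1:
    print(f"cpu0 shares its core with logical CPUs {siblings}: SMT is active")
else:
    print("cpu0 has its core to itself: SMT is off or not supported")
```

If SMT turns out to be active, one thing to try is running one OpenMP thread per physical core, for example by setting OMP_NUM_THREADS to the number of physical cores together with OMP_PLACES=cores, and comparing the cycle counts again.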
Matrix-matrix multiplication is the typical example of an operation that theoretically has lots of cache reuse, and so can run close to peak speed, except that the standard three-loop implementation doesn't. Optimized implementations that use the cache efficiently are actually six levels deep: each of your three loops gets tiled, and you need to interchange loops and set the tilings just right. This is not trivial.
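To make the six-level loop nest concrete, here is a minimal sketch of the tiling structure, checked against the plain three-loop version. It is written in Python only to show the loop structure, so it says nothing about speed. The function names, the tile size of 4 and the loop order are illustrative assumptions; a real implementation would be in C or Fortran with tile sizes tuned to the cache sizes of the machine.

```python
def matmul_naive(a, b, n):
    """Standard three-loop product of two n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same product with each of the three loops split into a tile loop and an inner loop."""
    c = [[0.0] * n for _ in range(n)]
    # Three outer loops walk over tiles of the i, k and j index ranges.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Three inner loops work on one tile; min() handles ragged edges.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        # Innermost loop runs along rows of b and c (contiguous in C layout).
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


n = 6  # deliberately not a multiple of the tile size
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
print(matmul_tiled(a, b, n) == matmul_naive(a, b, n))
```

The point of the tiling is that the three inner loops touch only three small blocks of a, b and c, which can stay in cache while they are reused, instead of streaming whole rows and columns from memory on every pass.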
So your implementation basically uses no cache. Maybe you can have some effects from the L3 cache, but, given a sufficient problem size, certainly not L1, and likely not L2. In other words: you are bandwidth-bound. And then the problem is that you may have 16 cores, but you don't have enough bandwidth to feed all of those.
I must say that your factor of 6--7 is a little disappointing, but maybe your architecture just doesn't have enough bandwidth. I know that for top-of-the-line nodes I would expect something like 12. But why don't you test it? Write a benchmark that does nothing but stream data to and from memory, such as vector-vector addition, and then see up to how many cores you get linear speedup.
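For reference, the factor of 6--7 follows from the elapsed times posted in the question. The sketch below recomputes it together with the parallel efficiency (speedup divided by thread count). It assumes the serial run and the 8- and 16-thread runs from the update are comparable, which the question suggests but does not state outright.

```python
# Elapsed wall-clock times (seconds) copied from the perf stat output in the question.
serial_seconds = 85.923380841
parallel_seconds = {8: 13.469242993, 16: 13.289234564}

speedup = {}
efficiency = {}
for threads, seconds in parallel_seconds.items():
    speedup[threads] = serial_seconds / seconds
    # Efficiency of 1.0 would mean perfect linear scaling.
    efficiency[threads] = speedup[threads] / threads

for threads in parallel_seconds:
    print(f"{threads:2d} threads: speedup {speedup[threads]:.2f}, "
          f"efficiency {efficiency[threads]:.2f}")
```

Going from 8 to 16 threads leaves the speedup almost unchanged and roughly halves the efficiency, which is the kind of plateau one would expect once a shared resource such as memory bandwidth is saturated. The timings by themselves cannot distinguish that from SMT effects.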
To address some points you raise in your reply:
Speedup is nice, but you should take a look at performance. Matmul can run at 90+ percent of peak. Use MKL or any optimized BLAS, and compare that to what you get.
SIMD makes no difference in speedup, only in absolute performance.
In that i,j,k update you're not stating which index is in the inner loop. Regardless, let your compiler generate an optimization report. You'll find that compilers are that clever, and may very well have interchanged loops to get vectorization.
FP latency is not a concern in kernels as simple as yours. Between prefetching and out-of-order execution and whatnot you really don't have to worry much about latency.
Really: your performance is bandwidth-limited. But you're only measuring speedup so you don't really see that, apart from the fact that using all your cores will saturate your bandwidth and cap your speedup.