I'm looking at some optimized, low level, cross platform, concurrency code designed to run on multi-core machines, and want to check some of its assumptions.
Some kinds of hardware optimization probably aren't supported on multi-core designs (for example, Out of Order Execution support [wikipedia] seems like a good candidate - it takes a lot of die area to implement, and can be a power hog). Does anyone have a list of other such facilities - ones typically available on machines with one or a few cores, but typically left out of machines with larger numbers of cores?
Today, multicore machines are warmed-over die shrinks of uniprocessors. You could almost imagine sawing a 4-core die into 4 1-core dice. I exaggerate only a little bit.
In the future, multicore machines will be more thoughtfully designed for energy efficiency and area efficiency. You may see the same ISA, but with different mixes of resources (more or fewer duplicated functional units), and even with some sharing of resources between cores (e.g. AMD Bulldozer). And, as you say, they will back off from the complexity and energy overhead of no-holds-barred out-of-order execution. This will most likely be perceived as differences in instructions per clock (IPC), that is, more or less performance, on the same instruction set architecture.
Also, as vendors have to juggle a hypothetical portfolio of big out-of-order serial-performance-optimized cores and small in-order or less-out-of-order (OoO), narrower, more energy efficient "throughput" cores, they will be challenged to keep these different implementations in sync with the evolution of their ISAs. Some cores may support new instructions, new state, new coprocessors, virtualization, security, etc. earlier than others. This leads to a challenge of coding to the common denominator while also lighting up the new facilities for better perf or energy efficiency (or whatever) on those cores that have the new capabilities.
So to answer your specific question, all the traditional computer architecture techniques for trading gates for expressive power, performance, or energy efficiency may be rethought and selectively removed in future small throughput-oriented cores.
But it goes both ways. It may also be that the new small throughput-optimized energy-optimized cores have new features not present in the older OoO cores. For example, the Larrabee New Instructions (LRBni) (http://www.drdobbs.com/high-performance-computing/216402188) were proposed for a machine with dozens of simpler cores. As another example, the small cores may turn to hardware multithreading to afford better memory latency tolerance to compensate for smaller private caches.
Also, having lots of small energy frugal cores means you may be willing to dedicate and therefore customize some of the cores to optimize performance for particular valuable workloads. For example, the Tensilica custom processors and tools anticipate that some of your small cores will have additional instructions and custom problem-specific datapaths (accelerating an inner loop of video decoding, for example). So in these cases the little core may (counter-intuitively) have much better performance than the much larger core.
Makes sense?
Happy hacking!