Does anyone know of a compiler that optimizes code for energy consumption on embedded devices?

Posted on 2024-10-12 23:40:10

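It is commonly held that faster code consumes less power, because the CPU can then spend more time idle. But when we talk about energy consumption, there is the following possibility: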

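Suppose one instruction sequence executes in 1 ms, and the average current drawn during execution is 40 mA. Your Vdd is 3.3 V,

so the total energy consumed = V*I*t = 3.3 * 40*10^-3 * 1*10^-3 J = 132*10^-6 J.

In another case, an instruction sequence takes 2 ms to execute, and the average current drawn during execution is 15 mA. Vdd is again 3.3 V,

so the total energy consumed = V*I*t = 3.3 * 15*10^-3 * 2*10^-3 J = 99*10^-6 J. The slower sequence uses less energy.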
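The arithmetic for the two cases can be checked with a short sketch. It assumes a constant supply voltage and uses the average current over the whole run; the function name `energy_joules` is just an illustrative choice. Working in SI units gives roughly 132 µJ for the fast sequence and 99 µJ for the slow one.

```python
def energy_joules(vdd_volts, current_amps, time_seconds):
    """Energy at a constant supply voltage and average current: E = V * I * t."""
    return vdd_volts * current_amps * time_seconds


# The two cases from the question, converted to SI units first.
fast = energy_joules(3.3, 40e-3, 1e-3)  # 1 ms at 40 mA
slow = energy_joules(3.3, 15e-3, 2e-3)  # 2 ms at 15 mA

print(f"fast sequence: {fast * 1e6:.1f} uJ")
print(f"slow sequence: {slow * 1e6:.1f} uJ")
print("slower sequence uses less energy:", slow < fast)
```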

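So here is the question: is there any architecture whose instruction set offers different instructions that perform the same task at different current consumption?

And if there is, is there a compiler that takes this into account and generates energy-efficient code?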

请远离我 2024-10-19 23:40:10

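Not that I know of, but I think it should be possible with a compiler framework like LLVM, by adjusting the weighting algorithm of the instruction scheduler.

Edit: there was a talk at FOSDEM about energy consumption analysis in LLVM.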
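To make the idea of "adjusting the weighting" concrete, here is a toy sketch of a cost function that blends latency with an energy estimate. It is not LLVM code and does not use any LLVM API; the candidate names, the cycle counts, and the energy figures are all made up for illustration, and mixing cycles with nanojoules in one score is a deliberate simplification.

```python
# Hypothetical per-sequence estimates: (latency in cycles, energy in nanojoules).
CANDIDATES = {
    "mul": (1, 4.0),
    "shift_add": (2, 2.5),
}


def weighted_cost(cycles, energy_nj, energy_weight):
    """Blend latency and energy; energy_weight=0.0 is a pure speed heuristic."""
    return (1.0 - energy_weight) * cycles + energy_weight * energy_nj


def pick(candidates, energy_weight):
    """Return the name of the candidate with the lowest blended cost."""
    return min(
        candidates,
        key=lambda name: weighted_cost(*candidates[name], energy_weight),
    )


speed_choice = pick(CANDIDATES, energy_weight=0.0)
energy_choice = pick(CANDIDATES, energy_weight=1.0)
print("optimizing for speed: ", speed_choice)
print("optimizing for energy:", energy_choice)
```

A real scheduler would need measured per-instruction energy data for the target, which is exactly the information most instruction-set documentation does not provide.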


送你一个梦 2024-10-19 23:40:10

Virtually any "code optimization" done by a compiler that computes the answer more quickly than the non-optimized code is "energy saving". (As another poster observed, avoiding cache misses is a big win.) So the real question is: "Which optimizations are explicitly intended to save energy, as opposed to reducing execution time?" (Note: some "optimizations" reduce code footprint, for example by abstracting sequences of code into subroutines; this may actually cost more energy.)

An unusual one, which I have not seen in any compiler, is changing the representation of the data. It turns out that the cost of storing or transmitting a zero bit is different from the cost of storing a one bit. (My experience with TTL and CMOS is that "zeros" are more expensive, because they are implemented in hardware as a kind of "active pull-down" through a resistor from the power supply, causing current flow and thus heat, whereas "ones" are implemented by letting the signal "float high" through that same resistor.) If there is such a bias, then one should arrange the program code and data to maximize the number of one bits rather than zero bits.

For data, this should be relatively straightforward to do. See this paper for a very nice survey and analysis of values found in memory; it contains some pretty wonderful charts. A common theme is that a large number of memory locations are occupied by members of a small set of distinct values. In fact, only a very small number of values (up to 8) occupy up to 48% of memory locations, and they are often very small numbers (the paper shows that for some programs a significant fraction of data transfers are for small values, e.g., 0 to 4, with zero being essentially the most common value). If zeros are truly more expensive to store or transfer than ones, the prevalence of small values suggests storing values in ones' complement form. This is a pretty easy optimization to implement. Given that the common values are not always the smallest N naturals, one could replace the Nth most frequent value in memory with N and store the complement of N, doing a lookup of the actual value closer to the processor. (The paper's author suggests a hardware "value reuse" cache, but that is not a compiler optimization.)
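The rank-and-complement scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: a 32-bit word size, made-up memory contents, and helper names (`build_rank_table`, `encode`, `decode`) invented here. It counts one bits before and after encoding; it does not model the cost of storing or consulting the lookup table, nor does it show that one bits are actually cheaper on any particular hardware.

```python
from collections import Counter

WORD_BITS = 32  # assumed word size
MASK = (1 << WORD_BITS) - 1


def one_bits(word):
    """Number of one bits in a WORD_BITS-wide word."""
    return bin(word & MASK).count("1")


def complement(word):
    """Bitwise complement within WORD_BITS bits."""
    return ~word & MASK


def build_rank_table(values):
    """Return (value -> rank, rank -> value), rank 0 being the most common value."""
    ranked = [v for v, _ in Counter(values).most_common()]
    return {v: rank for rank, v in enumerate(ranked)}, ranked


def encode(values, rank_of):
    """Store the complemented rank, so common values become mostly-one words."""
    return [complement(rank_of[v]) for v in values]


def decode(words, ranked):
    """Undo the complement and look the original value up by rank."""
    return [ranked[complement(w)] for w in words]


# Made-up memory contents, skewed toward a few values.
data = [0, 0, 0, 0, 1, 1, 4, 0, 1000, 0, 1, 4]
rank_of, ranked = build_rank_table(data)
stored = encode(data, rank_of)

plain_ones = sum(one_bits(v) for v in data)
stored_ones = sum(one_bits(w) for w in stored)
total_bits = WORD_BITS * len(data)
print(f"one bits stored plainly: {plain_ones} of {total_bits}")
print(f"one bits after encoding: {stored_ones} of {total_bits}")
```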

This is a bit harder to organize for program code, since the instruction set determines what you can say, and the instruction set was usually designed independently of any energy measurements. Yet one could choose among different instruction sequences (that's what optimizers do) and maximize the one bits in the instruction stream. I doubt this is very effective with conventional instruction set opcodes. One certainly could place variables in locations whose addresses contain many one bits, and prefer to use registers with higher numbers rather than lower ones (on the x86, EAX is binary register number 000 and EDI is register number 111). One could go so far as to design an instruction set according to instruction execution frequencies, assigning opcodes with larger numbers of one bits to frequently executed instructions.
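As a small illustration of the register-preference idea, the sketch below ranks the 32-bit x86 general-purpose registers by the number of one bits in their standard 3-bit encoding numbers. The ranking rule and the function name `preference_order` are assumptions made for this example; whether such a preference saves any measurable energy is the conjecture above, not an established result.

```python
# Standard 3-bit register numbers used in 32-bit x86 instruction encodings.
X86_REG_NUMBER = {
    "EAX": 0b000, "ECX": 0b001, "EDX": 0b010, "EBX": 0b011,
    "ESP": 0b100, "EBP": 0b101, "ESI": 0b110, "EDI": 0b111,
}


def one_bits(n):
    """Number of one bits in a non-negative integer."""
    return bin(n).count("1")


def preference_order(candidates):
    """Sort by one-bit count, most first; ties go to the higher register number."""
    return sorted(
        candidates,
        key=lambda r: (one_bits(X86_REG_NUMBER[r]), X86_REG_NUMBER[r]),
        reverse=True,
    )


# ESP and EBP are left out because they usually hold the stack and frame pointers.
allocatable = ["EAX", "ECX", "EDX", "EBX", "ESI", "EDI"]
order = preference_order(allocatable)
print(order)
```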


青芜 2024-10-19 23:40:10

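At the level of individual instructions, things like shifting instead of multiplying would certainly lower the current and therefore the energy consumption, but I'm not sure I buy your example of taking twice as long while using half the current (for a given clock rate). Does replacing a multiply with a shift-and-add, which doubles the time, really draw only half the current? There is so much else going on in a CPU (clock distribution across the chip alone takes current) that I would expect the background current to dominate.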
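A rough model shows why a dominant background current undermines the slower sequence. The figures here are hypothetical: a 30 mA background draw that is present whenever the CPU is running, plus an instruction-dependent extra of 10 mA for the multiply and 5 mA for the shift-and-add. The sketch also assumes the CPU draws negligible current once the task is finished.

```python
def task_energy_uj(vdd, background_ma, extra_ma, time_ms):
    """Energy in microjoules: V * (I_background + I_extra) * t (V * mA * ms = uJ)."""
    return vdd * (background_ma + extra_ma) * time_ms


VDD = 3.3
BACKGROUND_MA = 30.0  # hypothetical: clock tree, leakage, always-on logic

# The extra current halves, but the running time doubles.
multiply = task_energy_uj(VDD, BACKGROUND_MA, extra_ma=10.0, time_ms=1.0)
shift_add = task_energy_uj(VDD, BACKGROUND_MA, extra_ma=5.0, time_ms=2.0)

print(f"multiply:  {multiply:.1f} uJ")
print(f"shift_add: {shift_add:.1f} uJ")
```

With these numbers the slower sequence costs more energy, because the background current is paid for twice as long.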

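Lowering the clock rate is probably the single biggest thing you can do to cut power consumption, and doing as much as possible in parallel is the easiest way to lower the clock rate. For example, using DMA instead of explicit interrupts lets the algorithmic processing finish in fewer cycles. If your CPU has odd addressing modes or parallel instructions (I'm looking at you, TMS320), I'd be surprised if you couldn't halve the execution time of tight loops for well under double the current, giving a net energy saving. And on the Blackfin family of CPUs, lowering the clock lets you lower the core voltage, which cuts power consumption dramatically. I imagine the same is true of other embedded processors.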
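The link between clock rate, core voltage, and energy can be sketched with the usual first-order model of CMOS switching power, P = a * C * V^2 * f. The activity factor, capacitance, voltages, and frequencies below are hypothetical and are not taken from any Blackfin datasheet; static (leakage) power is ignored.

```python
def dynamic_power_w(activity, capacitance_f, vdd, freq_hz):
    """First-order CMOS switching power: P = a * C * V^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


def task_dynamic_energy_j(activity, capacitance_f, vdd, freq_hz, cycles):
    """Energy for a fixed cycle count: P * (cycles / f) = a * C * V^2 * cycles."""
    return dynamic_power_w(activity, capacitance_f, vdd, freq_hz) * cycles / freq_hz


A, C, CYCLES = 0.2, 1e-9, 1_000_000  # hypothetical chip and workload

full_speed = task_dynamic_energy_j(A, C, vdd=1.2, freq_hz=400e6, cycles=CYCLES)
slow_same_v = task_dynamic_energy_j(A, C, vdd=1.2, freq_hz=200e6, cycles=CYCLES)
slow_low_v = task_dynamic_energy_j(A, C, vdd=0.9, freq_hz=200e6, cycles=CYCLES)

print(f"400 MHz at 1.2 V: {full_speed * 1e6:.1f} uJ")
print(f"200 MHz at 1.2 V: {slow_same_v * 1e6:.1f} uJ")
print(f"200 MHz at 0.9 V: {slow_low_v * 1e6:.1f} uJ")
```

In this model, halving the clock alone halves the power but leaves the switching energy per task unchanged; the energy saving comes from the lower voltage that the slower clock permits.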

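After clock rate, I'd bet that power consumption is dominated by external I/O access. In a low-power environment, things like cache misses hurt you twice: once in speed, and once in the trip to external memory. So loop unrolling, for example, might make things worse, as might doubling the number of instructions needed for a multiply.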

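All of which is to say that creative system architecture would probably have a much bigger impact on power consumption than telling the compiler to favor one set of instructions over another. But I have no data to back this up, and I'd love to see some.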
