Optimising this C (AVR) code
I have an interrupt handler that just isn't running fast enough for what I want to do. Basically, I'm using it to generate sine waves by outputting a value from a lookup table to a port on an AVR microcontroller, but unfortunately this isn't happening fast enough for me to get the wave frequency that I want. I was told that I should look at implementing it in assembly, as the compiler-generated assembly might be slightly inefficient and could perhaps be optimised, but after looking at the assembly code I really can't see what I could do any better.
This is the C code:
    const uint8_t amplitudes60[60] = {
        127, 140, 153, 166, 176, 191, 202, 212, 221, 230,
        237, 243, 248, 251, 253, 254, 253, 251, 248, 243,
        237, 230, 221, 212, 202, 191, 179, 166, 153, 140,
        127, 114, 101,  88,  75,  63,  52,  42,  33,  24,
         17,  11,   6,   3,   1,   0,   1,   3,   6,  11,
         17,  24,  33,  42,  52,  63,  75,  88, 101, 114
    };
    const uint8_t amplitudes13[13] = {127, 176, 221, 248, 202, 153, 101, 52, 17, 1, 6, 33, 75};
    const uint8_t amplitudes10[10] = {127, 176, 248, 202, 101, 52, 17, 1, 33, 75};

    volatile uint8_t numOfAmps = 60;
    volatile uint8_t *amplitudes = amplitudes60;
    volatile uint8_t amplitudePlace = 0;

    ISR(TIMER1_COMPA_vect)
    {
        PORTD = amplitudes[amplitudePlace];
        amplitudePlace++;
        if(amplitudePlace == numOfAmps)
        {
            amplitudePlace = 0;
        }
    }
amplitudes and numOfAmps are both changed by another interrupt routine that runs much more slowly than this one (it basically runs to change the frequency being played). At the end of the day I won't be using those exact arrays, but it will be a very similar setup. I'll most likely have one array with 60 values and another with just 30. This is because I'm building a frequency sweeper: at the lower frequencies I can afford to give it more samples, as I have more clock cycles to play with, but at the higher frequencies I'm very much strapped for time.
I do realise that I can get it to work with a lower sampling rate, but I don't want to go under 30 samples per period. I don't think having the pointer to the array makes it any slower, as the assembly to get a value from an array and the assembly to get a value from a pointer to an array seem the same (which makes sense).
At the highest frequency that I have to produce, I've been told I should be able to get it working with about 30 samples per sine wave period. At the moment, with 30 samples, the fastest it will run is about half the required maximum frequency, which I think means that my interrupt needs to run twice as fast.
When simulated, that code takes 65 cycles to complete. Again, I've been told I should be able to get it down to about 30 cycles at best.
This is the ASM code produced, with my thinking of what each line does next to it:
    ISR(TIMER1_COMPA_vect)
    {
        push r1
        push r0
        in   r0, 0x3f       ; save status reg
        push r0
        eor  r1, r1         ; generates a 0 in r1, used much later
        push r24
        push r25
        push r30
        push r31            ; all regs saved

      PORTD = amplitudes[amplitudePlace];
        lds  r24, 0x00C8    ; r24 <- amplitudePlace I'm pretty sure
        lds  r30, 0x00B4    ; these two lines load in the address of the
        lds  r31, 0x00B5    ; array, which would explain why it's a 16-bit number
                            ; if the atmega8 uses 16-bit addresses
        add  r30, r24       ; aha, this must be getting the ADDRESS OF THE element
        adc  r31, r1        ; at amplitudePlace in the array.
        ld   r24, Z         ; Z low is r30, makes sense. I think this is loading
                            ; the memory located at the address in r30/r31 and
                            ; putting it into r24
        out  0x12, r24      ; fairly sure this is putting the amplitude into PORTD

      amplitudePlace++;
        lds  r24, 0x011C    ; r24 <- amplitudePlace
        subi r24, 0xFF      ; subi is subtract immediate.. 0xFF = 255 so I'm
                            ; thinking with an 8-bit value x, x+1 = x - 255;
                            ; I might just trust that the compiler knows what it's
                            ; doing here rather than try to change it to an ADDI
        sts  0x011C, r24    ; puts the new value back to the address of the
                            ; variable

      if(amplitudePlace == numOfAmps)
        lds  r25, 0x00C8    ; r25 <- amplitudePlace
        lds  r24, 0x00B3    ; r24 <- numOfAmps
        cp   r25, r24       ; compares them
        brne .+4            ; 0xdc <__vector_6+0x54>
      {
          amplitudePlace = 0;
        sts  0x011C, r1     ; oh, this is why r1 was set to 0 earlier
      }
    }
        pop  r31            ; restores the registers
        pop  r30
        pop  r25
        pop  r24
        pop  r19
        pop  r18
        pop  r0
        out  0x3f, r0       ; 63
        pop  r0
        pop  r1
        reti
Apart from maybe using fewer registers in the interrupt so that I have fewer push/pops, I really can't see where this assembly code is inefficient.
My only other thought is that maybe the if statement could be got rid of if I could work out how to get an n-bit int datatype in C, so that the number wraps around when it reaches the end. By this I mean I would have 2^n samples and let the amplitudePlace variable just keep counting up, so that when it reaches 2^n it overflows and is reset to zero.
I did try simulating the code without the if part completely, and while it did improve the speed, it only took about 10 cycles off, so that it was at about 55 cycles for one execution. Unfortunately that still isn't quite fast enough, so I do need to optimise the code even further, which is hard considering that without the if it's only two lines!
My only other real thought is to see if I can store the static lookup tables somewhere that takes fewer clock cycles to access. The LDS instructions it uses to access the array all take 2 cycles, I think, so I probably wouldn't really be saving much time there, but at this stage I'm willing to try anything.
I'm totally at a loss as to where to go from here. I can't see how I could make my C code any more efficient, but I'm fairly new to this sort of thing, so I could be missing something. I would love any sort of help. I realise this is a pretty particular and involved problem, and normally I'd try to avoid asking that sort of question here, but I've been working on this for ages and am at a total loss, so I'll really take any help that I can get.
Comments (6)
I can see a few areas to start working on, listed in no particular order:
1. Reduce the number of registers to push, as each push/pop pair takes four cycles. For example, `avr-gcc` allows you to remove a few registers from its register allocator, so you can use them for register variables in that single ISR and be sure they still contain the value from last time. You might also get rid of the pushing of `r1` and the `eor r1,r1` if your program never sets `r1` to anything but `0`.
2. Use a local temporary variable for the new value of the array index to save unnecessary load and store instructions to that volatile variable: read the index once, work on the local copy, and store it back once.
3. Count backwards from 59 to 0 instead of from 0 to 59 to avoid the separate comparison instruction (the comparison with 0 happens anyway in the subtraction): decrement the index and reload it when it passes zero, instead of incrementing it and comparing it against the table length.
4. Perhaps use pointers and pointer comparisons (with precalculated values!) instead of array indexes. Whether this or counting backwards is more efficient needs to be checked. Maybe align the arrays to 256-byte boundaries and use only 8-bit registers for the pointers, to save on loading and saving the upper 8 bits of the addresses. (If you are running out of SRAM, you can still fit the contents of four of those 60-byte arrays into one 256-byte array and still get the advantage of all addresses consisting of 8 constant high bits and 8 variable low bits.)
   The problem is that pointers are 16 bits wide, whereas your simple array index was 8 bits. A trick that might help: design your array addresses so that the upper 8 bits of the address are constant (in assembly code, `hi8(array)`), and deal only with the lower 8 bits that actually change in the ISR. That does mean writing assembly code, though. The generated assembly code from above might be a good starting point for writing that version of the ISR in assembly.
5. If feasible from a timing point of view, adjust the sample buffer size to a power of 2 to replace the if-reset-to-zero part with a simple `i = (i+1) & ((1 << POWER)-1);`. If you want to go with the 8-bit/8-bit address split proposed in 4., going all the way to 256 for the power of two (and duplicating sample data as necessary to fill the 256-byte buffer) will even save you the AND instruction after the ADD.
6. If the ISR only uses instructions that do not affect the status register, stop pushing and popping `SREG`.
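Suggestions 2, 3 and 5 from the list above could be sketched roughly as follows. This is a host-side model written under assumptions, not the answerer's code: `port` stands in for `PORTD`, each function body would sit inside `ISR(TIMER1_COMPA_vect)` on the real target, and all names (`place`, `table60`, `table64`, `step_*`) are hypothetical.

```c
#include <stdint.h>

#define TABLE_LEN 60u
#define POWER     6u                    /* 1 << POWER = 64 entries */

static uint8_t table60[TABLE_LEN];      /* sample data, filled elsewhere */
static uint8_t table64[1u << POWER];
static volatile uint8_t port;           /* assumption: stands in for PORTD */
static volatile uint8_t place;          /* shared index, as in the question */

/* Suggestion 2: one volatile read and one volatile write per call. */
void step_local_copy(void)
{
    uint8_t i = place;                  /* single load */
    port = table60[i];
    if (++i == TABLE_LEN) {
        i = 0;
    }
    place = i;                          /* single store */
}

/* Suggestion 3: count down, so the wrap test is a test against zero
   that a compiler can often fold into the decrement. */
void step_count_down(void)
{
    uint8_t i = place;
    port = table60[i];
    if (i-- == 0) {
        i = TABLE_LEN - 1u;
    }
    place = i;
}

/* Suggestion 5: power-of-two length, wrap with a mask instead of a branch. */
void step_pow2(void)
{
    uint8_t i = place;
    port = table64[i];
    place = (uint8_t)((i + 1u) & ((1u << POWER) - 1u));
}
```

Counting down walks the table in reverse order, so the samples would have to be stored reversed if the playback order matters. Whether any variant actually saves cycles has to be checked in the generated assembly.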
General
The following might come in handy, especially for manually checking all the other assembly code for assumptions: a full disassembly listing of the firmware, for example one produced with `avr-objdump -S` on the linked ELF file. This gives a commented, complete assembly-language listing of the whole firmware image. You can use that to verify register (non-)usage. Note that startup code that runs only once, long before you first enable interrupts, will not interfere with your ISR's later exclusive use of registers.
If you decide not to write that ISR in assembly code directly, I would recommend you write the C code and check the generated assembly code after every compilation, in order to immediately observe what your changes end up generating.
You might end up writing a dozen or so variants of the ISR in C and assembly, adding up the cycles for each variant, and then choosing the best one.
Note: Without doing any register reservation, I end up with something around 31 cycles for the ISR (excluding entering and leaving, which adds another 8 or 10 cycles). Completely getting rid of the register pushing would get the ISR down to 15 cycles. Changing to a sample buffer with a constant size of 256 bytes and giving the ISR exclusive use of four registers allows getting down to 6 cycles being spent in the ISR (plus 8 or 10 to enter/leave).
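The constant 256-byte sample buffer mentioned in the note can be illustrated in plain C. This is only a sketch under assumptions: `port` stands in for `PORTD`, the names are hypothetical, and a base period of 64 samples is assumed so that it divides 256 evenly (a 60-sample period would not repeat cleanly at the wrap). It does not show the register reservation, which is specific to avr-gcc.

```c
#include <stdint.h>

#define PERIOD 64u                  /* assumed base period; must divide 256 */

static uint8_t table256[256];
static volatile uint8_t port;       /* assumption: stands in for PORTD */
static uint8_t idx;                 /* 8-bit index wraps 255 -> 0 by itself */

/* Fill all 256 bytes by repeating one period of sample data. */
void fill_table(const uint8_t *period_data)
{
    for (uint16_t i = 0; i < 256u; i++) {
        table256[i] = period_data[i % PERIOD];
    }
}

/* Body of the timer ISR on the real target. */
void step_256(void)
{
    port = table256[idx];
    idx++;                          /* overflow replaces both the compare and the AND */
}
```

Because the index is exactly one byte wide, no wrap-around code is needed at all, which is where the saved AND instruction comes from.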
I'd say the best thing would be to write your ISR in pure assembler. It's very short and simple code, and you have the existing disassembly to guide you. But for something of this nature, you ought to be able to do better: e.g. use fewer registers, to save on `push` and `pop`; refactor it so that it's not loading `amplitudePlace` from memory three separate times, etc.
Must you share all those variables with the rest of the program? Since every such variable you share must be volatile, the compiler isn't allowed to optimize accesses to it. At least amplitudePlace looks like it could be changed to a local static variable, and then the compiler may be able to optimize it further.
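A minimal sketch of that idea, assuming the index really is needed only inside the handler. `port` stands in for `PORTD`, the table length is fixed at a hypothetical `NUM_AMPS`, and the return value exists only so the behaviour can be observed on a host; a real ISR returns nothing.

```c
#include <stdint.h>

#define NUM_AMPS 60u

static uint8_t amplitudes[NUM_AMPS];
static volatile uint8_t port;       /* assumption: stands in for PORTD */

uint8_t step_static_index(void)
{
    static uint8_t place;           /* private to this function, not volatile */

    port = amplitudes[place];
    if (++place == NUM_AMPS) {
        place = 0;
    }
    return place;                   /* for host-side inspection only */
}
```

Since `place` is no longer volatile, the compiler is free to keep it in a register across the increment and the comparison instead of reloading it from SRAM each time. If another routine has to reset the index when it switches tables, this approach no longer applies as written.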
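To clarify, your interrupt should do nothing more than output the table entry, increment the index, and mask the index so that it wraps.
This requires your table to be 64 entries long. If you can choose the address of your table, you can get away with a single pointer: increment it and AND it with 0xFFBF (this works when the table starts at an address whose lowest seven bits are zero, so that stepping past the 64th entry sets bit 6 and the mask clears it again).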
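Both variants described above, the masked index and the masked pointer, might look like this in a host-side model. This is a sketch under assumptions: `port` stands in for `PORTD`, the names are hypothetical, and the table is aligned to 128 bytes with C11 `_Alignas` so that bits 0 to 6 of its address are zero. On the AVR the pointer is 16 bits wide and the mask is 0xFFBF; here `uintptr_t` is used so the same idea works with wider host pointers.

```c
#include <stdint.h>

/* 128-byte alignment: bits 0..6 of the base address are zero. */
static _Alignas(128) uint8_t table64[64];
static volatile uint8_t port;           /* assumption: stands in for PORTD */
static uint8_t place;
static const uint8_t *ptr = table64;

/* Index version: post-increment, then wrap with a mask. */
void step_index_mask(void)
{
    port = table64[place++];
    place &= 63u;
}

/* Pointer version: after 64 increments bit 6 of the address becomes set;
   clearing it returns the pointer to the start of the table. */
void step_pointer_mask(void)
{
    port = *ptr;
    ptr = (const uint8_t *)((uintptr_t)(ptr + 1) & ~(uintptr_t)0x40u);
}
```

The pointer version avoids adding the index to the base address on every interrupt, at the cost of constraining where the table is placed in memory.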
If using variables instead of fixed constants is slowing things down, you can replace the pointer variable with a specific array, so that the table address and length are compile-time constants.
Then you change the interrupt pointer to use a different function for each waveform. This is not likely to be a big saving, but we're getting down to tens of cycles in total.
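One way to picture the per-waveform handlers is shown below. It is a host-side sketch with hypothetical names, and the function pointer is only a stand-in: how a handler is actually selected on the target (and what that selection costs) depends on the hardware and toolchain and is not stated in the answer.

```c
#include <stdint.h>

static uint8_t amplitudes60[60];
static uint8_t amplitudes30[30];
static volatile uint8_t port;       /* assumption: stands in for PORTD */

/* Each handler has its table address and length baked in as constants. */
void step_wave60(void)
{
    static uint8_t place;
    port = amplitudes60[place];
    if (++place == 60u) {
        place = 0;
    }
}

void step_wave30(void)
{
    static uint8_t place;
    port = amplitudes30[place];
    if (++place == 30u) {
        place = 0;
    }
}

/* The slower "change frequency" routine would switch this pointer. */
void (*volatile current_step)(void) = step_wave60;
```

With fixed tables the compiler no longer has to load the array address and length from SRAM, which removes several of the `lds` instructions seen in the question's listing.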
As for register usage: once you have a really simple ISR like this, you can check the prologue and epilogue of the ISR, which push and pop the processor state. If your ISR only uses one register, you can write it in assembler and save and restore only that one register. This will reduce the interrupt overhead without affecting the rest of the program. Some compilers might do this for you, but I doubt it.
If there is time and space, you can also create a long table and replace the ++ with += freq, where freq makes the waveform an integer multiple of the base frequency (2x, 3x, 4x, etc.) by skipping that many samples.
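The `+= freq` idea can be sketched like this, assuming a 256-entry table holding one full period so that an 8-bit index wraps on its own. `port` stands in for `PORTD` and all names are hypothetical.

```c
#include <stdint.h>

static uint8_t long_table[256];     /* one full period, 256 samples */
static volatile uint8_t port;       /* assumption: stands in for PORTD */
static uint8_t place;
static volatile uint8_t freq = 1;   /* 1 = base frequency, 2 = double, ... */

void step_by_freq(void)
{
    port = long_table[place];
    place = (uint8_t)(place + freq);    /* wraps modulo 256 */
}
```

With `freq` set to 3, for example, every third sample is emitted, so the output completes a period in a third of the interrupts and its frequency is three times the base frequency, at the price of fewer samples per period.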
Instead of stepping through the table one entry at a time with varying interrupt rates, have you considered turning the problem around and stepping at a variable rate with a fixed interrupt frequency? That way the ISR itself would be heavier, but you may be able to afford to run it at a lower rate. Plus, with a little fixed-point arithmetic you can easily generate a wider spectrum of frequencies without messing around with multiple tables.
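A common way to do this is a phase accumulator. The sketch below is one possible reading of the suggestion, not the answerer's code: it assumes a 256-entry sine table and an 8.8 fixed-point phase whose high byte is the table index. `port` stands in for `PORTD`, and the names and the helper `phase_step_for` are hypothetical.

```c
#include <stdint.h>

static uint8_t sine256[256];            /* one full period */
static volatile uint8_t port;           /* assumption: stands in for PORTD */
static uint16_t phase;                  /* 8.8 fixed point: high byte = index */
static volatile uint16_t phase_step;    /* table entries per interrupt, times 256 */

/* Body of the fixed-rate timer ISR. */
void step_phase(void)
{
    phase = (uint16_t)(phase + phase_step);   /* wraps at 65536 = one period */
    port = sine256[phase >> 8];
}

/* Step size for a wanted output frequency at a fixed interrupt rate.
   Assumes out_hz < isr_hz so that the result fits in 16 bits. */
uint16_t phase_step_for(uint32_t out_hz, uint32_t isr_hz)
{
    return (uint16_t)(((uint64_t)out_hz * 65536u + isr_hz / 2u) / isr_hz);
}
```

For instance, with a 64 kHz interrupt rate and a wanted 1 kHz output, the step works out to 1024, which is four table entries per interrupt. The step would normally be computed outside the ISR whenever the frequency changes.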
Anyway, there are a hundred and one ways of cheating to save cycles for this type of problem, if you can afford to bend your requirements a little to suit the hardware. For instance, you could chain your timer's output to clock another hardware timer, and use the second timer's counter as your table index. You might reserve global registers or abuse unused I/O registers to store variables. You can look up two entries at a time (or interpolate) in your COMPA interrupt and set up a tiny second COMPB interrupt in between to emit the buffered entry. And so on, and so forth.
With a little hardware abuse and carefully crafted assembly code you should be able to do this in 15 cycles or so without too much trouble. Whether you can make it play nice with the rest of the system is another question.
Maybe it suffices to get rid of the conditional and the comparison altogether by using an arithmetic expression, i.e. taking the incremented index modulo the number of samples.
If your CPU executes the modulo operation with reasonable speed, this should be much faster. If it still doesn't suffice, try writing this version in assembler.
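The modulo version might be written as below. This is a host-side sketch with hypothetical names, with `port` standing in for `PORTD`. Note that classic AVR cores have no hardware divide instruction, so on that target `%` with a run-time divisor is normally implemented by a library routine and may well be slower than the compare-and-reset it replaces; it is worth checking the generated assembly before adopting it.

```c
#include <stdint.h>

static uint8_t amplitudes[60];
static volatile uint8_t port;           /* assumption: stands in for PORTD */
static volatile uint8_t num_amps = 60;  /* may be changed by other code */
static volatile uint8_t place;

void step_modulo(void)
{
    uint8_t i = place;
    port = amplitudes[i];
    place = (uint8_t)((i + 1u) % num_amps);   /* branch-free wrap-around */
}
```

If the number of samples is a power of two, the compiler can usually reduce the modulo to a single AND, which brings this back to the masking approach.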