How can I optimize this image copy function for an embedded system?
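The function below reads an image a page at a time using read_page(pageIter, pageArr, PAGESIZE) and outputs the data on the DOUT and CCLK pins.
I was told it was inefficient, but I can't seem to find a way to make it faster. It is basically a pipe, running on a 64-pin microprocessor, between two memory spaces: one holds the image and the other receives it.
I've used the register keyword and replaced array indexing with pointer arithmetic, but it needs to be faster.
Thanks!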
/*
    Port C Pin Out
*/
#define BIT0 0x01 // CCLK
#define BIT1 0x02 // CS_B
#define BIT2 0x04 // INIT_B
#define BIT3 0x08 // PROG_B
#define BIT4 0x10 // RDRW_B
#define BIT5 0x20 // BUSY_OUT
#define BIT6 0x40 // DONE
#define BIT7 0x80 // DOUT (DIN)
/*
    PAGE
*/
#define PAGESIZE 1024 // Example

void copyImage(ulong startAddress, ulong endAddress)
{
    ulong pageIter;
    uchar *eByte, *byteIter, pageArr[PAGESIZE];
    register uchar bitIter, portCvar;

    portCvar = PORTC;
    /* Loops through pages in an image using ulong type */
    for(pageIter = startAddress ; pageIter <= endAddress ; pageIter += PAGESIZE)
    {
        read_page(pageIter, pageArr, PAGESIZE);
        eByte = pageArr+PAGESIZE; /* one past the last byte of the page */
        /* Loops through bytes in a page using pointer to uchar (pointer to a byte) */
        for(byteIter = pageArr; byteIter < eByte; byteIter++)
        {
            /* Loops through bits in byte and writes to PORTC - DIN AND CCLK */
            for(bitIter = 0x01; bitIter != 0x00; bitIter = bitIter << 1)
            {
                PORTC = portCvar | BIT0;
                (bitIter & *byteIter) ? (PORTC = portCvar & ~BIT7) : (PORTC = portCvar | BIT7);
                PORTC = portCvar & ~BIT0;
            }
        }
    }
}
Answers (4)
You can probably go faster by unrolling the transmission of each byte, after precomputing the constant values once outside the image loop.
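A minimal sketch of that idea, not the answerer's original code. It assumes the receiver samples DOUT on the rising edge of CCLK and tolerates DOUT changing in the same write that drops CCLK, which brings each bit down to two port writes. The names port_write, port_log, SEND_BIT and send_page_unrolled are hypothetical; port_write is a host-side stand-in that records writes, and on the target it would simply be PORTC = v. As in the question's code, bits go out LSB first and a set bit drives DOUT low.

```c
#include <stddef.h>
#include <stdint.h>

#define CCLK 0x01u /* BIT0 in the question */
#define DOUT 0x80u /* BIT7 in the question */

/* Hypothetical host-side stand-in for the port: it records each write.
   On the target, port_write(v) would simply be PORTC = v. */
static uint8_t port_log[64];
static size_t port_log_len;

static void port_write(uint8_t v)
{
    if (port_log_len < sizeof port_log)
        port_log[port_log_len++] = v;
}

/* One bit: change DOUT while dropping CCLK, then raise CCLK.
   As in the question, a set bit drives DOUT low. */
#define SEND_BIT(b, mask, d_low, d_high)                      \
    do {                                                      \
        uint8_t d_ = ((b) & (mask)) ? (d_low) : (d_high);     \
        port_write(d_);                                       \
        port_write((uint8_t)(d_ | CCLK));                     \
    } while (0)

static void send_page_unrolled(const uint8_t *page, size_t len,
                               uint8_t port_state)
{
    /* Precomputed once, outside the byte loop. */
    const uint8_t d_low = (uint8_t)(port_state & ~(CCLK | DOUT));
    const uint8_t d_high = (uint8_t)(d_low | DOUT);

    for (size_t i = 0; i < len; i++) {
        const uint8_t b = page[i];
        /* The bit loop is unrolled by hand: no counter, no loop test. */
        SEND_BIT(b, 0x01, d_low, d_high);
        SEND_BIT(b, 0x02, d_low, d_high);
        SEND_BIT(b, 0x04, d_low, d_high);
        SEND_BIT(b, 0x08, d_low, d_high);
        SEND_BIT(b, 0x10, d_low, d_high);
        SEND_BIT(b, 0x20, d_low, d_high);
        SEND_BIT(b, 0x40, d_low, d_high);
        SEND_BIT(b, 0x80, d_low, d_high);
    }
}
```

Calling send_page_unrolled(pageArr, PAGESIZE, PORTC) after each read_page would replace the two inner loops. Whether this actually beats the compiler's own unrolling is something only the disassembly can tell you.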
That loop is the key. I would compile it with production optimization flags and then look at the disassembly. The compiler may do all kinds of clever things, like unrolling the loop or simplifying the loop condition. If I didn't like what I saw there, I'd start tweaking the C code to help the compiler find a good optimization. If that proved impossible, I might use inline assembly to get what I want.
Assuming we can go as fast as possible (that is, the delays in the loop aren't there to satisfy setup and hold times at the receiver), I'd want to get that loop down to as few instructions as possible. Can you set BIT0 and the data bit at the same time, or does that create a hazard at the receiver? If you can, that would save an instruction or two. Many micro-optimizations depend on the specific instruction set. If the data has lots of 0x00 or 0xFF bytes, you could make special unrolled cases where the data bit doesn't change and BIT0 toggles 8 times. You could also make 16 unrolled cases, one per nibble value, and switch into them twice for each byte.
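The constant-byte special case could look roughly like the sketch below. It is an illustration under assumptions, not tested on hardware: port_write, port_log, clock_bit and send_byte are hypothetical names, port_write is a host-side stand-in for PORTC = v, and the receiver is assumed to sample DOUT on the rising edge of CCLK. The polarity follows the question's code (a set bit drives DOUT low).

```c
#include <stddef.h>
#include <stdint.h>

#define CCLK 0x01u /* BIT0 in the question */
#define DOUT 0x80u /* BIT7 in the question */

/* Hypothetical host-side stand-in for the port: it records each write.
   On the target, port_write(v) would simply be PORTC = v. */
static uint8_t port_log[64];
static size_t port_log_len;

static void port_write(uint8_t v)
{
    if (port_log_len < sizeof port_log)
        port_log[port_log_len++] = v;
}

/* Present the data with CCLK low, then raise CCLK. */
static void clock_bit(uint8_t d)
{
    port_write(d);
    port_write((uint8_t)(d | CCLK));
}

/* d_low / d_high: port values with CCLK low and DOUT low / high. */
static void send_byte(uint8_t b, uint8_t d_low, uint8_t d_high)
{
    if (b == 0x00 || b == 0xFF) {
        /* DOUT is identical for all eight bits: choose it once,
           then only CCLK toggles. */
        const uint8_t d = (b == 0xFF) ? d_low : d_high;
        clock_bit(d); clock_bit(d); clock_bit(d); clock_bit(d);
        clock_bit(d); clock_bit(d); clock_bit(d); clock_bit(d);
        return;
    }
    /* General case: test each bit, LSB first. */
    for (uint8_t mask = 0x01; mask != 0; mask = (uint8_t)(mask << 1))
        clock_bit((b & mask) ? d_low : d_high);
}
```

The fast path only pays off if the image really contains long runs of 0x00 or 0xFF; otherwise the extra comparison per byte is pure overhead, so it is worth measuring on real data.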
To start with, this loop is broken. bitIter is a uchar (which I assume is an unsigned 8-bit character). By shifting it to the left, it will eventually get the value 0x80 for the intended final iteration. After the next shift it will get the value 0.
Over to efficiency. Depending on the architecture, the operation PORTC = PORTC | BIT0 might compile to a single bit-set instruction. However, it might also compile to a read, a bit set in a register, and a store.
As mentioned before, try to set BIT0 and BIT7 at the same time, if the hardware permits this.
I would try a do ... while loop. It takes care of the termination problem, and you get rid of the unnecessary loop test before the first iteration (unless your compiler has already optimized it away).
You could also try to unroll the loop by hand eight times, once for every bit.
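The code that originally accompanied this answer is missing here, so the following is only a sketch of what a do ... while version might look like. Assumptions: port_write, port_log and send_byte are hypothetical names, port_write is a host-side stand-in for PORTC = v, the caller has precomputed d_low and d_high (port values with CCLK low and DOUT low or high), and the receiver samples on the rising edge of CCLK. DOUT changes in the same write that drops CCLK, which is one way of setting both bits at once.

```c
#include <stddef.h>
#include <stdint.h>

#define CCLK 0x01u /* BIT0 in the question */
#define DOUT 0x80u /* BIT7 in the question */

/* Hypothetical host-side stand-in for the port: it records each write.
   On the target, port_write(v) would simply be PORTC = v. */
static uint8_t port_log[64];
static size_t port_log_len;

static void port_write(uint8_t v)
{
    if (port_log_len < sizeof port_log)
        port_log[port_log_len++] = v;
}

static void send_byte(uint8_t b, uint8_t d_low, uint8_t d_high)
{
    uint8_t mask = 0x01; /* LSB first, as in the question */

    do {
        /* As in the question, a set bit drives DOUT low. */
        const uint8_t d = (b & mask) ? d_low : d_high;

        port_write(d);                   /* new data, CCLK low */
        port_write((uint8_t)(d | CCLK)); /* rising edge        */

        mask = (uint8_t)(mask << 1);     /* 0x80 << 1 wraps to 0 */
    } while (mask != 0);                 /* tested after the body only */
}
```

The explicit cast on the shift keeps the wrap to zero visible: the shift itself happens in int after integer promotion, and the truncation back to 8 bits is what ends the loop.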
I'm assuming that PORTC is in a known state when you enter this function, i.e. the data and clock lines are both 0 (or the clock is low and data is high).
If that assumption holds, you should be able to avoid even the conditionals in @6502's answer by first setting
value = ~(*byteIter);
and then doing the same unconditional port writes 8 times, with a slightly different sequence if BIT7 starts high.
The advantage is that this avoids the conditionals, which can play havoc with the speed of a heavily pipelined processor.
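The original code for this answer is not preserved here, so this is only one possible reading of it. The sketch assumes DOUT and CCLK are both low in base on entry, that the receiver samples on the rising edge of CCLK, and uses hypothetical names (port_write, port_log, send_byte_branchless), with port_write standing in for PORTC = v on the host. The byte is inverted once because the question's code drives DOUT low for a set bit; after that, each bit is shifted straight into the DOUT position with no data-dependent test.

```c
#include <stddef.h>
#include <stdint.h>

#define CCLK 0x01u /* BIT0 in the question */
#define DOUT 0x80u /* BIT7 in the question */

/* Hypothetical host-side stand-in for the port: it records each write.
   On the target, port_write(v) would simply be PORTC = v. */
static uint8_t port_log[64];
static size_t port_log_len;

static void port_write(uint8_t v)
{
    if (port_log_len < sizeof port_log)
        port_log[port_log_len++] = v;
}

/* base: port value with both CCLK and DOUT low (assumed known on entry). */
static void send_byte_branchless(uint8_t byte, uint8_t base)
{
    uint8_t value = (uint8_t)~byte; /* invert once: set bit -> DOUT low */

    for (int i = 0; i < 8; i++) {
        /* Move the current LSB up to bit 7 (DOUT); the cast discards
           everything shifted past bit 7. */
        const uint8_t d = (uint8_t)(base | (uint8_t)(value << 7));

        port_write(d);                   /* data valid, CCLK low */
        port_write((uint8_t)(d | CCLK)); /* rising edge          */
        value >>= 1;
    }
}
```

The loop counter is the only branch left, and it does not depend on the data; it can be unrolled by hand or left to the compiler. On a small microcontroller without a barrel shifter the 7-bit shift may cost more than the branch it replaces, so check the generated code.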