Why does LDR sometimes take 20 CPU cycles?
I am having an issue with the LDR and STR instructions in ARM Cortex-M4 assembly. For some reason, they take far longer to read/write some parts of memory than others.
To illustrate this, I've set up a simple example:
I have created a project with a main C file and a neighboring ".S" file containing the assembly code. I've declared the assembly functions in my C file using extern "C".
//Add the asm functions to our C code
extern "C" void LoadTest(uint32_t *memory_address);
extern "C" void LoadTestLoop(uint32_t *memory_address);
Here is what the program does:
void perform_test()
{
  //Time
  register uint32_t register_before_time = before_time;
  register uint32_t register_after_time = after_time;
  register uint32_t *input_address = (uint32_t *)0x400E9000;

  register_before_time = ARM_DWT_CYCCNT;
  //Time measurement occurs in here!
  LoadTestLoop(input_address);
  register_after_time = ARM_DWT_CYCCNT;

  Serial.print(" Time: ");
  Serial.println(register_after_time - register_before_time - time_error);
}
It prints the number of cycles taken to execute whatever runs between the "register_before_time=ARM_DWT_CYCCNT;" and "register_after_time=ARM_DWT_CYCCNT;" lines.
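For completeness: on a bare-metal Cortex-M4 the DWT cycle counter does not run until it is enabled (Teensyduino, where ARM_DWT_CYCCNT comes from, does this at startup). A sketch of the init sequence, using CMSIS register names; this is hardware-specific and only runs on the target:

```c
/* Sketch: enabling the DWT cycle counter on a Cortex-M4 (CMSIS names).
   Many frameworks already do this at startup; shown for completeness. */
static void enable_cycle_counter(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* power up the DWT unit  */
    DWT->CYCCNT = 0;                                /* start counting from 0  */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;           /* enable CYCCNT          */
}
```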
Here are the assembly subroutines whose speed we will be testing:
.global LoadTest
LoadTest:
ldr r1, [r0] /*Load value into r1 from memory_address*/
orr r1, #0xC0 /*Set bits 6 and 7*/
str r1, [r0] /*Store the changed value back into memory_address*/
bx lr
.global LoadTestLoop
LoadTestLoop:
mov r2, #255 /* Set r2 to be 255 for the loop*/
TestLoop: /*Same code as before*/
ldr r1, [r0]
orr r1, #0xC0
str r1, [r0]
subs r2, r2, #1 /*Decrement r2 + set Z flag if it's zero*/
bne TestLoop /*Repeat until r2==0*/
bx lr
LoadTest – Loads a value from the address we give it. ORs the value with 0xC0 and then stores it back to the same address.
LoadTestLoop – Does the same thing, but in a loop, 255 times. This way we can get an average of how long one loop iteration takes, and minimize the timing error from the branch instructions into and out of the function.
Note: to further minimize measurement error, the address to work on is handed to both functions through the input_address pointer, outside the timed region:
register uint32_t *input_address = (uint32_t *)0x400E9000;
Test results and the issue:
I ran both tests against a normal C variable
uint32_t test_value = 255;
register uint32_t *input_address = &test_value;
and against a configuration register inside the microcontroller. Note that in the datasheet these registers are presented as plain memory.
register uint32_t *input_address = (uint32_t *)0x400E9000;
On average, LoadTest took 9 cycles to execute for the standard variable, but much longer, 27 cycles, for the control register. The LoadTestLoop tests reinforced this: the standard variable averaged 1541 cycles (6 cycles per iteration), while the control register took an astounding 12227 cycles, which works out to a crazy 47 cycles per iteration!
Why is this happening?
Why do LDR and STR sometimes take far longer to execute? Does it have something to do with the little "b" written next to the cycle count on this instruction-set website? Clicking on it just sends you back to the same page.
Does anybody know why this is happening? I’ve been bugged by this question for a long time and would really like to know.
Thank you for the help.
This is completely normal.
In general, a load from memory takes as long as it takes. The timing isn't under the control of the CPU itself, so a quoted cycle count can only represent a "best case". If the CPU can't fulfill the load from its own internal structures (e.g. store buffer or L1 cache), then it just has to put the request out on the memory bus and stall until the memory subsystem responds. (Or go on executing later instructions out-of-order, if so equipped and if it can find some that don't depend on the result of the load.)
The actual time taken can be highly variable, depending for instance on whether the load hits or misses L2 or L3 cache, whether another core or external device holds a bus lock, etc. If the machine has no cache and all memory is fast SRAM, then the time could be pretty stable.
But in your case the address you're loading is actually mapped to a hardware device. So you're not really reading RAM at all, you're doing I/O. In this case, the response has to come from the device itself, and the device can essentially take as long as it needs. If you need to be able to predict the time, then you need to be looking at the documentation of that device (and any interface hardware in between), not at cycle counts in the CPU manual.