How to measure program execution time on an ARM Cortex-A8 processor?
I'm using an ARM Cortex-A8 based processor, the i.MX515, running the Ubuntu 9.10 Linux distribution. I'm running a very large application written in C, and I use gettimeofday() to measure the time my application takes.
int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    ....
    ....
    ....
    gettimeofday(&end, NULL);
    return 0;
}
This method was sufficient to see which blocks of my application took how much time. But now that I'm trying to optimize my code very thoroughly, I see a lot of fluctuation in the gettimeofday() timings between successive runs (before and after my optimizations), so I can't determine the actual execution times, and hence the impact of my improvements.
Can anyone suggest what I should do?
If accessing the cycle counter is the way to go (an idea suggested on the ARM website for the Cortex-M3), can anyone point me to some code that shows the steps I have to follow to access the timer registers on the Cortex-A8?
If this method is not very accurate, please suggest some alternatives.
Thanks
Follow-ups
Follow-up 1: I wrote the following program with CodeSourcery and built an executable. When I tried running it on the board, I got an "Illegal instruction" message :(
static inline unsigned int get_cyclecount (void)
{
    unsigned int value;
    // Read CCNT Register
    asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
    return value;
}

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
    // in general enable all counters (including cycle counter)
    int32_t value = 1;

    // perform reset:
    if (do_reset)
    {
        value |= 2; // reset all counters to zero.
        value |= 4; // reset cycle counter to zero.
    }

    if (enable_divider)
        value |= 8; // enable "by 64" divider for CCNT.

    value |= 16;

    // program the performance-counter control-register:
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));

    // enable all counters:
    asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));

    // clear overflows:
    asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

int main()
{
    /* enable user-mode access to the performance counter */
    asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

    /* disable counter overflow interrupts (just in case) */
    asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

    init_perfcounters (1, 0);

    // measure the counting overhead:
    unsigned int overhead = get_cyclecount();
    overhead = get_cyclecount() - overhead;

    unsigned int t = get_cyclecount();

    // do some stuff here..
    printf("\nHello World!!");

    t = get_cyclecount() - t;

    printf ("function took exactly %d cycles (including function call) ", t - overhead);

    get_cyclecount();

    return 0;
}
Follow-up 2: I wrote to Freescale for support, and they sent back the following reply and a program (I did not understand much of it):
Here is what we can help you with right now:
I am attaching an example of code that sends a stream using the UART. From your code, it seems that you are not initializing the MPU correctly.
#include <stdio.h>
#include <stdlib.h>

#define BIT13 0x02000
#define R32 volatile unsigned long *
#define R16 volatile unsigned short *
#define R8 volatile unsigned char *
#define reg32_UART1_USR1 (*(R32)(0x73FBC094))
#define reg32_UART1_UTXD (*(R32)(0x73FBC040))
#define reg16_WMCR (*(R16)(0x73F98008))
#define reg16_WSR (*(R16)(0x73F98002))
#define AIPS_TZ1_BASE_ADDR 0x70000000
#define IOMUXC_BASE_ADDR (AIPS_TZ1_BASE_ADDR+0x03FA8000)

typedef unsigned long U32;
typedef unsigned short U16;
typedef unsigned char U8;

void serv_WDOG()
{
    reg16_WSR = 0x5555;
    reg16_WSR = 0xAAAA;
}

void outbyte(char ch)
{
    while( !(reg32_UART1_USR1 & BIT13) );
    reg32_UART1_UTXD = ch ;
}

void _init()
{
}

void pause(int time)
{
    int i;
    for ( i=0 ; i < time ; i++);
}

void led()
{
    //Write to Data register [DR]
    *(R32)(0x73F88000) = 0x00000040; // 1 --> GPIO 2_6
    pause(500000);
    *(R32)(0x73F88000) = 0x00000000; // 0 --> GPIO 2_6
    pause(500000);
}

void init_port_for_led()
{
    //GPIO 2_6 [73F8_8000] EIM_D22 (AC11) DIAG_LED_GPIO
    //ALT1 mode
    //IOMUXC_SW_MUX_CTL_PAD_EIM_D22 [+0x0074]
    //MUX_MODE [2:0] = 001: Select mux mode: ALT1 mux port: GPIO[6] of instance: gpio2.
    // IOMUXC control for GPIO2_6
    *(R32)(IOMUXC_BASE_ADDR + 0x74) = 0x00000001;
    //Write to DIR register [DIR]
    *(R32)(0x73F88004) = 0x00000040; // 1 : GPIO 2_6 - output
    *(R32)(0x83FDA090) = 0x00003001;
    *(R32)(0x83FDA090) = 0x00000007;
}

int main ()
{
    int k = 0x12345678 ;
    reg16_WMCR = 0 ; // disable watchdog
    init_port_for_led() ;
    while(1)
    {
        printf("Hello word %x\n\r", k ) ;
        serv_WDOG() ;
        led() ;
    }
    return(1) ;
}
Comments (4)
Accessing the performance counters isn't difficult, but you have to enable them from kernel mode. By default the counters are disabled.
In a nutshell, you have to execute the following two lines inside the kernel. Either a loadable module or just adding the two lines somewhere in the board init will do:
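The listing that followed this paragraph did not survive in this copy. The two statements below are taken from the main() quoted in the question's follow-up 1, which carries the same comments and appears to be based on this answer. Wrapping them in a function named pmu_enable_user_access and guarding them with __arm__ are assumptions made for this sketch; on a real system the body has to run in kernel mode, for example from a module's init function.

```c
/*
 * Sketch only: the ARM branch must execute in a privileged (kernel) context.
 * Returns 1 if the registers were written, 0 when built for a non-ARM target.
 */
static int pmu_enable_user_access(void)
{
#if defined(__arm__)
    /* enable user-mode access to the performance counters */
    __asm__ volatile ("MCR p15, 0, %0, c9, c14, 0" :: "r"(1));
    /* disable counter overflow interrupts (just in case) */
    __asm__ volatile ("MCR p15, 0, %0, c9, c14, 2" :: "r"(0x8000000f));
    return 1;
#else
    return 0;   /* assumed fallback: nothing to program on other architectures */
#endif
}
```

Running the same two instructions from a user-space program, as the question's main() does, is the likely source of the "Illegal instruction" message, since they are privileged.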
Once you have done this, the cycle counter will start incrementing for each cycle. Overflows of the register will go unnoticed and don't cause any problems (except that they might mess up your measurements).
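Because the counter is a 32-bit register, a single wrap between two reads is harmless as long as the difference is computed in unsigned 32-bit arithmetic. A minimal sketch of that arithmetic follows; the helper names cycle_delta and seconds_until_wrap, and the 800 MHz clock used in the note below, are illustrative assumptions rather than part of the answer.

```c
#include <stdint.h>

/* Elapsed counts between two reads. Correct across a single wrap,
 * because unsigned subtraction is performed modulo 2^32. */
static uint32_t cycle_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Seconds until a 32-bit counter wraps, given the core clock in Hz
 * and the divider in effect (1, or 64 with the divider enabled). */
static double seconds_until_wrap(double clock_hz, unsigned divider)
{
    return 4294967296.0 * (double)divider / clock_hz;
}
```

With an assumed 800 MHz core clock, seconds_until_wrap(800e6, 1) is roughly 5.4 seconds, and roughly 344 seconds with the divide-by-64 setting. Measurements longer than that need the divider or a different clock source.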
Now you want to access the cycle counter from user mode:
We start with a function that reads the register:
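The function itself is missing from this copy of the answer. The version below follows the get_cyclecount() quoted in the question's follow-up 1. The non-ARM fallback based on clock() is an assumption added only so that the sketch compiles elsewhere; it returns processor-time ticks, not CPU cycles.

```c
#include <time.h>

static inline unsigned int get_cyclecount(void)
{
#if defined(__arm__)
    unsigned int value;
    /* Read the CCNT register (cycle counter) */
    __asm__ volatile ("MRC p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    /* Assumed stand-in for non-ARM builds: processor-time ticks */
    return (unsigned int)clock();
#endif
}
```

On the target, the read only succeeds from user mode after user access has been enabled from the kernel.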
And you most likely want to reset and set the divider as well:
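This listing is missing as well. The question's follow-up 1 quotes an init_perfcounters() with the bit values used below. As a sketch, the computation of the control word is split out into a helper (pmnc_control_value, a name assumed here) so that it can be checked on any machine; the register writes stay ARM-only.

```c
#include <stdint.h>

/* Build the value written to the performance-counter control register. */
static uint32_t pmnc_control_value(int do_reset, int enable_divider)
{
    uint32_t value = 1;      /* enable all counters (including cycle counter) */

    if (do_reset) {
        value |= 2;          /* reset all counters to zero */
        value |= 4;          /* reset cycle counter to zero */
    }
    if (enable_divider)
        value |= 8;          /* enable "by 64" divider for CCNT */
    value |= 16;             /* set without comment in the question's code */
    return value;
}

static inline void init_perfcounters(int do_reset, int enable_divider)
{
    uint32_t value = pmnc_control_value(do_reset, enable_divider);
#if defined(__arm__)
    /* program the performance-counter control register */
    __asm__ volatile ("MCR p15, 0, %0, c9, c12, 0" :: "r"(value));
    /* enable all counters */
    __asm__ volatile ("MCR p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
    /* clear overflows */
    __asm__ volatile ("MCR p15, 0, %0, c9, c12, 3" :: "r"(0x8000000f));
#else
    (void)value;             /* assumed fallback: nothing to program */
#endif
}
```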
do_reset will set the cycle counter to zero. Easy as that.
enable_divider will enable the 1/64 cycle divider. Without this flag set you'll be measuring each cycle. With it enabled, the counter is incremented once every 64 cycles. This is useful if you want to measure long times that would otherwise cause the counter to overflow.
How to use it:
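The usage listing is also missing from this copy. The sketch below follows the pattern of the main() in the question's follow-up 1: read the counter before and after the code of interest and subtract. The function-pointer parameters and the names measure_cycles, fake_counter and busy_work are assumptions introduced so the pattern can be tried without the ARM register access; on the target you would pass a wrapper around get_cyclecount().

```c
#include <stdint.h>

typedef uint32_t (*counter_fn)(void);    /* e.g. a wrapper around get_cyclecount() */

/* Counts spent in work(), minus a previously measured read overhead. */
static uint32_t measure_cycles(counter_fn read_counter,
                               void (*work)(void),
                               uint32_t overhead)
{
    uint32_t t = read_counter();
    work();                               /* code under test */
    t = read_counter() - t;               /* unsigned, so a single wrap is fine */
    return t > overhead ? t - overhead : 0;
}

/* Hypothetical stand-ins for trying the pattern off-target:
 * each counter read costs 10 ticks, the work costs 1000 ticks. */
static uint32_t fake_ticks;
static uint32_t fake_counter(void) { fake_ticks += 10; return fake_ticks; }
static void busy_work(void)        { fake_ticks += 1000; }
```

With the stand-ins, measure_cycles(fake_counter, busy_work, 10) yields the 1000 ticks attributed to busy_work once the 10-tick read overhead is subtracted.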
This should work on all Cortex-A8 CPUs.
Oh - and some notes:
Using these counters you'll measure the exact time between the two calls to get_cyclecount(), including everything spent in other processes or in the kernel. There is no way to restrict the measurement to your process or a single thread.
Also, calling get_cyclecount() isn't free. It will compile to a single asm instruction, but moves from the co-processor will stall the entire ARM pipeline. The overhead is quite high and can skew your measurement. Fortunately the overhead is also fixed, so you can measure it and subtract it from your timings.
In my example I did that for every measurement. Don't do this in practice. An interrupt will sooner or later occur between the two calls and skew your measurements even further. I suggest that you measure the overhead a couple of times on an idle system, ignore all outliers and use a fixed constant instead.
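One way to follow that advice is to sample the cost of two back-to-back reads many times and keep the smallest value, which discards samples inflated by an interrupt. This is a minimal sketch under assumptions: estimate_overhead, counter_fn and noisy_counter are names invented here, and taking the minimum is just one possible way to ignore outliers, not necessarily what the answer's author had in mind.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*counter_fn)(void);    /* e.g. a wrapper around get_cyclecount() */

/* Smallest observed cost of two back-to-back counter reads.
 * Returns UINT32_MAX if samples is 0. */
static uint32_t estimate_overhead(counter_fn read_counter, size_t samples)
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < samples; i++) {
        uint32_t start = read_counter();
        uint32_t delta = read_counter() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Hypothetical noisy counter for off-target trials: a read normally
 * costs 7 ticks, but every 8th read is "interrupted" and costs 500. */
static uint32_t noisy_ticks, noisy_calls;
static uint32_t noisy_counter(void)
{
    noisy_calls++;
    noisy_ticks += (noisy_calls % 8 == 0) ? 500 : 7;
    return noisy_ticks;
}
```

The result would be measured once on an idle system and then hard-coded as the constant to subtract, rather than recomputed around every measurement.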
You need to profile your code with performance analysis tools before and after your optimizations.
Acct is a command-line tool and a function that you can use to monitor your resources. You can search for more on how to use it and how to view the data file that acct generates.
I will update this post with other open-source performance analysis tools.
Gprof is another such tool. Please check its documentation.
To expand on the answer by Nils now that a couple of years have elapsed: an easy way to access these counters is to build the kernel with gator. It then reports counter values for use with Streamline, which is ARM's performance analysis tool.
It will display each function on a timeline (giving you a high-level overview of how your system is performing), showing you exactly how long it took to execute, along with the percentage of CPU it used. You can compare this with charts of each counter that you've set it up to collect, and follow CPU-intensive tasks down to source code level.
Streamline works with all the Cortex-A series processors.
I've worked with a toolchain for ARM7 that had an instruction-level simulator. Running apps in it could give timings for individual lines and/or asm instructions. That was great for micro-optimizing a given routine. That approach probably isn't appropriate for whole-app or whole-system optimization, though.