Low memory throughput in embedded Linux (ARM)
I am using an ARM926EJ-S. In a memory copy test I get about 20% more memory throughput without Linux (running the code as a bare-metal "getting started" executable). Under Linux the same code runs about 20% slower.
The code is:
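    /// Burst-mode memory copy test: copies 8 words (32 bytes) per iteration.
    /// a = source, b = destination, iSize = number of 8-word blocks (must be > 0).
    void asmcpy(void *a, void *b, int iSize)
    {
        do {
            asm volatile (
                "ldmia %0!, {r3-r10} \n\t"   /* load 8 words from a, advance a */
                "stmia %1!, {r3-r10} \n\t"   /* store them to b, advance b */
                : "+r"(a), "+r"(b)
                :
                : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "memory"
            );
        } while (--iSize);
    }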
I verified that no other process is taking CPU time on Linux. (I checked this with the time command: it shows that the real time is the same as the user time.)
What could be causing this under Linux?
Thanks & Regards.
ADDED:
My test code is:
    int main()
    {
        int a[320 * 120], b[320 * 120];

        for (int i = 0; i != 10000; i++) {
            /// Size is divided by 8 because asmcpy copies 8 integers per iteration.
            asmcpy(a, b, (320 * 120) / 8);
        }
        return 0;
    }
The getting started executable is a .bin file that is sent to RAM over the serial port and executed directly by jumping to its address in RAM (no OS needed).
ADDED.
I haven't seen such a performance difference on other processors. They were using SDRAM; this processor is using DDR RAM. Could that be the reason?
ADDED.
The data cache is not enabled in the getting started code, but it is enabled under Linux. So ideally all the data should be cached and accessed without any RAM latency, yet Linux is still 20% slower.
ADDED:
My microcontroller is the LPC3250. Both tests were run on the same external DDR RAM.
Comments (4)
This chip has an MMU, so Linux is likely using it to manage memory. Maybe just enabling it introduces some performance hit. Also, Linux uses a lazy memory allocation strategy: it only assigns a memory page to a process when the process first touches it. If you're copying a big chunk of memory, the MMU will generate page faults inside your loop, each one asking the kernel to allocate a page. On a low-end processor, all these switches into the kernel and back disturb the caches and introduce a noticeable slowdown.
If your system is small enough, try an MMU-less version of Linux (like uClinux). Maybe it would let you use a cheaper chip with similar performance. On embedded systems, every penny counts.
Update: some extra details.
Every Linux process gets its own memory mappings. At first these include only the kernel and (maybe) the executable code. All the rest of the linear 4 GB address space (on 32-bit) seems available, but no RAM pages are assigned to it. As soon as you read or write an unallocated memory address, the MMU signals a page fault and switches to the kernel. The kernel sees that it still has lots of free RAM pages, so it picks one, assigns it to the faulting address, and returns to your code, which finishes the interrupted instruction. The very next access won't fault because the whole page (typically 4 KB) is already assigned; but a few iterations later the loop will hit another unassigned page, and the MMU will invoke the kernel again.
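One rough way to check whether this lazy allocation matters for a given benchmark is to count the minor page faults taken while touching a buffer for the first time and again on a second pass. The sketch below assumes a Linux system with `getrusage` and anonymous `mmap`; the helper names (`minor_faults`, `faults_while_touching`, `fresh_buffer`) are made up for illustration and are not from the original post.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Minor page faults taken by this process so far (-1 on error). */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Hypothetical helper: writes every byte of buf and returns how many
 * minor faults were taken while doing so. */
static long faults_while_touching(void *buf, size_t len)
{
    long before = minor_faults();
    memset(buf, 0xA5, len);
    return minor_faults() - before;
}

/* Hypothetical helper: maps fresh anonymous memory, which normally has
 * no RAM pages behind it until it is first touched. Returns NULL on failure. */
static void *fresh_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Calling `faults_while_touching` twice on the same `fresh_buffer` result would typically show most of the faults on the first pass and few or none on the second. If that is what you see, touching both arrays once before starting the timed loop should take the allocation cost out of the measurement.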
How are you performing the timing? There is no timing code in your example.
Are you sure that you are not measuring process load/unload time?
Is the processor clock speed the same in both cases?
If using external SDRAM, are the RAM timings the same in both cases?
Is the data cache enabled in both cases?
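To rule out process load/unload time on the Linux side, the timing can be taken inside the program, around the copy loop only. This is a minimal sketch, assuming a POSIX system with `clock_gettime`; it uses plain `memcpy` as a stand-in for the copy routine, and `now_seconds` and `copy_throughput_mb_s` are hypothetical names, not code from the question.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Current time in seconds on a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: copies `bytes` bytes from src to dst `passes` times
 * and returns the throughput in MB/s (10^6 bytes per second).
 * Only the copy loop is timed, not program start-up or exit. */
static double copy_throughput_mb_s(void *dst, const void *src,
                                   size_t bytes, int passes)
{
    /* Untimed warm-up pass so first-touch costs are not measured. */
    memcpy(dst, src, bytes);

    double start = now_seconds();
    for (int i = 0; i < passes; i++) {
        memcpy(dst, src, bytes);
        /* Compiler barrier so the repeated copies are not merged or dropped. */
        __asm__ volatile ("" : : "r"(dst) : "memory");
    }
    double elapsed = now_seconds() - start;

    return elapsed > 0.0 ? ((double)bytes * passes) / (elapsed * 1e6) : 0.0;
}
```

On the ARM target the `memcpy` calls could be swapped for the routine under test. The bare-metal build would need an equivalent timer (for example a hardware timer register) so that both numbers cover the same work.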
Clifford
The getting started program is not "just an executable". There must be some code that sets up the DDR controller registers.
If the cache is enabled, then the MMU must be enabled too. I think on the ARM926EJ-S you can't have a data cache without the MMU.
I believe every context switch results in a cache flush, because the cache is virtually indexed and virtually tagged (VIVT), and the kernel and user space don't share the same address space. So you probably get a lot more unwanted cache flushes under the OS than without one.
Here is a paper covering some aspects of the cost of VIVT cache flushes when running Linux.
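Whether context switches are frequent enough to matter can be checked by reading the process's context-switch counters before and after the copy loop. The following is only a sketch, assuming Linux's `getrusage` fills in `ru_nvcsw` and `ru_nivcsw`; the struct and function names are invented for this example.

```c
#include <sys/resource.h>

/* Voluntary and involuntary context switches of the calling process. */
struct ctx_switches {
    long voluntary;
    long involuntary;
};

/* Snapshot of the counters right now (zeros if getrusage fails). */
static struct ctx_switches ctx_switches_now(void)
{
    struct rusage ru;
    struct ctx_switches c = {0, 0};
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        c.voluntary = ru.ru_nvcsw;
        c.involuntary = ru.ru_nivcsw;
    }
    return c;
}

/* Hypothetical helper: context switches that happened since `before`. */
static struct ctx_switches ctx_switches_since(struct ctx_switches before)
{
    struct ctx_switches now = ctx_switches_now();
    struct ctx_switches delta = {
        now.voluntary - before.voluntary,
        now.involuntary - before.involuntary
    };
    return delta;
}
```

Take a snapshot with `ctx_switches_now()` before the loop and pass it to `ctx_switches_since()` afterwards. If both deltas stay near zero over the run, flushes caused by context switches are unlikely to explain a 20% gap.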
What microcontroller (not just what ARM CPU) are you using?
Is it possible that in the non-Linux run the array you're testing is RAM on the microcontroller device itself while in the Linux test the array being tested is in external RAM? Internal RAM is usually accessed much faster than external RAM - this might account for the Linux test being slower, even if data caching is enabled only for the Linux run.