ARM Cortex-A8: How many bytes are read in one memory fetch?
I'm trying to speed up an image processing project running on an ARM Cortex-A8 processor.
I read 8-bit grayscale image data from memory. Right now my function accesses individual pixel values byte by byte.
I thought NEON could improve this by loading 128/8 = 16 bytes from memory in one shot and then using them in my function. But after running the changed version, I see that it actually takes MORE time than byte-by-byte access. I suspect that the NEON fetch is becoming a bottleneck and takes more time than the computation itself.
What is the data bus size of the ARM Cortex-A8? How many bytes are accessed in one memory fetch?
Comments (2)
From the Cortex-A8 TRM:
"You can configure the processor to connect to either a 64-bit or 128-bit AXI interconnect that provides flexibility to system designs"
Is NEON necessary? Perhaps you are comparing apples to oranges. Instead of ldrb/strb you can use ldrd/strd or ldm/stm to get 64-bit transfers. The ARM/AXI can be smart enough to look ahead and group smaller transfers into larger ones, say two 32-bit transfers into one 64-bit transfer, but I would not rely on that. I only mention it in case you find that changing to ldr/str or ldrd/strd doesn't give you any performance gain.
Did you isolate the read or write loop (no data processing) and try bytes vs. words vs. double words? It may be that the code to extract bytes from words outweighs the savings on the bus.
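One way to run that isolation test is sketched below in portable C. The function names (`sum_bytes`, `sum_words`, `sum_dwords`, `time_loop`) are made up for this illustration and are not from the original post. The loops only sum the pixels, so the measurement is dominated by the loads plus the shift/mask extraction that the wider variants need. Whether the compiler actually emits ldrb, ldr, or ldrd/ldm for each loop is an assumption you should verify in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Byte loop: one narrow load per pixel. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Word loop: one 32-bit load per four pixels, then shift/mask extraction. */
uint32_t sum_words(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w); /* alignment-safe way to express a 32-bit load */
        sum += (w & 0xFF) + ((w >> 8) & 0xFF) + ((w >> 16) & 0xFF) + (w >> 24);
    }
    for (; i < n; i++) /* leftover pixels */
        sum += p[i];
    return sum;
}

/* Double-word loop: one 64-bit load per eight pixels. */
uint32_t sum_dwords(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t d;
        memcpy(&d, p + i, sizeof d);
        for (int b = 0; b < 8; b++)
            sum += (uint32_t)((d >> (8 * b)) & 0xFF);
    }
    for (; i < n; i++) /* leftover pixels */
        sum += p[i];
    return sum;
}

/* Runs one variant `reps` times and returns the elapsed CPU time in seconds.
   The accumulated result is written to *out so the work is not optimized away. */
double time_loop(uint32_t (*fn)(const uint8_t *, size_t),
                 const uint8_t *p, size_t n, int reps, uint32_t *out)
{
    uint32_t total = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        total += fn(p, n);
    clock_t t1 = clock();
    *out = total;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `time_loop` once per variant on the same image buffer with the same optimization flags, and use enough repetitions for `clock()` to resolve the difference. Note that an optimizing compiler may auto-vectorize the byte loop, which would hide the very difference you are trying to measure.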
What type of memory is this? Is it on chip or off chip, and how fast is it relative to the AXI (ARM) clock?
Do you have the data cache enabled for this region? If so, it may be a moot point: the first byte read will trigger a cache line fill using an optimal data bus size, and subsequent reads within that cache line will not reach the AXI bus, much less the target memory. Likewise, writes should only go as far as the cache and go out to the target later in a wider, bus-optimized size. It depends on how the cache/write buffer is configured.
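To put a number on the cache argument, here is a small sketch that counts how many cache lines a sequential read touches. It assumes a 64-byte L1 data cache line (check the TRM for your part) and a cold cache; `lines_touched` and `CACHE_LINE_BYTES` are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache line; adjust to match your part's TRM. */
#define CACHE_LINE_BYTES 64u

/* Number of distinct cache lines covered by a sequential read of `len`
   bytes starting at address `addr`. With the data cache enabled and cold,
   this is roughly the number of line fills that reach the bus, no matter
   whether the code reads bytes, words, or 16-byte NEON vectors. */
size_t lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + len - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

For a line-aligned 640x480 8-bit image, `lines_touched(0, 640 * 480)` gives 4800 line fills. Under these assumptions the bus traffic is the same for every access width, so any speed difference would have to come from instruction count and pipeline behavior, not from the bus.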
It could be that you are experiencing pipeline stalls. If you read the data through NEON, there is some latency before you can use that data in the ARM core.
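If that is the cause, one common way around it is to keep the whole computation in NEON registers and move a result to the ARM core only once, after the loop. The sketch below shows this for a simple pixel sum, which stands in for the asker's actual processing (not shown in the post). The function name `sum_pixels` is made up, and the scalar path is only a fallback so the code also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sums n 8-bit pixels. On NEON targets, the loads and the accumulation
   stay in NEON registers for the whole loop. */
uint32_t sum_pixels(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t px = vld1q_u8(p + i);   /* 16 pixels in one load */
        uint16x8_t pairs = vpaddlq_u8(px); /* pairwise add, widened to 8 x u16 */
        acc = vpadalq_u16(acc, pairs);     /* pairwise add, accumulated into 4 x u32 */
    }
    /* The only NEON-to-ARM transfers happen here, outside the loop. */
    sum = vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
        + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
#endif

    for (; i < n; i++) /* tail pixels, or the whole buffer without NEON */
        sum += p[i];
    return sum;
}
```

The point of the design is the placement of `vgetq_lane_u32`: a version that loads 16 bytes with NEON and then pulls each pixel into an ARM register inside the loop pays the transfer latency on every iteration, which could explain a NEON version running slower than plain byte access.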