Array size optimization
Is there any advantage to defining an array's size as a multiple of 8 on a 64-bit UNIX OS? I intend to use this array to load data from shared memory, so there may be dependencies on the operating system and the page size.
Comments (3)
It doesn't matter. Your compiler knows whether or not it wants padding there, so let it decide. Don't muddy your code because of guesswork.
Get your program working first, then worry about performance with the help of a profiler.
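To illustrate "let the compiler decide", here is a minimal sketch in C11. The `record` struct, `TAG_LEN`, and `padding_before_value` are hypothetical names made up for this example; the point is only that an array member with an "awkward" size does not stop the compiler from aligning what follows it.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

enum { TAG_LEN = 13 };  /* deliberately not a multiple of 8 */

/* Hypothetical record: an odd-sized array followed by a double. */
struct record {
    char   tag[TAG_LEN];
    double value;       /* the compiler places this at an aligned offset */
};

/* The compiler inserts whatever padding is needed; these checks hold
 * without rounding TAG_LEN up by hand. */
static_assert(offsetof(struct record, value) % _Alignof(double) == 0,
              "value is aligned for double");
static_assert(sizeof(struct record) % _Alignof(double) == 0,
              "arrays of struct record keep value aligned");

/* Bytes of padding the compiler chose to insert between tag and value. */
static size_t padding_before_value(void)
{
    return offsetof(struct record, value) - TAG_LEN;
}
```

How many padding bytes `padding_before_value()` reports depends on the platform ABI, which is exactly why it is better left to the compiler.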
Assuming you're dynamically allocating the array on the heap, it's fair to assume that malloc's internal allocation algorithm will be doing some abstraction away from actual memory requests to the kernel. That is to say, there may or may not be a direct relationship between your malloc() call and libc's brk() (or mmap()) system call.
The malloc man page has some more on this.
So in terms of memory usage, I would suggest that it won't really matter whether or not you allocate in multiples of 8 bytes, since malloc will likely be doing something different (and sensible) beneath you.
In terms of program performance, the allocation of your data structures in memory can have a huge impact on cache performance. Ultimately, though, you will need to profile your application to see whether you could improve its cache performance. I don't believe there is a hard and fast rule which will let you optimise for this as you write your code.
If you're interested in learning more about memory and Linux, Ulrich Drepper wrote a fantastic series for LWN on the subject a few years ago:
http://lwn.net/Articles/250967
If you are concerned about memory access alignment: how dynamic allocations are aligned is a matter for the runtime environment and libc. Giving an array an aligned size does not guarantee that the array itself is aligned in any particular way. Many allocators return memory blocks aligned to some value (roughly 2x or 4x the machine word size), so the array size is not the place to worry about alignment.
I can think of only a few things that may matter:
You may want to process the array with vector operations and/or unrolled loops, so some padding may be needed to keep the program from running past the allocated area.
(But if your vector engine requires stricter alignment than the standard C implementation provides, you have to allocate the memory some other way than a plain malloc() anyway.)
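As a rough sketch of both points (padding for a vectorized loop, and alignment stricter than malloc() gives you), something like the following could work on a POSIX system. `padded_count` and `alloc_vector_floats` are hypothetical helper names, and the lane count and alignment you pass in are assumptions that depend on your vector hardware.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: round `count` up to a multiple of `lanes`, so a loop
 * that processes `lanes` elements per step never runs past the buffer. */
static size_t padded_count(size_t count, size_t lanes)
{
    assert(lanes != 0);
    return (count + lanes - 1) / lanes * lanes;
}

/* Hypothetical helper: allocate room for `count` floats, padded to a whole
 * number of `lanes`, at an address that is a multiple of `alignment`.
 * posix_memalign() requires `alignment` to be a power of two and a multiple
 * of sizeof(void *). Contents are uninitialized, as with malloc().
 * Returns NULL on failure; release the memory with free(). */
static float *alloc_vector_floats(size_t count, size_t lanes, size_t alignment)
{
    void *p = NULL;
    size_t n = padded_count(count, lanes);
    if (posix_memalign(&p, alignment, n * sizeof(float)) != 0)
        return NULL;
    return p;
}
```

For example, `alloc_vector_floats(10, 8, 32)` reserves space for 16 floats at a 32-byte-aligned address. Since the question is about shared memory: a mapping returned by mmap() is already page-aligned, so a helper like this mainly matters for heap buffers.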
Most memory allocators store bookkeeping information (e.g. the allocated block size) next to the allocated area, so the total memory taken from the free pool is slightly larger than what you asked for. So it may be best to allocate areas slightly smaller than some round value, so that they pack densely into a standard allocation unit (say, a memory page). For example, if the CPU has 4 KiB pages, a page may hold only 3 blocks of 1024 bytes but 4 blocks of 1016 bytes (= 1024 - 8).
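The arithmetic behind that example can be sketched as below. The 8-byte per-block header is an assumption taken from the "1024 - 8" figure above, and `blocks_per_page` is a hypothetical helper; real allocators also round sizes up for alignment and may use size classes, so treat this as a simplified model rather than a description of any particular malloc.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: every block costs its payload plus a fixed header,
 * and blocks never straddle a page. Returns how many blocks fit. */
static size_t blocks_per_page(size_t page_size, size_t payload, size_t header)
{
    assert(payload + header != 0);
    return page_size / (payload + header);
}
```

Under these assumptions, `blocks_per_page(4096, 1024, 8)` is 3 (each block really takes 1032 bytes), while `blocks_per_page(4096, 1016, 8)` is 4 (each block takes exactly 1024 bytes).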
Also, many memory allocators have a block size threshold: below it, memory is allocated from the heap; above it, memory is obtained directly from the OS virtual memory manager in whole hardware pages and is therefore aligned on a page boundary. In this case it may be worth rounding the allocation size up to a multiple of the page size so that you use whole pages.
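If you do want to round a size up to whole pages, a minimal sketch on a POSIX system might look like this. `round_up_to` and `round_up_to_page` are hypothetical helper names; sysconf(_SC_PAGESIZE) is the standard way to ask for the page size at run time instead of hard-coding 4096.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* sysconf, _SC_PAGESIZE */

/* Round `size` up to the next multiple of `multiple` (which must be > 0). */
static size_t round_up_to(size_t size, size_t multiple)
{
    assert(multiple != 0);
    size_t rem = size % multiple;
    return rem == 0 ? size : size + (multiple - rem);
}

/* Round `size` up to a whole number of pages on the running system. */
static size_t round_up_to_page(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return round_up_to(size, (size_t)page);
}
```

Whether a request of that size actually bypasses the heap depends on the allocator's threshold, which this sketch does not try to detect.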
There may be some other issues, but I don't remember them.