Access cost of dynamically created objects with dynamically allocated members
I'm building an application which will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below:
class A {
int a;
int b;
int* v;
};
where:
- The memory for v will be allocated in the constructor.
- v will be allocated once when an object of type A is created and will never need to be resized.
- The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and will mostly need to stream a large number of them through the CPU, but only needs to perform very simple computations on the member variables.
- Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
- What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
- If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
- Or are there any techniques to aid memory access, such as a pre-fetching scheme? For example, fetch an object of type A and operate on the other member variables whilst pre-fetching v.
- If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.
3 Answers
If you have a good reason to care about performance...
If they are both allocated with 'new', then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome; it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".
If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.
Allocate space for both, then placement-new them into that space. It's dirty, but it should typically work:
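A minimal sketch of that approach: over-allocate one block, placement-new the A at the front, and point v just past it. The `create`/`destroy` helper names are my own invention, not part of the answer:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

class A {
public:
    int a;
    int b;
    int* v;

    // Allocate one block big enough for the A header plus n ints,
    // placement-new the A at the front, and aim v just past it, so
    // the header and the array occupy one contiguous region.
    static A* create(std::size_t n) {
        void* block = std::malloc(sizeof(A) + n * sizeof(int));
        A* obj = new (block) A();
        obj->v = reinterpret_cast<int*>(obj + 1);  // array begins right after the header
        return obj;
    }

    // Destroy and free in one step; the array goes with the block.
    static void destroy(A* obj) {
        obj->~A();
        std::free(obj);
    }
};
```

One `malloc` and one `free` per object, and following the pointer from the header to `v` stays within the object's own cache lines for small n.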
Prefetching is compiler- and platform-specific, but many compilers have intrinsics available to do it. Mind you, it won't help a lot if you're going to try to access that data right away; for prefetching to be of any value, you often need to issue it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like __pf(my_a->v); (in practice, _mm_prefetch on x86, or GCC's __builtin_prefetch).
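A sketch of the pattern the question asks about (prefetch the next object's v while working on the current one), using the real x86 intrinsic `_mm_prefetch`. The `len` member and the prefetch distance `DIST` are my own assumptions:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch: provided by both GCC and MSVC on x86

struct A {
    int a;
    int b;
    int* v;
    std::size_t len;  // assumption: element count stored alongside v
};

// Sum a + b + all of v for every object, prefetching the v array that
// will be needed DIST iterations from now. DIST is a tuning knob; the
// useful value depends on how long each iteration takes.
long long process_all(const A* objs, std::size_t count) {
    constexpr std::size_t DIST = 8;
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + DIST < count)
            _mm_prefetch(reinterpret_cast<const char*>(objs[i + DIST].v),
                         _MM_HINT_T0);
        total += objs[i].a + objs[i].b;
        for (std::size_t j = 0; j < objs[i].len; ++j)
            total += objs[i].v[j];
    }
    return total;
}
```

The prefetch is a hint, not a load: it cannot fault, and if the data is already cached it costs almost nothing, which is why issuing it speculatively ahead of time is safe.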
Maybe. If the fixed size buffer is usually close to the size you'll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity for more objects to fit into the cache. I.e. it's better to have more smaller objects in the cache than it is to have a lot of unused data filling the cache up.
The specifics depend on your design and performance goals. For an interesting discussion about this, with a "real-world" problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (that's a Google Docs link for a PDF; the PDF itself can be found here).
Yes, that is likely.
Cachegrind (Valgrind's cache profiler) and Shark.
Yes, you could allocate them together, but you should probably see if it's an issue first. You could use arena allocation, for example, or write your own allocators.
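A minimal bump-pointer arena along the lines this answer suggests (a sketch, not production code: no growth policy and no per-object free; the whole arena is released at once):

```cpp
#include <cstddef>
#include <cstdlib>

// All objects and their v arrays are carved from one big block, so
// objects allocated together stay together in memory.
class Arena {
    char* base_;
    std::size_t used_ = 0;
    std::size_t cap_;
public:
    explicit Arena(std::size_t cap)
        : base_(static_cast<char*>(std::malloc(cap))), cap_(cap) {}
    ~Arena() { std::free(base_); }

    // align must be a power of two.
    void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up
        if (p + n > cap_) return nullptr;  // sketch: no growth policy
        used_ = p + n;
        return base_ + p;
    }
};

struct A { int a; int b; int* v; };

// Build an A whose v comes from the same arena, immediately after it.
A* make_a(Arena& arena, std::size_t vlen) {
    A* obj = static_cast<A*>(arena.alloc(sizeof(A), alignof(A)));
    obj->v = static_cast<int*>(arena.alloc(vlen * sizeof(int), alignof(int)));
    return obj;
}
```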
Yes, you could do this. The best thing to do would be to allocate regions of memory used together near each other.
It might or might not. It would at least make v local to the rest of the struct members.
If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why are we doing all this memory allocation?
Could you just have one copy of the structure, and one (big) buffer of v, read your data into it (in binary, for speed), do your very little calculation, and move on to the next one?
The program should spend almost 100% of its time in I/O.
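That streaming loop could look like the following sketch, assuming a hypothetical binary record format (`a`, `b`, a length, then that many ints); the buffer only ever grows, so allocation happens at most a handful of times over the whole run:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One record struct and one reusable buffer are recycled for every
// record, so there is no per-record allocation at all.
struct Record {
    std::int32_t a = 0, b = 0;
    std::vector<std::int32_t> v;  // grown once to the largest record, then reused
};

// Reads the next record into r; returns false at end of file.
bool read_record(std::FILE* f, Record& r) {
    std::int32_t header[3];  // a, b, len
    if (std::fread(header, sizeof header, 1, f) != 1) return false;
    r.a = header[0];
    r.b = header[1];
    std::size_t len = static_cast<std::size_t>(header[2]);
    if (r.v.size() < len) r.v.resize(len);  // only ever grows
    return std::fread(r.v.data(), sizeof(std::int32_t), len, f) == len;
}
```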
If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.