Optimized memcpy in GCC for small or fixed-size data
I use memcpy to copy both variable-sized and fixed-size data. In some cases I copy small amounts of memory (only a handful of bytes). I recall that in GCC memcpy used to be an intrinsic/builtin. However, profiling my code with valgrind shows thousands of calls to the actual "memcpy" function in glibc.
What conditions have to be met to use the builtin function? I can roll my own memcpy quickly, but I'm sure the builtin is more efficient than what I can do.
NOTE: In most cases the amount of data to be copied is available as a compile-time constant.
CXXFLAGS: -O3 -DNDEBUG
Below is the code I'm using now, which forces the builtin; if you take off the __builtin_ prefix, the builtin is not used. It is called from various other templates/functions with T=sizeof(type). The sizes that get used are 1, 2, multiples of 4, a few in the 50-100 byte range, and some larger structures.
    template<int T>
    inline void load_binary_fixm(void *address)
    {
        if( (at + T) > len )
            stream_error();
        __builtin_memcpy( address, data + at, T );
        at += T;
    }
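For reference, a self-contained version of the same pattern might look like the sketch below. BinaryReader, load_fixed and load are hypothetical names standing in for the surrounding class, which is not shown above, and the error is reported with an exception rather than stream_error(). Because N is a compile-time constant, GCC can normally expand std::memcpy inline when optimizing, unless builtins are disabled (for example with -fno-builtin); whether it actually does so depends on the compiler version, target and size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for the stream class the snippet above belongs to.
class BinaryReader {
public:
    BinaryReader(const unsigned char *data, std::size_t len)
        : data_(data), len_(len), at_(0) {
        assert(data != nullptr || len == 0);
    }

    // N is a compile-time constant, so the compiler is free to expand the
    // copy inline (a few moves) instead of calling the library memcpy.
    template <std::size_t N>
    void load_fixed(void *address) {
        // at_ <= len_ always holds, so this subtraction cannot wrap around.
        if (N > len_ - at_)
            throw std::out_of_range("read past end of stream");
        std::memcpy(address, data_ + at_, N);
        at_ += N;
    }

    // Convenience wrapper: the size is taken from the destination type.
    template <typename T>
    T load() {
        static_assert(std::is_trivially_copyable<T>::value,
                      "T must be safe to fill with memcpy");
        T value;
        load_fixed<sizeof(T)>(&value);
        return value;
    }

    std::size_t position() const { return at_; }

private:
    const unsigned char *data_;
    std::size_t len_;
    std::size_t at_;
};
```

Compiling with -S and searching the assembly for a call to memcpy is one way to check what the compiler did for a particular size.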
Comments (1)
For the cases where T is small, I'd specialise and use a native assignment.
For example, where T is 1, just assign a single char.
If you know the addresses are aligned, use an appropriately sized int type for your platform.
If the addresses are not aligned, you might be better off doing the appropriate number of char assignments.
The point of this is to avoid a branch and the need to keep a counter.
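A minimal sketch of that specialisation, assuming the size is a template parameter as in the question: FixedCopy and copy_fixed are hypothetical names, sizes 1 and 2 use plain char assignments (safe for unaligned addresses), and everything else falls back to memcpy. The aligned-int variant is left out because casting to an int pointer is only valid when alignment and aliasing rules are actually satisfied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic case: defer to memcpy with a compile-time size.
template <std::size_t N>
struct FixedCopy {
    static void run(void *dst, const void *src) { std::memcpy(dst, src, N); }
};

// N == 1: a single char assignment, no loop and no counter.
template <>
struct FixedCopy<1> {
    static void run(void *dst, const void *src) {
        *static_cast<unsigned char *>(dst) =
            *static_cast<const unsigned char *>(src);
    }
};

// N == 2: two char assignments, which also works for unaligned addresses.
template <>
struct FixedCopy<2> {
    static void run(void *dst, const void *src) {
        unsigned char *d = static_cast<unsigned char *>(dst);
        const unsigned char *s = static_cast<const unsigned char *>(src);
        d[0] = s[0];
        d[1] = s[1];
    }
};

// Entry point: the specialisation is selected at compile time from N.
template <std::size_t N>
inline void copy_fixed(void *dst, const void *src) {
    assert(dst != nullptr && src != nullptr);
    FixedCopy<N>::run(dst, src);
}
```

A call such as copy_fixed<sizeof(value)>(&value, data + at) would then pick the matching specialisation without any runtime branch on the size.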
Where T is big, I'd be surprised if you did better than the library memcpy(), and the function call overhead is probably going to be lost in the noise. If you do want to optimise, look at the existing memcpy() implementations. There are variants that use extended instructions, etc.
Update:
Looking at your actual(!) question about inlining memcpy, details like compiler version and platform become relevant. Out of curiosity, have you tried using std::copy, something like this:
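The snippet that originally followed is not included in this copy of the thread. As a stand-in, here is a minimal sketch of a fixed-size copy through std::copy; copy_fixed_std is a hypothetical name and this is not the answerer's own code. It assumes the source and destination do not overlap in a way that std::copy forbids (the destination must not start inside the source range).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Copies exactly N bytes from src to dst using std::copy over byte pointers.
// With N known at compile time, the compiler may lower this to the same
// code it would emit for a fixed-size memcpy.
template <std::size_t N>
inline void copy_fixed_std(void *dst, const void *src) {
    assert(dst != nullptr && src != nullptr);
    const unsigned char *first = static_cast<const unsigned char *>(src);
    std::copy(first, first + N, static_cast<unsigned char *>(dst));
}
```

Whether this ends up inlined or as a call into the library still depends on the compiler version and flags, so it is worth comparing the generated code against the memcpy version.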