Faster way to zero memory than with memset?
I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assumed that memset uses mov, but that when zeroing memory most compilers use xor because it's faster, correct? edit1: Wrong, as GregS pointed out, that only works with registers. What was I thinking?

I also asked a person who knows assembler better than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood correctly.
edit2: I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

#define SIZE 0x1000000

void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}

/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;

    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}

int main()
{
    char* buffer = malloc(SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );
    free(buffer);
    return 0;
}
Results:

zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest: at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). I also tried optimizing zero_sizet further, but the compiler always outdid me, no surprise there.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)

edit3: fixed a bug in the test code; the test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat that.
9 Answers
x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with rep stosd can blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.

For chips with MMX, an assembly loop with movq can hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use single-byte stosb stores until aligned, and then complete your clear with a loop of movaps.

Win32 has ZeroMemory(), but I forget whether that's a macro for memset or an actual 'good' implementation.
memset is generally designed to be very, very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do the work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of its native properties. It might also have internal special cases for zeroing (versus setting some other value).

That said, if you have very specific, very performance-critical memory zeroing to do, it's certainly possible to beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)
Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).

Then also, avoid memset if you don't have to:

- use ... = { 0 } for stack memory
- for really large chunks, use mmap if you have it. This gets zero-initialized memory from the system "for free".
If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (the specs said we needed to zero almost all the memory on power-up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.

For what it is worth, I hope it helps.
Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose and should have decent performance in general. Also, compilers may have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, avoiding push/call/ret overhead. And further optimizations are possible when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where the size will always be tiny *or* huge), you can gain speed boosts by dropping down to the assembly level. For instance, using write-through operations to zero huge chunks of memory without polluting your L2 cache.

But it all depends, and for normal stuff, please stick to memset/memcpy :)
The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization is to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.

Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections equals your number of cores/hardware threads).

Most importantly, there's no way to tell whether any of this helps unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard memset as well (their implementation might be more efficient than your compiler's).
That's an interesting question. I made an implementation that is just slightly faster (but hardly measurable) when compiling a 32-bit release build on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.

Output is as follows when release compiling for 32-bit systems:

Output is as follows when release compiling for 64-bit systems:

Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.
There is one fatal flaw in this otherwise great and helpful test: since memset is the first timed call, there seems to be some "memory overhead" or so which makes it extremely slow. Moving the timing of memset to second place and something else to first place, or simply timing memset twice, makes memset the fastest with all compile switches!
memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.

What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you initialize a large number of small objects.

Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you are really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march.