Faster way to zero memory than with memset?
I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assumed that memset uses mov, but that when zeroing memory most compilers use xor because it's faster, correct? edit1: Wrong, as GregS pointed out, that only works with registers. What was I thinking?

I also asked a person who knows assembler better than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood correctly.
edit2: I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

#define SIZE 0x1000000

void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}

/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;

    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}

int main()
{
    char* buffer = malloc(SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );
    free(buffer);
    return 0;
}
Results:

zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest: at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). I also tried optimizing zero_sizet further, but the compiler always outdid me, no surprise there.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)

edit3: fixed a bug in the test code; the test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat that.
9 Answers
x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with rep stosd can blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.

For chips with MMX, an assembly loop with movq can hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use single-byte stosb stores until aligned, and then complete your clear with a loop of movaps.

Win32 has ZeroMemory(), but I forget whether that's a macro for memset or an actual 'good' implementation.
memset is generally designed to be very, very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do the work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of its native properties. It might also have internal special cases for zeroing (versus setting some other value).

That said, if you have very specific, very performance-critical memory zeroing to do, it's certainly possible to beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)
Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).

Then also, avoid memset if you don't have to:

- use ... = { 0 } for stack memory
- for really large chunks, use mmap if you have it. This gets zero-initialized memory from the system "for free".
If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (the specs said we needed to zero almost all the memory on power-up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.

For what it is worth, I hope it helps.
Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose and should have decent performance in general. Also, compilers may have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, avoiding push/call/ret overhead. And further optimizations are possible when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where the size will always be tiny *or* huge), you can gain speed boosts by dropping down to the assembly level. For instance, using write-through operations to zero huge chunks of memory without polluting your L2 cache.

But it all depends, and for normal stuff, please stick to memset/memcpy :)
The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization is to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.

Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections equals your number of cores/hardware threads).

Most importantly, there's no way to tell whether any of this helps unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard memset as well (their implementation might be more efficient than your compiler's).
That's an interesting question. I made an implementation that is just slightly faster (but hardly measurable) when compiling a 32-bit release build on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.

Output is as follows when release compiling for 32-bit systems:

Output is as follows when release compiling for 64-bit systems:

Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.
There is one fatal flaw in this otherwise great and helpful test: since memset is the first timed call, there seems to be some "memory overhead" or so which makes it extremely slow. Moving the timing of memset to second place and something else to first place, or simply timing memset twice, makes memset the fastest with all compile switches!
memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.

What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you initialize a large number of small objects.

Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you are really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march.