Want to use ASM for a fast 8-byte-aligned array copy instead of memmove

Posted on 2024-12-11 21:32:21

I've got an array of structs whose size is always a multiple of 8 bytes. I need to move the data around in big chunks within the array itself, so I've been using memmove(). It works, but it's very slow. I think the compiler is not optimizing the function to copy 4 or 8 bytes at a time, hence the delay.

What I would rather do is force the copy by using int32_t or int64_t variables. That way, I can have the copy move 4 or 8 bytes at a time, speeding things up. This will work fine since my structs are always sized to a multiple of 8 bytes.

I can't figure out a way to force this in C. I tried to do it with inline assembly, but I don't know how to point the operands at specific array elements. For example, if my ASM statement copies 4 bytes at a time, I need to advance through the array by 4 bytes. I don't know how to do that. Here's what I'm thinking:

//here's our 2048 byte struct
typedef struct {
    char filename[1024];
    char description[1024];
} RECORD;

//total number of rows, or elements
int row_count = 0;

//create initial record
RECORD *record = (RECORD*)malloc(sizeof(RECORD));

//insert some stuff
strcpy(record->filename,"filename.txt");
strcpy(record->description,"Description of file");

//increment our row count
row_count++;

//now let's add a row
record = (RECORD*)realloc(record,sizeof(RECORD)*(row_count+1));

//duplicate first record
//copy first 4 bytes from "record" to the newly appended row
//obviously this would be a loop copying 4 bytes at a time
//up to the size of the row, which is 2048 bytes.
__asm__("movl (%1), %%eax; \n\t"
    "movl %%eax, (%0); \n\t"
    : "=r"(record+row_count)    //output
    :  "r"(record+0)            //input
    : "%eax" );                 //list of registers used

//Doesn't work. :-(

北斗星光 2024-12-18 21:32:21

As @Vlad pointed out, memmove and memcpy are generally highly optimized; these days they are usually implemented with SIMD for big blocks. This means you should really profile your code before spending time optimizing what you think are the bottlenecks.
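To make the profiling advice concrete, here is a minimal timing sketch. The helper name `seconds_per_memmove` is made up for illustration, and `clock()` only gives coarse CPU time, so treat the numbers as a rough guide rather than a proper benchmark. The empty asm statement is a GCC/Clang-specific compiler barrier and is skipped on other compilers.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original post): returns the average
 * CPU seconds taken by one memmove of `bytes` bytes, measured over `reps` runs. */
static double seconds_per_memmove(void *dst, const void *src, size_t bytes, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        memmove(dst, src, bytes);
#if defined(__GNUC__)
        /* Compiler barrier so repeated copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
#endif
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Timing the real workload with the real block sizes (here, multiples of the 2048-byte RECORD) tells you whether memmove is actually the slow part before you replace it.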

On to your actual question:
You don't have any looping in your copy. However, rather than writing the loop by hand, it's better to use something such as REP MOVSD, which copies 4 bytes at a time, or REP MOVSQ on x64 for 8 bytes at a time. And since your data is 8-byte aligned, you can even use MMX to do the copies via MOVQ, which moves 64 bits at a time.
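As a rough sketch of what REP MOVSQ could look like from C, assuming GCC or Clang extended asm on x86-64: the function name `copy_qwords` is hypothetical, and other targets fall back to memcpy. The `"+D"`, `"+S"` and `"+c"` constraints pin the destination, source and count to RDI, RSI and RCX, which is how the instruction finds and advances its operands, so there is no manual pointer arithmetic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): copies `count`
 * 8-byte words from src to dst, front to back. */
static void copy_qwords(void *dst, const void *src, size_t count)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* REP MOVSQ copies RCX quadwords from [RSI] to [RDI] and advances both
     * pointers. The x86-64 calling conventions keep the direction flag
     * clear, so the copy runs forward. The "memory" clobber tells the
     * compiler that memory not named in the operands is read and written. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    /* Portable fallback for other compilers and architectures. */
    memcpy(dst, src, count * sizeof(uint64_t));
#endif
}
```

For the 2048-byte RECORD in the question, duplicating row 0 into the new row would be `copy_qwords(record + row_count, record, sizeof(RECORD) / 8)`.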

This becomes a little more complex when there is overlap and other funny corner cases, but from the sound of it you don't have or need that. So in fact the best approach might be the most naive one (this just copies, which will speed things up if you don't need the other semantics of memmove):

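#include <stddef.h>
#include <stdint.h>

// Copies nElements 64-bit words from pSrc to pDst, front to back.
// Both pointers must be 8-byte aligned. Unlike memmove, this does not
// handle overlapping ranges where pDst lies above pSrc.
void MyMemCopy(const void* pSrc, void* pDst, size_t nElements)
{
    const int64_t* s = (const int64_t*)pSrc;
    int64_t* d = (int64_t*)pDst;
    while (nElements--)
        *d++ = *s++;
}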

Now the compiler is free to optimize this in the best way possible, be it inlining, unrolling, etc., and you don't have the portability issues of ASM.
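One caveat: the question mentions moving data within the array itself, so the source and destination ranges may overlap after all. In that case a front-to-back loop can overwrite words before they are read. A minimal sketch of an overlap-aware variant follows; `move_words` is a hypothetical name, and both pointers are assumed to be 8-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical overlap-safe variant (not from the original post):
 * moves `count` 64-bit words from src to dst, choosing the copy
 * direction so that no source word is overwritten before it is read. */
static void move_words(void *dst, const void *src, size_t count)
{
    uint64_t *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is lower in memory: copy front to back. */
        for (size_t i = 0; i < count; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination is higher in memory: copy back to front. */
        for (size_t i = count; i-- > 0; )
            d[i] = s[i];
    }
    /* If dst == src there is nothing to do. */
}
```

This is essentially what memmove already does, which is another reason to measure first before swapping it out.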
