在 D 语言中比较两个内存块的最有效方法是什么？

发布于 2024-08-09 23:40:04 字数 1208 浏览 6 评论 0原文

我需要一个内存块的比较函数，用于在 D 编程语言中对字节数组进行二进制搜索。它不需要有任何有用的语义。它只需要速度快并且是一个有效的比较函数（产生总排序的函数）。已知要比较的内存块具有相同的长度。

C 的 memcmp 实际上相当慢，因为它试图保留有用的字符串比较语义，而我不需要。以下是迄今为止我想到的最好的。有谁知道更好的东西，最好不要深入研究不可移植的 CPU 特定指令？

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

编辑：我已经阅读了 SSE，但我不想使用它，原因有 3 个：

它不可移植。
它需要在 ASM 中编程。
比较指令假定您的数据是浮点型，如果某些数据恰好与 NaN 的模式匹配，则可能会出现问题。

原文

I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.

C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

Edit: I've read up on SSE and I don't want to use it for 3 reasons:

It's not portable.
It requires programming in ASM.
The comparison instructions assume your data is floating points, which might be problematic if some data happens to match the pattern for a NaN.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

末が日狂欢 2024-08-16 23:40:04

您可以尝试：

检查 uint 是否是适合您的目标 CPU 的最大类型（ulongs 可能更适合本机寄存器）
使用该类型的 2 个指针
使用 *p++ 使用 2 个局部变量（不要为 1 个值取消引用指针 2 次）
预先划分第一个循环的计数器（使用 while (counter--)）
通过用开关替换它来展开第二个循环（如果 sizeof（适合寄存器的类型）已知并且将始终相同。）

<强>编辑：如果第一个循环是瓶颈，展开可能就是答案。结合在值相等的情况下将条件数量减半，展开 4 次，我得到如下结果：

uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
}

当然，还剩下 (n / uint.sizeof) % 4 迭代要做，您可以通过交错一个开关混合到这个循环中，我把它留给读者邪恶的笑容作为练习。

You could try:

check if uint is the largest type that fits in your target CPU (ulongs might fit a native register better)
use 2 pointers of that type
use 2 local variables using *p++ (dont dereference the pointers 2 times for 1 value)
devide the counter of the 1st loop up front (use while (counter--))
unroll the 2nd loop by replacing it with a switch (if sizeof(type that fits in a register) is known and will be always the same.)

Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:

uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
}

Of course that leaves (n / uint.sizeof) % 4 iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.

回复收藏 0 原文

思念绕指尖 2024-08-16 23:40:04

我对此了解不多，但有一些向量指令可以一次将指令应用于多个字节。您可以使用这些结果来执行超快的 memcmp。我不知道您需要哪些说明，但如果您查找 Larrabee 新说明或查看这篇文章，您可能会找到您正在寻找的内容 http://www.ddj.com/architect/216402188

注意：这个 CPU 不是 ATM AFAIK

-编辑 - 现在我确信有一个指令集（尝试查看 SSE 或 SSE2 ），如果它们对齐，则可以一次比较 16 个字节。

你可以尝试这个纯c++代码。

template<class T>
int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = lhs + n/sizeof(T);
    while(lhs<endLHS) {
        int i = *lhs - *rhs;
        if(i != 0) return i > 0 ? 1 : -1;
        lhs++;
        rhs++;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

这里的优点是增加指针，这样您就可以取消引用它而不应用索引（它的 ptr_offset[index] 与 ptr_offset）。上面使用了一个模板，因此您可以在 64 位机器上使用 64 位。汇编中的 CMP 实际上只是减去检查 N 和 Z 标志。我只是在我的版本中进行比较，而不是比较 N 和减少 N。

I dont know much about it but there are vector instructions which can apply instructions to many bytes at once. You can use those results to do a blazing fast memcmp. I dont know which instructions you would need but if you look up Larrabee New Instructions or check out this article you may find what you are looking for http://www.ddj.com/architect/216402188

NOTE: This CPU isnt out ATM AFAIK

-Edit- Right now i am positive there is an instruction sets (try looking at SSE or SSE2) which may compare 16 bytes at once if they are aligned.

You can try this pure c++ code.

template<class T>
int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = lhs + n/sizeof(T);
    while(lhs<endLHS) {
        int i = *lhs - *rhs;
        if(i != 0) return i > 0 ? 1 : -1;
        lhs++;
        rhs++;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

The advantage here is your increasing the pointer so you can dereference it and not apply an index (its ptr_offset[index] vs ptr_offset). The above uses a template so you can use 64 bits on 64 bit machines. and CMPs in assembly are really just subtracts checking the N and Z flags. Instead of comparing N and decreasing N i just compare in my version.

回复收藏 0 原文

时光是把杀猪刀 2024-08-16 23:40:04

很大程度上取决于您的系统和数据。我们只能做出这么多假设。您使用什么处理器？它必须是直接的 C 代码吗？ CPU寄存器的宽度是多少？ CPU的缓存结构是怎样的？等等

这还取决于您的数据有多么不同。如果每个缓冲区的第一个字节不太可能相同，那么函数的速度就毫无意义，因为从逻辑上讲，它不会到达函数的其余部分。如果前 n-1 个字节通常是 sme，那么它就变得更加重要。

无论您如何进行测试，您都不太可能看到太多变化。

无论如何，这是我自己的一个小实现，它可能会也可能不会比你自己的更快（或者，考虑到我只是在进行过程中完成它，它可能会或可能不会工作；））：

int memoryCompare(const void* lhs, const void* rhs, size_t n) 
{
    uint_64 diff    = 0

    // Test the first few bytes until we are 32-bit aligned.
    while( (n & 0x3) != 0 && diff != 0 )
    {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;              
        }

    // Test the next set of 32-bit integers using comparisons with
    // aligned data.
    while( n > sizeof( uint_32 ) && diff != 0 )
    {
        diff = (uint_32*)lhs - (uint_32*)rhs;
        n   -= sizeof( uint_32 );
        ((uint_32*)lhs)++;
        ((uint_32*)rhs)++;
    }

    // now do final bytes.
    while( n > 0 && diff != 0 ) 
        {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;  
        }

    return (int)*diff / abs( diff ));
}

Well a lot depends upon your system and data. There is only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? etc etc

It also depends on how different your data is. If it is very unlikely that the first byte from each buffer is the same then the speed of the function is pretty meaningless as, logically, it won't get to the rest of the function. If it is likely that the first n-1 bytes are usually the sme then it becomes more important.

All in you are unlikely to see much change regardless of how you do the test.

Anyway this is a little implementation of my own, it may, or may not, be faster than your own (or, given i just made it up as i went along, it may, or may not, work ;)):

int memoryCompare(const void* lhs, const void* rhs, size_t n) 
{
    uint_64 diff    = 0

    // Test the first few bytes until we are 32-bit aligned.
    while( (n & 0x3) != 0 && diff != 0 )
    {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;              
        }

    // Test the next set of 32-bit integers using comparisons with
    // aligned data.
    while( n > sizeof( uint_32 ) && diff != 0 )
    {
        diff = (uint_32*)lhs - (uint_32*)rhs;
        n   -= sizeof( uint_32 );
        ((uint_32*)lhs)++;
        ((uint_32*)rhs)++;
    }

    // now do final bytes.
    while( n > 0 && diff != 0 ) 
        {
        diff = (uint_8*)lhs - (uint_8*)rhs;
        n--;
        ((uint_8*)lhs)++;
        ((uint_8*)rhs)++;  
        }

    return (int)*diff / abs( diff ));
}

回复收藏 0 原文

无力看清 2024-08-16 23:40:04

这个问题的答案对您有帮助吗？

如果编译器支持将 memcmp() 实现为内在/内置函数，那么您似乎很难超越它。

我不得不承认我对 D 知之甚少甚至一无所知，所以我不知道 D 编译器是否支持内在函数。

回复收藏 0 原文

楠木可依 2024-08-16 23:40:04

我认为 memcmp 被指定进行逐字节比较，无论数据类型如何。您确定您的编译器的实现保留了字符串语义吗？不应该。

回复收藏 0 原文

不甘平庸 2024-08-16 23:40:04

如果您相信编译器的优化，您可以尝试对 Acidzombie24s 建议进行一些修改：

template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = &lhs[n];
    while(lhs<endLHS) {
        //A good optimiser will keep these values in register
        //and may even be clever enough to just retest the flags
        //before incrementing the pointers iff we loop again.
        //gcc -O3 did the optimisation very well.
        if (*lhs > *rhs) return 1;
        if (*lhs++ < *rhs++) return -1;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

Here's the Complete gcc -O3 optimizationinnerloop in x86 assembler code
对于上述的 C 版本，仅传递 char 数组指针：

Loop:
        incl    %eax         ; %eax is lhs
        incl    %edx         ; %edx is rhs
        cmpl    %eax, %ebx   ; %ebx is endLHS
        jbe     ReturnEq
        movb    (%edx), %cl
        cmpb    %cl, (%eax)
        jg      ReturnGT
        jge     Loop
ReturnLT:
        ...

If you trust your compiler's optimisation you might try a few modifications to acidzombie24s suggestion:

template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
    const T* endLHS = &lhs[n];
    while(lhs<endLHS) {
        //A good optimiser will keep these values in register
        //and may even be clever enough to just retest the flags
        //before incrementing the pointers iff we loop again.
        //gcc -O3 did the optimisation very well.
        if (*lhs > *rhs) return 1;
        if (*lhs++ < *rhs++) return -1;
    }
    //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
    return 0;
}

Here's the entire gcc -O3 optimised inner loop in x86 assembler code
for a C version of the above, only passing char array pointers:

Loop:
        incl    %eax         ; %eax is lhs
        incl    %edx         ; %edx is rhs
        cmpl    %eax, %ebx   ; %ebx is endLHS
        jbe     ReturnEq
        movb    (%edx), %cl
        cmpb    %cl, (%eax)
        jg      ReturnGT
        jge     Loop
ReturnLT:
        ...

回复收藏 0 原文

~没有更多了~