在 D 语言中比较两个内存块的最有效方法是什么?
我需要一个内存块的比较函数,用于在 D 编程语言中对字节数组进行二进制搜索。它不需要有任何有用的语义。它只需要速度快并且是一个有效的比较函数(产生总排序的函数)。已知要比较的内存块具有相同的长度。
C 的 memcmp 实际上相当慢,因为它试图保留有用的字符串比较语义,而我不需要。以下是迄今为止我想到的最好的。有谁知道更好的东西,最好不要深入研究不可移植的 CPU 特定指令?
// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics. It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
for(; n >= uint.sizeof; n -= uint.sizeof) {
if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
return -1;
} else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
return 1;
}
lhs += uint.sizeof;
rhs += uint.sizeof;
}
for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
return -1;
} else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
return 1;
}
lhs += ubyte.sizeof;
rhs += ubyte.sizeof;
}
return 0;
}
编辑:我已经阅读了 SSE,但我不想使用它,原因有 3 个:
- 它不可移植。
- 它需要在 ASM 中编程。
- 比较指令假定您的数据是浮点型,如果某些数据恰好与 NaN 的模式匹配,则可能会出现问题。
I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.
C's memcmp
is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?
// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics. It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
for(; n >= uint.sizeof; n -= uint.sizeof) {
if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
return -1;
} else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
return 1;
}
lhs += uint.sizeof;
rhs += uint.sizeof;
}
for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
return -1;
} else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
return 1;
}
lhs += ubyte.sizeof;
rhs += ubyte.sizeof;
}
return 0;
}
Edit: I've read up on SSE and I don't want to use it for 3 reasons:
- It's not portable.
- It requires programming in ASM.
- The comparison instructions assume your data is floating points, which might be problematic if some data happens to match the pattern for a NaN.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
您可以尝试:
<强>编辑:如果第一个循环是瓶颈,展开可能就是答案。结合在值相等的情况下将条件数量减半,展开 4 次,我得到如下结果:
当然,还剩下
(n / uint.sizeof) % 4
迭代要做,您可以通过交错一个开关混合到这个循环中,我把它留给读者邪恶的笑容作为练习。You could try:
Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:
Of course that leaves
(n / uint.sizeof) % 4
iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.我对此了解不多,但有一些向量指令可以一次将指令应用于多个字节。您可以使用这些结果来执行超快的 memcmp。我不知道您需要哪些说明,但如果您查找 Larrabee 新说明或查看这篇文章,您可能会找到您正在寻找的内容 http://www.ddj.com/architect/216402188
注意:这个 CPU 不是 ATM AFAIK
-编辑 - 现在我确信有一个指令集(尝试查看 SSE 或 SSE2 ),如果它们对齐,则可以一次比较 16 个字节。
你可以尝试这个纯c++代码。
这里的优点是增加指针,这样您就可以取消引用它而不应用索引(它的 ptr_offset[index] 与 ptr_offset)。上面使用了一个模板,因此您可以在 64 位机器上使用 64 位。汇编中的 CMP 实际上只是减去检查 N 和 Z 标志。我只是在我的版本中进行比较,而不是比较 N 和减少 N。
I dont know much about it but there are vector instructions which can apply instructions to many bytes at once. You can use those results to do a blazing fast memcmp. I dont know which instructions you would need but if you look up Larrabee New Instructions or check out this article you may find what you are looking for http://www.ddj.com/architect/216402188
NOTE: This CPU isnt out ATM AFAIK
-Edit- Right now i am positive there is an instruction sets (try looking at SSE or SSE2) which may compare 16 bytes at once if they are aligned.
You can try this pure c++ code.
The advantage here is your increasing the pointer so you can dereference it and not apply an index (its ptr_offset[index] vs ptr_offset). The above uses a template so you can use 64 bits on 64 bit machines. and CMPs in assembly are really just subtracts checking the N and Z flags. Instead of comparing N and decreasing N i just compare in my version.
很大程度上取决于您的系统和数据。我们只能做出这么多假设。您使用什么处理器?它必须是直接的 C 代码吗? CPU寄存器的宽度是多少? CPU的缓存结构是怎样的?等等
这还取决于您的数据有多么不同。如果每个缓冲区的第一个字节不太可能相同,那么函数的速度就毫无意义,因为从逻辑上讲,它不会到达函数的其余部分。如果前 n-1 个字节通常是 sme,那么它就变得更加重要。
无论您如何进行测试,您都不太可能看到太多变化。
无论如何,这是我自己的一个小实现,它可能会也可能不会比你自己的更快(或者,考虑到我只是在进行过程中完成它,它可能会或可能不会工作;)):
Well a lot depends upon your system and data. There is only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? etc etc
It also depends on how different your data is. If it is very unlikely that the first byte from each buffer is the same then the speed of the function is pretty meaningless as, logically, it won't get to the rest of the function. If it is likely that the first n-1 bytes are usually the sme then it becomes more important.
All in you are unlikely to see much change regardless of how you do the test.
Anyway this is a little implementation of my own, it may, or may not, be faster than your own (or, given i just made it up as i went along, it may, or may not, work ;)):
这个问题的答案对您有帮助吗?
如果编译器支持将 memcmp() 实现为内在/内置函数,那么您似乎很难超越它。
我不得不承认我对 D 知之甚少甚至一无所知,所以我不知道 D 编译器是否支持内在函数。
Does the answer to this question help you at all?
If the compiler has support for implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time besting it.
I have to admit to knowing little to nothing about D, so I have no idea whether the D compiler has support for intrinsics.
我认为 memcmp 被指定进行逐字节比较,无论数据类型如何。您确定您的编译器的实现保留了字符串语义吗?不应该。
I think memcmp is specified to do a byte-by-byte comparison, regardless of data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.
如果您相信编译器的优化,您可以尝试对 Acidzombie24s 建议进行一些修改:
Here's the Complete gcc -O3 optimizationinnerloop in x86 assembler code
对于上述的 C 版本,仅传递 char 数组指针:
If you trust your compiler's optimisation you might try a few modifications to acidzombie24s suggestion:
Here's the entire gcc -O3 optimised inner loop in x86 assembler code
for a C version of the above, only passing char array pointers: