为什么处理一个未分类的数组的速度与现代X86-64 clang处理排序的数组相同?
我发现了这个受欢迎的〜9岁问题,决定仔细检查其结果。
因此,我有AMD Ryzen 9 5950X,Clang ++ 10和Linux,我从问题中复制了代码,这是我得到的:
排序-0.549702S :
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
unsorted -0.546554S :
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
我很确定,未分类版本的速度更快的事实只是噪音,但似乎不再慢了。
因此, CPU 的体系结构发生了什么变化(因此它不再慢了数量级)?
以下是多个运行的结果:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
以防万一,这是我
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
// std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
return 0;
}
。
的
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
主
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您链接的问题中的几个答案讨论了将代码重写为无分支,从而避免了任何分支预测问题。这就是您更新的编译器正在做的。
具体来说,带有
-O3
vecordectionals> vecordizes 内部循环。 请参阅Godbolt上的代码,大会的第36-67行。该代码有点复杂,但是您绝对看不到的一件事是data [c]&gt; = 128
测试上的任何条件分支。取而代之的是,它使用矢量比较指令(pcmpgtd
),其输出是带有1 s匹配元素的掩码,而0 s则不匹配。随后的pand
使用此掩码将非匹配元素替换为0,因此当无条件添加到总和中时,它们不会贡献任何内容。粗糙的c ++等效物实际上是
代码可让两个运行的64位
sum
s,对于数组的均匀元素和奇数元素,以便它们可以并行累积,然后在末尾添加在一起循环。一些额外的复杂性是照顾签名 - 将32位
数据
元素延伸到64位;这就是pxor xmm5,xmm5之类的序列; PCMPGTD XMM5,XMM4; punpckldq xmm4,xmm5
完成。打开-mavx2
,您会看到一个更简单的vpmovsxdq ymm5,xmm5
。该代码看起来也很长,因为循环已经展开,处理
data
的8个元素。Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with
-O3
vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on thedata[c] >= 128
test. Instead it uses vector compare instructions (pcmpgtd
) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequentpand
with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.The rough C++ equivalent would be
The code actually keeps two running 64-bit
sum
s, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.Some of the extra complexity is to take care of sign-extending the 32-bit
data
elements to 64 bits; that's what sequences likepxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5
accomplish. Turn on-mavx2
and you'll see a simplervpmovsxdq ymm5, xmm5
in its place.The code also looks long because the loop has been unrolled, processing 8 elements of
data
per iteration.