在数组中查找两个最小 int64 元素的最快方法
我的数组大小从 1000 到 10000 (1k .. 10k)。每个元素都是 int64。我的任务是找到数组中两个最小的元素,即最小元素和剩余元素中的最小值。
我想为 Intel Core2 或 Corei7(CPU 模式为 64 位)获得最快的 C++ 单线程代码。
这个函数(从数组中获取 2 个最小的值)是热点,它嵌套在两个或三个 for 循环中,迭代次数巨大。
当前代码如下:
int f()
{
int best; // index of the minimum element
int64 min_cost = 1LL << 61;
int64 second_min_cost = 1LL << 62;
for (int i = 1; i < width; i++) {
int64 cost = get_ith_element_from_array(i); // it is inlined
if (cost < min_cost) {
best = i;
second_min_cost = min_cost;
min_cost = cost;
} else if (cost < second_min_cost) {
second_min_cost = cost;
}
}
save_min_and_next(min_cost, best, second_min_cost);
}
I have arrays with sizes from 1000 to 10000 (1k .. 10k). Each element is int64. My task is to find two smallest elements of the arrays, the minimum element and the minimum from the remaining.
I want to get fastest possible single-threaded code in C++ for Intel Core2 or Corei7 (cpu mode is 64 bit).
This function (getting the 2 smallest from array) is the hotspot, it is nested in two or three for loops with huge iteration count.
Current code is like:
int f()
{
int best; // index of the minimum element
int64 min_cost = 1LL << 61;
int64 second_min_cost = 1LL << 62;
for (int i = 1; i < width; i++) {
int64 cost = get_ith_element_from_array(i); // it is inlined
if (cost < min_cost) {
best = i;
second_min_cost = min_cost;
min_cost = cost;
} else if (cost < second_min_cost) {
second_min_cost = cost;
}
}
save_min_and_next(min_cost, best, second_min_cost);
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
查看
partial_sort
和nth_element
如果您只想要第二低的值,则 nth_element 为你的家伙
Look at
partial_sort
andnth_element
If you only wanted the second lowest value, nth_element is your guy
尝试反转 if:
并且您可能应该使用相同的值初始化 min_cost 和 secondary_min_cost ,使用 int64 的最大值(或者更好地使用 qbert220 的建议)
Try inverting the if:
And you should probably initialize min_cost and second_min_cost with the same value, using the max value of int64 (or even better use the suggestion of qbert220)
一些小事情(可能已经发生了,但我想可能值得尝试)。
稍微展开循环 - 比如说以 8 的步幅进行迭代(即一次缓存行),预取主体中的下一个缓存行,然后处理 8 个项目。为了避免大量检查,请确保结束条件是 8 的倍数,并且应在循环外部处理剩余的项目(少于 8) - 展开...
对于不感兴趣的项目,您是在身体里做两次检查,也许你可以修剪到1? ,如果
cost
小于second_min
,则也检查min
- 否则无需打扰...Some small things (which may be happening already, but may be worth trying I guess).
Unroll the loop slightly - say for example iterate in strides of 8 (i.e. cache line at a time), pre-fetch the next cache line in the body, then process the 8 items. To avoid lots of checks, ensure the end condition is a multiple of 8, and the left over items (less than 8) should be processed outside of the loop - unrolled...
For the items of no interest, you are doing two checks in the body, may be you can trim to 1? i.e. if the
cost
is less thansecond_min
, then checkmin
as well - else no need to bother...您最好先检查 secondary_min_cost,因为它是唯一需要修改结果的条件。这样,您将在主循环中获得一个分支,而不是两个。这应该有很大帮助。
除此之外,几乎没有什么需要优化的,你已经接近最佳状态了。展开可能会有所帮助,但我怀疑它在这种情况下会带来任何显着的优势。
所以,它变成:
You'd better check second_min_cost first, since it is the only condition which requires to modify the result. This way, you'll get one branch, instead of 2, into your main loop. This should help quite a bit.
Other than that, there is very little to optimise, your are already close to optimal. Unrolling may help, but i doubt it will bring any significant advantage in this scenario.
So, it becomes :
确保您的数组读取是自愿的,这样就不会引入不必要的缓存未命中。
假设数组读取很简单,这段代码可能应该非常接近现代 CPU 上的带宽限制。您需要分析和/或计算它是否仍有 CPU 优化的空间。
Make sure your array-reading is will-behaved so it doesn't introduce needless cache-misses.
This code should probably be very close to bandwidth-bound on modern CPU:s, assuming the array-reading is simple. You need to profile and/or calculate if it still seems to have any headroom for CPU optimizations.
您所拥有的是
O(n)
并且对于随机数据来说是最佳的。这意味着,您已经拥有最快的了。改进这一点的唯一方法是为数组赋予某些属性,例如,始终保持其排序或将其设为堆。
What you have there, is
O(n)
and optimal for random data. That means, you already have the fastest.The only way you can improve this is by giving certain properties to your array, for example, keeping it sorted at all times or by making it a heap.
优点是您的算法会扫描一次数字。你是最理想的。
缓慢的一个重要原因可能来自元素的排列方式。如果它们在一个数组中,我的意思是一个 C 数组(或 C++ 向量),其中所有元素都是连续的,并且您向前扫描它们,那么从内存角度来看,您也是最佳的。否则,你可能会有一些惊喜。例如,如果您的元素位于链表中或分散聚集,那么您可能会因内存访问而受到惩罚。
The good point is that your algorithm scans the numbers once. You're optimal.
An important source of slowness could come from the way your elements are arranged. If they are in an array, I mean a C array (or C++ vector) where all the elements are contiguous and you scan them forward, then memory-wise you're optimal too. Otherwise, you could have some surprises. For instance, if your elements are in a linked list, or scatter gathered, then you can have penalty for memory accesses.