使用SSE加速lower_bound函数
在我目前正在从事的一个项目中,我经常需要在排序数组中找到可以插入元素的最低可能索引(如 C++ 中的 std::lower_bound )。 使用 SSE 来加速我的算法对我来说似乎很有吸引力,因为我正在使用 uint32 数组,其大小通常是处理器缓存行的大小。 我以前从未使用过 SSE 指令,因此我无法弄清楚该函数的 SSE 实现是什么样子。请给出提示,帮助我用 SSE 最佳地写出它。
In a project I'm currently working on I often need to find the lowest possible index in a sorted array at which an element can be inserted (like std::lower_bound in C++).
It seems pretty attractive to me to use SSE to speed up my algorithm since I'm working with uint32 arrays which size is typically the size of a processor cache line.
I've never used SSE instructions before, so I can't manage to figure out what an SSE implementation of this function would looks like. Please give hints to help me write it out optimally wtih SSE.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
没有像 std::lower_bound 这样的东西可以使用 SSE 很好地扩展。 SSE 使事情变得更快的原因是它允许您同时进行多项计算。例如,单个 SSE 指令可能会导致同时进行 4 个乘法运算。但是,
std::lower_bound
的操作方式无法并行化,因为算法中的每一步都需要前面步骤的比较结果。另外,它已经是 O(lg n),因此不太可能成为瓶颈。此外,在转向内联汇编之前,您应该知道,每当您使用内联汇编时,您都会击败程序该部分可能发生的大多数编译器优化,并且通常结果是您的程序会变慢 - 编译器通常会编写更好的汇编程序比我们人类做的还要多。
如果您想使用 SSE,最好使用内在函数——编译器提供的特殊“函数”或关键字,它们调用 SSE 指令,但允许进行优化。此类内在函数可在 Microsoft 的 Visual C++ 以及 GNU 编译器集合 。 (可能是大多数编译器。请参阅编译器的文档)
您应该尝试从一开始就不需要调用它,而不是尝试使用 SSE 来加速 std::lower_bound 。例如,如果您不断使用
lower_bound
将元素插入到向量中,您应该知道您有效创建的是 插入排序,而且是一个很差的插入排序,需要二次时间。您可能最好简单地将新元素放在向量的末尾,然后在需要排序时对向量进行排序,这会将事情减少到 O(n lg n) 排序。如果您的数据访问模式使您过于频繁地使用,那么您应该使用 std::set 之类的东西来代替,它为插入提供 O(lg n) 操作,而不是 O( n + lg n) 您当前使用向量获得的插入。当然,记得进行基准测试:)
Nothing like
std::lower_bound
is going to scale well using SSE. The reason SSE makes things faster is that it allows you to do several calculations at once. For example, a single SSE instruction might result in 4 multiply operations going on at once. However, the waystd::lower_bound
operates cannot be parallelized, because each step in the algorithm requires the comparison results of previous steps. Plus, it's already O(lg n), and as a result unlikely to be a bottleneck.Moreover, before moving to inline assembly, you should know that whenever you use inline assembly, you defeat most compiler optimizations that might occur on that section of your program, and often as a result your program will be slower -- compilers usually write better assembler than us humans do.
If you want to use SSE, you are better off using intrinsics -- special "functions" or keywords provided by the compiler, which call the SSE instruction but otherwise allow optimizations to occur. Such intrinsics are available in Microsoft's Visual C++, as well as the GNU Compiler Collection. (And probably most any compiler. Consult your compiler's documentation)
Rather than trying to speed up
std::lower_bound
using SSE, you should try to not need to call it in the first place. For example, if you're constantly inserting elements into the vector usinglower_bound
, you should know what you've effectively created is insertion sort, and a poor insertion sort at that, which will require quadradic time. You would likely be better off simply putting your new elements on the end of the vector, and then sorting the vector when you need it sorted, which reduces things to an O(n lg n) sort. If your data access patterns are such that you would be resorting too often then you should use something like astd::set
instead, which provides O(lg n) operations for insertions, rather than the O(n + lg n) insertions you're currently getting with the vectors.And of course, remember to benchmark :)