Optimizing Long.bitCount
I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?
I have tried:
- This algorithm (I think this is exactly how the JDK implements it)
- lookup tables of various sizes between 2^8 and 2^22 entries (looking at a few bits at a time and adding the results)
But I couldn't do any better than a 2^16-entry lookup table with a manually-unrolled loop (about 27% CPU.)
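For reference, the lookup-table approach described above can be sketched like this (an illustration of the technique, not the asker's exact code; the 2^16-entry size matches the best-performing variant mentioned):

```java
// Sketch of a 2^16-entry lookup-table popcount: precompute the bit count
// of every 16-bit value, then sum four table lookups per long.
public final class PopCount {
    private static final int[] TABLE = new int[1 << 16];
    static {
        // TABLE[i] = popcount(i), built incrementally: drop the low bit,
        // reuse the already-computed smaller entry.
        for (int i = 1; i < TABLE.length; i++) {
            TABLE[i] = TABLE[i >> 1] + (i & 1);
        }
    }

    // Manually-unrolled: one lookup per 16-bit chunk of the long.
    public static int bitCount(long x) {
        return TABLE[(int) (x & 0xFFFF)]
             + TABLE[(int) ((x >>> 16) & 0xFFFF)]
             + TABLE[(int) ((x >>> 32) & 0xFFFF)]
             + TABLE[(int) (x >>> 48)];
    }
}
```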
How else might this be optimized for Java?
Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.
If you are on a recent x86 CPU there is an instruction for this, popcnt.
In recent versions of Java, Long.bitCount() uses this instruction.
Just use -XX:+UsePopCountInstruction (this is the default in recent versions)
However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7063674
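A quick way to check whether your JVM has the flag enabled (shell sketch; `myapp.jar` is a placeholder for your application):

```shell
# Print the final value of the flag on a HotSpot JVM:
java -XX:+PrintFlagsFinal -version | grep UsePopCountInstruction

# Force it on explicitly (the default on recent versions when the CPU supports popcnt):
java -XX:+UsePopCountInstruction -jar myapp.jar
```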
This seems like one of those problems that is simply perfect for the GPU to work on. It should be able to slash your time by a couple orders of magnitude.
Otherwise I think you may have to deal with it at a higher level. Having multiple threads working on different segments of data at a time (which I'm sure you already do), processing the data while you are collecting it, sharing the work around multiple systems--something like that.
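The multi-threading suggestion can be sketched with a parallel stream, assuming the values live in a `long[]` (an assumption, since the asker's data layout is unknown):

```java
import java.util.Arrays;

// Sketch: spread the bit counting over all cores via a parallel stream.
public final class ParallelBitCount {
    public static long totalBits(long[] data) {
        return Arrays.stream(data)
                     .parallel()                    // fork-join over segments
                     .map(v -> Long.bitCount(v))    // count bits per element
                     .sum();
    }
}
```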
If your machine has an integer ALU that can process data wider than 64 bits (also known as SIMD, such as SSE2 or VMX), you can compute the bit counts of several 64-bit elements at once.
Unfortunately, this will require you to provide machine-specific implementations in a lower-level language than Java.
I suspect that your app is memory-bound rather than CPU-bound, i.e. it spends more time fetching the values from memory than counting their bits. In that case you should try to reduce the size of the working set or improve access locality to reduce cache misses (if the algorithm allows it).
I am now using this method, which interleaves four popcnt operations at a time.
It is based on this C implementation.
This outperforms the lookup table version slightly, and consumes no cache.
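The interleaving idea might look roughly like this in Java (a sketch of the concept only; the linked C implementation is not reproduced here):

```java
// Sketch: count bits of four longs per loop iteration so the CPU can
// overlap the four independent dependency chains.
public final class InterleavedPopCount {
    public static long bitCountAll(long[] data) {
        long total = 0;
        int i = 0;
        int limit = data.length & ~3;  // round length down to a multiple of 4
        for (; i < limit; i += 4) {
            total += Long.bitCount(data[i])
                   + Long.bitCount(data[i + 1])
                   + Long.bitCount(data[i + 2])
                   + Long.bitCount(data[i + 3]);
        }
        for (; i < data.length; i++) {  // handle the 0-3 leftover elements
            total += Long.bitCount(data[i]);
        }
        return total;
    }
}
```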
I'm no expert in the subject, but in case you haven't seen these pages, they may help:
http://www.reddit.com/r/programming/comments/84sht/fast_bit_couting_algorithms/
http://www-graphics.stanford.edu/~seander/bithacks.html
You may also want to poke around the many graphics libraries out there, especially those that are lower-level and/or speak directly to hardware.
EDIT: looks like you can use the relatively newly introduced POPCNT instruction (available on some recent AMD and Intel processors) for a potential speed increase, if you have the option to write low-level platform-specific code, and can target that specific architecture. http://kent-vandervelden.blogspot.com/2009/10/counting-bits-population-count-and.html and another article with benchmarks: http://www.strchr.com/crc32_popcnt
From my understanding:
I would use the 33% figure as an indicator only, since profiling small methods can really distort the overall picture. So I would run the algorithm on some big dataset, look at the total time, and judge the efficiency of my optimizations by how that total time changes. I would also include a warm-up phase so that the JIT can do its optimizations.
In fact, the bit counting seems to be one of the key parts of your algorithm anyway... if you optimize everything and manage to make all the key parts 10 times faster, this part could still profile at something near 33%. That's not bad in itself.
Taking inspiration from this link http://bmagic.sourceforge.net/bmsse2opt.html, you could try to use the SSE instructions present in all Intel/AMD processors now, if I remember right (otherwise you could always fall back to Java). An interesting point in that article is that, most of the time, it is memory-bound anyway. But I would still try it and see how it works for you.
A GPU would be a perfect fit for insanely fast processing (easily a hundred times that of one CPU core) and bandwidth. The main problem would be pushing the data into the GPU's dedicated memory and getting the results back. But if you perform more operations than just bit counting, this could bring huge gains.
There is no shortcut anyway: you must try several approaches and see which brings the most gain. And don't measure percentages but total time spent.
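The warm-up-then-measure-total-time advice above can be sketched as follows (a minimal harness for illustration, not a substitute for a proper benchmark framework):

```java
import java.util.Random;

// Minimal timing sketch: warm up so the JIT compiles the hot method,
// then measure total wall time over a big dataset.
public final class Bench {
    static long work(long[] data) {
        long sum = 0;
        for (long v : data) sum += Long.bitCount(v);
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        Random rnd = new Random(42);             // fixed seed for repeatability
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextLong();

        for (int i = 0; i < 10; i++) work(data); // warm-up: let the JIT optimize

        long t0 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100; i++) sum += work(data);
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("checksum " + sum + " in " + elapsedMs + " ms");
    }
}
```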
Rather than optimising this function, you are likely to be better off optimising the usage of this function. E.g. you could keep a counter.
This avoids scanning the data by keeping a running count of the bits that are set. It moves the overhead to wherever bits are set or cleared, and makes getting the number of set bits trivial. It appears that in your use case, the latter is much more frequent.
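The counter idea might be sketched as follows (a hypothetical wrapper class, not from the original answer):

```java
// Sketch: track the number of set bits incrementally as words are updated,
// so reading the count never requires a scan.
public final class CountedBitSet {
    private final long[] words;
    private long cardinality;  // running count of set bits

    public CountedBitSet(int nWords) {
        words = new long[nWords];
    }

    public void setWord(int index, long newValue) {
        // Pay the bitCount cost on writes, not on reads.
        cardinality += Long.bitCount(newValue) - Long.bitCount(words[index]);
        words[index] = newValue;
    }

    public long cardinality() {
        return cardinality;  // O(1): no scan of the data
    }
}
```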