当前位置：文江博客话题详情

优化 Long.bitCount

发布于 2024-10-15 07:21:40 字数 545 浏览 10 评论 0原文

我有一个程序对 Long.bitCount() 进行大量调用，以至于在一个 CPU 内核上占用了 33% 的周期。有没有比Sun JDK版本更快的实现方法？

我已经尝试过：

这个算法（我认为这正是JDK 实现了它）
查找 2⁸ 和 2²² 之间各种大小的表（一次查看几位并添加结果）

但我做不到比带有手动展开循环的 2¹⁶ 条目查找表更好（大约 27% CPU）。
还可以如何针对 Java 进行优化？

注意：这个问题是关于Java特定的优化，但是这个类似的（与语言无关的）问题还有许多其他算法。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

又爬满兰若 2024-10-22 07:21:40

如果您使用的是最新的 x86 CPU，则有一条相关指令 popcnt。

在最新版本的 Java 中，Long.bitCount() 使用此指令。
只需使用 -XX:+UsePopCountInstruction （这是最近版本中的默认设置）

但是，在 JRE 6.0_u18 到 7.0_u5 中存在一些错误：
https://bugs.java.com/bugdatabase/view_bug.do?bug_id= 7063674

回复收藏 0 原文

携君以终年 2024-10-22 07:21:40

这似乎是 GPU 可以完美解决的问题之一。它应该能够将您的时间减少几个数量级。

否则我认为你可能需要在更高的层面上处理它。让多个线程同时处理不同的数据段（我相信您已经这样做了），在收集数据时处理数据，共享多个系统的工作，诸如此类。

回复收藏 0 原文

天涯沦落人 2024-10-22 07:21:40

如果您的计算机具有可以处理比 64 位的某些倍数更宽的数据的整数 ALU（也称为 SIMD，例如 SSE2 或 VMX），则您可以一次计算多个 64 位元素的位计数。

不幸的是，这将要求您使用比 Java 更低级的语言提供特定于机器的实现。

回复收藏 0 原文

淡淡の花香 2024-10-22 07:21:40

我怀疑您的应用程序受内存限制而不是受 CPU 限制，即它花费更多时间从内存中获取值而不是计算它们的位数。在这种情况下，您应该尝试减小工作集的大小或改进访问局部性以减少缓存未命中（如果算法允许）。

回复收藏 0 原文

不顾 2024-10-22 07:21:40

我现在正在使用这种方法，它一次交错四个 popcnt 操作。
它基于此 C 实现。

private static final long M0=0x5555555555555555L,
                          M1=0x3333333333333333L,
                          M2=0x0f0f0f0f0f0f0f0fL;
public void store4Tags(long tag0, long tag1, long tag2, long tag3) {
    long count0 = tag0,
         count1 = tag1,
         count2 = tag2,
         count3 = tag3;
    count0 = (count0 & M0) + ((count0 >>> 1) & M0);
    count1 = (count1 & M0) + ((count1 >>> 1) & M0);
    count2 = (count2 & M0) + ((count2 >>> 1) & M0);
    count3 = (count3 & M0) + ((count3 >>> 1) & M0);

    count0 = (count0 & M1) + ((count0 >>> 2) & M1);
    count1 = (count1 & M1) + ((count1 >>> 2) & M1);
    count2 = (count2 & M1) + ((count2 >>> 2) & M1);
    count3 = (count3 & M1) + ((count3 >>> 2) & M1);

    count0 = (count0 + (count0 >>> 4)) & M2;
    count1 = (count1 + (count1 >>> 4)) & M2;
    count2 = (count2 + (count2 >>> 4)) & M2;
    count3 = (count3 + (count3 >>> 4)) & M2;

    count0 += count0 >>> 8;
    count1 += count1 >>> 8;
    count2 += count2 >>> 8;
    count3 += count3 >>> 8;

    count0 += count0 >>> 16;
    count1 += count1 >>> 16;
    count2 += count2 >>> 16;
    count3 += count3 >>> 16;

    count0 += count0 >>> 32;
    count1 += count1 >>> 32;
    count2 += count2 >>> 32;
    count3 += count3 >>> 32;

    storeWithPopCnt(tag0, 0x3f & (int) count0);
    storeWithPopCnt(tag1, 0x3f & (int) count1);
    storeWithPopCnt(tag2, 0x3f & (int) count2);
    storeWithPopCnt(tag3, 0x3f & (int) count3);
}

这稍微优于查找表版本，并且不消耗缓存。

I am now using this method, which interleaves four popcnt operations at a time.
It is based on this C implementation.

private static final long M0=0x5555555555555555L,
                          M1=0x3333333333333333L,
                          M2=0x0f0f0f0f0f0f0f0fL;
public void store4Tags(long tag0, long tag1, long tag2, long tag3) {
    long count0 = tag0,
         count1 = tag1,
         count2 = tag2,
         count3 = tag3;
    count0 = (count0 & M0) + ((count0 >>> 1) & M0);
    count1 = (count1 & M0) + ((count1 >>> 1) & M0);
    count2 = (count2 & M0) + ((count2 >>> 1) & M0);
    count3 = (count3 & M0) + ((count3 >>> 1) & M0);

    count0 = (count0 & M1) + ((count0 >>> 2) & M1);
    count1 = (count1 & M1) + ((count1 >>> 2) & M1);
    count2 = (count2 & M1) + ((count2 >>> 2) & M1);
    count3 = (count3 & M1) + ((count3 >>> 2) & M1);

    count0 = (count0 + (count0 >>> 4)) & M2;
    count1 = (count1 + (count1 >>> 4)) & M2;
    count2 = (count2 + (count2 >>> 4)) & M2;
    count3 = (count3 + (count3 >>> 4)) & M2;

    count0 += count0 >>> 8;
    count1 += count1 >>> 8;
    count2 += count2 >>> 8;
    count3 += count3 >>> 8;

    count0 += count0 >>> 16;
    count1 += count1 >>> 16;
    count2 += count2 >>> 16;
    count3 += count3 >>> 16;

    count0 += count0 >>> 32;
    count1 += count1 >>> 32;
    count2 += count2 >>> 32;
    count3 += count3 >>> 32;

    storeWithPopCnt(tag0, 0x3f & (int) count0);
    storeWithPopCnt(tag1, 0x3f & (int) count1);
    storeWithPopCnt(tag2, 0x3f & (int) count2);
    storeWithPopCnt(tag3, 0x3f & (int) count3);
}

This outperforms the lookup table version slightly, and consumes no cache.

回复收藏 0 原文

岁月苍老的讽刺 2024-10-22 07:21:40

我不是该主题的专家，但如果您还没有看过这些页面，它们可能会有所帮助：

http://www.reddit.com/r/programming/comments/84sht/fast_bit_couting_algorithms/

http://www-graphics.stanford.edu/~seander/bithacks.html

您可能还想浏览一下许多图形库，尤其是那些较低级别和/或直接与硬件对话。

编辑：如果您可以选择编写低级平台特定代码，并且可以针对特定架构，那么看起来您可以使用相对较新引入的 POPCNT 指令（在一些最新的 AMD 和 Intel 处理器上可用）来提高潜在的速度。 http://kent-vandervelden.blogspot.com /2009/10/counting-bits-population-count-and.html 和另一篇带有基准的文章：http ://www.strchr.com/crc32_popcnt

回复收藏 0 原文

倾｀听者〃 2024-10-22 07:21:40

根据我的理解：

我只会使用 33% 作为指标，因为对小方法进行分析确实可以改变整体性能。所以我会在一些大数据集上运行算法并查看总时间。我会根据总时间变化来考虑优化的效率。我还将包括一个警告阶段，以便 JIT 可以进行优化。

事实上，无论如何，位计数似乎是算法的关键部分之一……如果你优化所有内容，并设法将所有关键部分的速度提高 10 倍，那么你仍然可以为该部分分析近 33% 的内容。从本质上来说这并不坏。

受到此链接 http://bmagic.sourceforge.net/bmsse2opt.html 的启发，您可以尝试如果我没记错的话，现在使用所有 intel/AMD 处理器中存在的 SSE 指令（否则你总是可以故障恢复到 JAVA）。这篇文章的一个有趣的部分是......大多数时候，无论如何它都是内存限制的。但我仍然会尝试看看这对你有何作用。

GPU 非常适合超快的处理速度（是 CPU 核心的一百倍）和带宽。主要问题是将数据推送到 CPU 专用内存并返回结果。但如果你不只是执行位计数而是更多的操作，这可能会带来巨大的收益。

反正没有捷径，你必须尝试多种方法，看看哪种方法带来的收益最大。不要计算百分比，而要计算花费的总时间。

回复收藏 0 原文

被翻牌 2024-10-22 07:21:40

与其优化此函数，不如优化此函数的使用可能会更好。例如，您可以保留一个柜台。

public void set(int n) {
   if(!get(n)) bitCount++;
   // set the bit
}
public void clear(int n) {
   if(get(n)) bitCount--;
   // clear the bit
}
public int bitCount() {
   return bitCount;
}

这避免了通过跟踪设置的位数来扫描数据。这将开销转移到设置或清除位的频率上，并且使得获取设置的位数变得微不足道。它出现在您的用例中，后者更常见。

Rather than optimise this function, you are likely to be better off optimising the usage of this function. E.g. you could keep a counter.

public void set(int n) {
   if(!get(n)) bitCount++;
   // set the bit
}
public void clear(int n) {
   if(get(n)) bitCount--;
   // clear the bit
}
public int bitCount() {
   return bitCount;
}

This avoids scanning the data by keeping track of the number of the count of bits set. This moves the overhead to how often bits and set or cleared and makes getting the number of bits set trivial. It appears in your use case, the later is much more often.

回复收藏 0 原文

~没有更多了~