I'm currently developing a Java-based library for network coding (http://en.wikipedia.org/wiki/Network_coding). This is very CPU-intensive, so I need some help optimizing the encoding stage. What I'm essentially doing is creating random linear combinations of the original data, where addition is XOR and multiplication is a Galois-field multiplication (in GF(2^16)).
I've come as far as I can with the optimizations. For instance, I'm using tricks like this one: http://groups.google.com/group/comp.dsp/browse_thread/thread/cba57ae9db9971fd/7cd21eec39ddae1a?hl=en&lnk=gst&q=Sarwate+Galois#7cd21eec39ddae1a to make the multiplications faster.
I'm therefore looking for tips on how to optimize this further. It's hard to profile, since the profilers I've used don't give any hints about which operation is the most expensive (e.g. is it the array lookup or the XOR?). So I'm at the point where I'm more or less randomly trying out different ideas and testing whether they improve the overall performance.
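For context, the table-based multiplication that the linked trick builds on can be sketched as follows. This is a hedged illustration, not the library's actual code: the class name, table names, and the primitive polynomial 0x1100B (x^16 + x^12 + x^3 + x + 1) are assumptions. The exp table is doubled so that exp[log(a) + log(b)] can be indexed directly, with no modulo, which is exactly the shape of the expTable[logSum] lookup in the loop further down.

```java
// Sketch of GF(2^16) log/exp-table multiplication. The primitive
// polynomial 0x1100B is an assumption; the actual library may use another.
public class Gf16 {
    static final int ORDER = (1 << 16) - 1;      // 65535 nonzero elements
    static final int PRIM_POLY = 0x1100B;        // x^16 + x^12 + x^3 + x + 1
    static final int[] LOG = new int[1 << 16];
    static final int[] EXP = new int[2 * ORDER]; // doubled: no "% ORDER" needed

    static {
        int x = 1;
        for (int i = 0; i < ORDER; i++) {
            EXP[i] = x;
            LOG[x] = i;
            x <<= 1;                              // multiply by the generator x
            if ((x & 0x10000) != 0) x ^= PRIM_POLY; // reduce mod the polynomial
        }
        // Second copy so LOG[a] + LOG[b] (at most 2*ORDER - 2) indexes directly.
        for (int i = ORDER; i < 2 * ORDER; i++) EXP[i] = EXP[i - ORDER];
    }

    static int mul(int a, int b) {
        if (a == 0 || b == 0) return 0;          // log(0) is undefined
        return EXP[LOG[a] + LOG[b]];
    }
}
```

Note that the question's inner loop has no zero check; presumably the linked Sarwate-style trick removes that branch as well, e.g. by giving zero a sentinel log value that points into a zero-padded region of the exp table.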
More specifically some potential areas of improvement that I need help on are:
- How can I make sure that Java can skip the bounds checking on the array operations?
- How can I retrieve the native code that actually executes once HotSpot is done optimizing?
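On the second point: what HotSpot executes after optimization is JIT-compiled native code rather than bytecode, and HotSpot can dump it. The flags below are standard HotSpot diagnostic options; MyEncoder is a hypothetical stand-in for whatever main class drives the benchmark, and -XX:+PrintAssembly additionally requires the hsdis disassembler plugin for your platform.

```shell
# Log which methods get JIT-compiled, and when:
java -XX:+PrintCompilation MyEncoder

# Dump the generated native code (requires the hsdis plugin):
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyEncoder
```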
Here's the core of the algorithm. It might be hard to understand out of context, but if you see any unnecessarily expensive operations I'm doing, please let me know!
int messageFragmentStart = 0;
int messageFragmentEnd = fragmentCharSize;
int coefficientIndex = fragmentID * messageFragmentsPerDataBlock;
final int resultArrayIndexStart = fragmentID * fragmentCharSize;
for (int messageFragmentIndex = 0; messageFragmentIndex < messageFragmentsPerDataBlock; messageFragmentIndex++) {
    final int coefficientLogValue = coefficientLogValues[coefficientIndex++];
    int resultArrayIndex = resultArrayIndexStart;
    for (int i = messageFragmentStart; i < messageFragmentEnd; i++) {
        final int logSum = coefficientLogValue + logOfDataToEncode[i];
        final int messageMultipliedByCoefficient = expTable[logSum];
        resultArray[resultArrayIndex++] ^= messageMultipliedByCoefficient;
    }
    messageFragmentStart += fragmentCharSize;
    messageFragmentEnd = Math.min(messageFragmentEnd + fragmentCharSize, maxTotalLength);
}
Comments (1)
You can't make Java forgo the bounds checking, since it's specified in the JLS. But in most cases the JIT is able to avoid it, as long as the bounds check is simple (e.g. i < array.length); if not, there's no way around it (well, I assume one could play with Unsafe?).
For your second problem there's this, which should serve the purpose just fine.
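To make the bounds-check elimination as likely as possible, the inner loop can be rewritten so that the arrays are local references and the index range is visible to the JIT. This is a hypothetical restructuring of the question's inner loop; the surrounding method and parameter passing are invented for illustration, and whether it actually helps should be measured.

```java
// Hypothetical restructuring of the question's inner loop. Passing the
// arrays as parameters (or copying fields into locals) matters because the
// JIT must assume an instance field could be reassigned concurrently,
// which blocks some bounds-check elimination; a local reference cannot change.
public class EncodeKernel {
    static void encodeFragment(int coefficientLogValue,
                               int[] logOfDataToEncode, int inBase,
                               int[] expTable,
                               int[] resultArray, int outBase,
                               int len) {
        for (int j = 0; j < len; j++) {
            // Same work as the original loop body, indexed relative to 0..len.
            resultArray[outBase + j] ^=
                    expTable[coefficientLogValue + logOfDataToEncode[inBase + j]];
        }
    }
}
```

The outer loop would then call encodeFragment once per message fragment, with inBase = messageFragmentStart and outBase = resultArrayIndexStart.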
But anyhow, from your code it seems this problem is trivial to vectorize, and sadly the JVM isn't very good at that (if it does it at all). Hence, implementing the same code in C/C++ using compiler intrinsics (you could even try the auto-vectorization of ICC/GCC) could lead to some quite noticeable speedups, assuming we're not completely memory-bound. So I'd implement it in C++ and call it through JNI, if only as a reference point.