Java: Micro-optimizing array operations


I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.

My current code looks as follows (error handling & initialization removed):

/**
 * Simple implementation of a feedforward neural network. The network supports
 * including a bias neuron with a constant output of 1.0 and weighted synapses
 * to hidden and output layers.
 * 
 * @author Martin Wiboe
 */
public class FeedForwardNetwork {
    private final int outputNeurons;    // No of neurons in output layer
    private final int inputNeurons;     // No of neurons in input layer
    private int largestLayerNeurons;    // No of neurons in largest layer
    private final int numberLayers;     // No of layers
    private final int[] neuronCounts;   // Neuron count in each layer, 0 is input layer.
    private final float[][][] fWeights; // Weights between neurons.
                                        // fWeights[fromLayer][fromNeuron][toNeuron]
                                        // is the weight from fromNeuron in
                                        // fromLayer to toNeuron in layer
                                        // fromLayer+1.
    private float[][] neuronOutput;     // Temporary storage of output from each layer

    public float[] compute(float[] input) {
        // Copy input values to input layer output
        for (int i = 0; i < inputNeurons; i++) {
            neuronOutput[0][i] = input[i];
        }

        // Loop through layers
        for (int layer = 1; layer < numberLayers; layer++) {

            // Loop over neurons in the layer and determine weighted input sum
            for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
                // Bias neuron is the last neuron in the previous layer
                int biasNeuron = neuronCounts[layer - 1];

                // Get weighted input from bias neuron - output is always 1.0
                float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];

                // Get weighted inputs from rest of neurons in previous layer
                for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
                    activation += neuronOutput[layer - 1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
                }

                // Store neuron output for next round of computation
                neuronOutput[layer][neuron] = sigmoid(activation);
            }
        }

        // Return output from network = output from last layer
        float[] result = new float[outputNeurons];
        for (int i = 0; i < outputNeurons; i++) {
            result[i] = neuronOutput[numberLayers - 1][i];
        }

        return result;
    }

    private static float sigmoid(final float input) {
        return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
    }
}

I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?

Thank you,

Martin Wiboe

Edit #1: After seeing the vast amount of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50,000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neurons, respectively. The innermost loop will therefore have about 2*191+3 = 385 iterations (when counting the added bias neuron in layers 0 and 1).

Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2%). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for-loop header.

8 Answers

初心未许 2024-09-11 10:41:28

Some tips.

  • in your innermost loop, think about how you are traversing your CPU cache and re-arrange your matrix so you are accessing the outermost array sequentially. This will result in you accessing your cache in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss.
    e.g. restructure fWeights so it is accessed as

activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];

  • don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer - 1] lookup every time when you can place this in a local variable. Your IDE should be able to refactor this easily (see the sketch after this list).

  • multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single dimensional arrays. You can restructure the code so you're only using a single dimensional array.

  • don't return a new array when you can pass the result array as an argument. (Saves creating a new object on each call).

  • rather than performing layer-1 all over the place, why not introduce a variable layer1 = layer-1 and use layer1+1 wherever layer appears.
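
Taken together, the first three tips turn the hot loop into something like the sketch below (this assumes fWeights has been transposed during initialization to fWeights[fromLayer][toNeuron][fromNeuron]; the local variable names are illustrative, not from the original code):

for (int layer = 1; layer < numberLayers; layer++) {
    final float[] prevOutput = neuronOutput[layer - 1];  // hoisted [layer - 1] lookups
    final float[][] layerWeights = fWeights[layer - 1];  // transposed: [toNeuron][fromNeuron]
    final float[] currOutput = neuronOutput[layer];
    final int biasNeuron = neuronCounts[layer - 1];      // bias is the last neuron in the previous layer
    for (int neuron = 0, count = neuronCounts[layer]; neuron < count; neuron++) {
        final float[] weights = layerWeights[neuron];    // contiguous row of weights into this neuron
        float activation = weights[biasNeuron];          // bias output is always 1.0
        for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
            activation += prevOutput[inputNeuron] * weights[inputNeuron];
        }
        currOutput[neuron] = sigmoid(activation);
    }
}

Flattening each weight matrix into a single float[] (the third tip) would go one step further; the two-level version above already keeps the innermost loop on one contiguous array.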

听,心雨的声音 2024-09-11 10:41:28

Disregarding the actual math, the array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as array of arrays. In your innermost loop, you access over multiple indices, some of which are in fact constant in that loop. Part of the array access can be move outside of the loop:

final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightsSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}

It is possible that the server JIT performs a similar invariant code motion; the only way to find out is to change it and profile. On the client JIT this should improve performance no matter what.
Another thing you can try is to precalculate the for-loop exit conditions, like this:

for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }

Again the JIT may already do this for you, so profile if it helps.

Is there a point to multiplying by 1.0F that eludes me here?:

float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];

Other things that could potentially improve speed at the cost of readability: inline the sigmoid() function manually (the JIT has a very tight limit for inlining, and the function might exceed it).
It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable (the innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats a + b + c is potentially not the same as a + c + b).
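For illustration, here is the innermost loop from the first snippet run backwards, with sigmoid() additionally inlined by hand (a sketch; as noted, the reversed summation order can flip low-order bits of the result):

for (int inputNeuron = biasNeuron - 1; inputNeuron >= 0; inputNeuron--) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}
// sigmoid(activation) inlined manually:
neuronOutput[layer][neuron] = (float) (1.0 / (1.0 + Math.exp(-activation)));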

方圜几里 2024-09-11 10:41:28

For a start, don't do this:

// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
    neuronOutput[0][i] = input[i];
}

But this:

System.arraycopy( input, 0, neuronOutput[0], 0, inputNeurons );
空宴 2024-09-11 10:41:28

The first thing I would look into is whether Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.
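
The usual candidate in posts of that kind is Schraudolph's bit-manipulation approximation of exp; a sketch of that technique (assuming this is what the linked post describes):

// Schraudolph-style approximation of Math.exp: constructs the IEEE-754 bit
// pattern of the result directly. Accurate only to a few percent, so profile
// and check the network's output quality before adopting it.
private static double fastExp(final double val) {
    final long tmp = (long) (1512775 * val + 1072632447);
    return Double.longBitsToDouble(tmp << 32);
}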

¢好甜 2024-09-11 10:41:28

Replace the expensive floating point sigmoid transfer function with an integer step transfer function.

The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.

The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.

But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.

Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).

The brain cannot do this, but backprop can!

This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.

Also supports the argument that Computer Science is inherently cool.
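
A minimal sketch of the suggested replacement, with a step function standing in for sigmoid():

// Step transfer function: -1 below zero, +1 otherwise. Far cheaper than the
// floating point sigmoid, per the argument above.
private static float step(final float input) {
    return input < 0.0F ? -1.0F : 1.0F;
}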

依 靠 2024-09-11 10:41:28

Purely based upon code inspection, your innermost loop has to compute references into a three-dimensional array, and it is being done a lot. Depending upon your array dimensions, you could be having cache issues due to having to jump around memory with each loop iteration. Maybe you could rearrange the dimensions so the inner loop tries to access memory elements that are closer to one another than they are now?

In any case, profile your code before making any changes and see where the real bottleneck is.

遮了一弯 2024-09-11 10:41:28

I suggest using a fixed-point system rather than a floating-point system. On almost all processors using int is faster than float. The simplest way to do this is to shift everything left by a certain amount (4 or 5 bits are good starting points) and treat those bottom bits as the fraction.

Your innermost loop is doing floating point maths so this may give you quite a boost.
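
For illustration, a minimal fixed-point sketch (illustrative names; it uses 16 fractional bits rather than the suggested 4 or 5, purely for headroom):

// Q16.16 fixed point: the low FRACTION_BITS bits hold the fraction.
static final int FRACTION_BITS = 16;

static int toFixed(float f) { return (int) (f * (1 << FRACTION_BITS)); }
static float toFloat(int q) { return (float) q / (1 << FRACTION_BITS); }

// Widen to long before multiplying so the intermediate cannot overflow.
static int mulFixed(int a, int b) { return (int) (((long) a * b) >> FRACTION_BITS); }

The weighted sum then becomes activation = activation + mulFixed(output[i], weight[i]), with a lookup table standing in for sigmoid().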

洒一地阳光 2024-09-11 10:41:28

The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():

long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;

I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.

Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.
