进行梯度下降时检查梯度

发布于 2024-11-18 09:20:38 字数 2433 浏览 6 评论 0原文

我正在尝试实现前馈反向传播自动编码器（使用梯度下降进行训练），并想验证我是否正确计算了梯度。这个教程建议一次计算每个参数的导数：grad_i(theta) = (J(theta_i+epsilon) - J(theta_i-epsilon)) / (2*epsilon)。我已经在 Matlab 中编写了一段示例代码来做到这一点，但运气不佳——从导数计算出的梯度与数值上找到的梯度之间的差异往往很大（>> 4 个有效数字）。

如果有人可以提供任何建议，我将非常感谢您的帮助（无论是在我计算梯度还是如何执行检查方面）。因为我极大地简化了代码以使其更具可读性，所以我没有包含偏差，并且不再绑定权重矩阵。

首先，我初始化变量：

numHidden = 200;
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
encoder = low + (high-low)*rand(numVisible, numHidden);
decoder = low + (high-low)*rand(numHidden, numVisible);

接下来，给定一些输入图像 x，进行前馈传播：

a = sigmoid(x*encoder);
z = sigmoid(a*decoder); % (reconstruction of x)

我使用的损失函数是标准 Σ(0.5*(z - x)^2) ）：

% first calculate the error by finding the derivative of sum(0.5*(z-x).^2), 
% which is (f(h)-x)*f'(h), where z = f(h), h = a*decoder, and 
% f = sigmoid(x). However, since the derivative of the sigmoid is 
% sigmoid*(1 - sigmoid), we get:
error_0 = (z - x).*z.*(1-z);

% The gradient \Delta w_{ji} = error_j*a_i
gDecoder = error_0'*a;

% not important, but included for completeness
% do back-propagation one layer down
error_1 = (error_0*encoder).*a.*(1-a);
gEncoder = error_1'*x;

最后，检查梯度是否正确（在本例中，只需为解码器执行此操作）：

epsilon = 10e-5;
check = gDecoder(:); % the values we obtained above
for i = 1:size(decoder(:), 1)
    % calculate J+
    theta = decoder(:); % unroll
    theta(i) = theta(i) + epsilon;
    decoderp = reshape(theta, size(decoder)); % re-roll
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jp = sum(0.5*(z - x).^2);

    % calculate J-
    theta = decoder(:);
    theta(i) = theta(i) - epsilon;
    decoderp = reshape(theta, size(decoder));
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jm = sum(0.5*(z - x).^2);

    grad_i = (Jp - Jm) / (2*epsilon);
    diff = abs(grad_i - check(i));
    fprintf('%d: %f <=> %f: %f\n', i, grad_i, check(i), diff);
end

在 MNIST 数据集（对于第一个条目）上运行此操作会得到如下结果：

2: 0.093885 <=> 0.028398: 0.065487
3: 0.066285 <=> 0.031096: 0.035189
5: 0.053074 <=> 0.019839: 0.033235
6: 0.108249 <=> 0.042407: 0.065843
7: 0.091576 <=> 0.009014: 0.082562

原文

I'm trying to implement a feed-forward backpropagating autoencoder (training with gradient descent) and wanted to verify that I'm calculating the gradient correctly. This tutorial suggests calculating the derivative of each parameter one at a time: grad_i(theta) = (J(theta_i+epsilon) - J(theta_i-epsilon)) / (2*epsilon). I've written a sample piece of code in Matlab to do just this, but without much luck -- the differences between the gradient calculated from the derivative and the gradient numerically found tend to be largish (>> 4 significant figures).

If anyone can offer any suggestions, I would greatly appreciate the help (either in my calculation of the gradient or how I perform the check). Because I've simplified the code greatly to make it more readable, I haven't included a biases, and am no longer tying the weight matrices.

First, I initialize the variables:

numHidden = 200;
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
encoder = low + (high-low)*rand(numVisible, numHidden);
decoder = low + (high-low)*rand(numHidden, numVisible);

Next, given some input image x, do feed-forward propagation:

a = sigmoid(x*encoder);
z = sigmoid(a*decoder); % (reconstruction of x)

The loss function I'm using is the standard Σ(0.5*(z - x)^2)):

% first calculate the error by finding the derivative of sum(0.5*(z-x).^2), 
% which is (f(h)-x)*f'(h), where z = f(h), h = a*decoder, and 
% f = sigmoid(x). However, since the derivative of the sigmoid is 
% sigmoid*(1 - sigmoid), we get:
error_0 = (z - x).*z.*(1-z);

% The gradient \Delta w_{ji} = error_j*a_i
gDecoder = error_0'*a;

% not important, but included for completeness
% do back-propagation one layer down
error_1 = (error_0*encoder).*a.*(1-a);
gEncoder = error_1'*x;

And finally, check that the gradient is correct (in this case, just do it for the decoder):

epsilon = 10e-5;
check = gDecoder(:); % the values we obtained above
for i = 1:size(decoder(:), 1)
    % calculate J+
    theta = decoder(:); % unroll
    theta(i) = theta(i) + epsilon;
    decoderp = reshape(theta, size(decoder)); % re-roll
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jp = sum(0.5*(z - x).^2);

    % calculate J-
    theta = decoder(:);
    theta(i) = theta(i) - epsilon;
    decoderp = reshape(theta, size(decoder));
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jm = sum(0.5*(z - x).^2);

    grad_i = (Jp - Jm) / (2*epsilon);
    diff = abs(grad_i - check(i));
    fprintf('%d: %f <=> %f: %f\n', i, grad_i, check(i), diff);
end

Running this on the MNIST dataset (for the first entry) gives results such as:

2: 0.093885 <=> 0.028398: 0.065487
3: 0.066285 <=> 0.031096: 0.035189
5: 0.053074 <=> 0.019839: 0.033235
6: 0.108249 <=> 0.042407: 0.065843
7: 0.091576 <=> 0.009014: 0.082562

分享到QQ

分享到微博