DerivativeCheck fails with minFunc

Posted on 2024-11-18 22:00:05

I'm trying to train a single layer of an autoencoder using minFunc, and while the cost function appears to decrease, when enabled, the DerivativeCheck fails. The code I'm using is as close to textbook values as possible, though extremely simplified.

The loss function I'm using is the squared-error:

$ J(W; x) = \frac{1}{2}||a^{l} - x||^2 $

with $a^{l}$ equal to $\sigma(W^{T}x)$, where $\sigma$ is the sigmoid function. The gradient should therefore be:

$ \delta = (a^{l} - x) \odot a^{l} \odot (1 - a^{l}) $

$ \nabla_{W} = \delta(a^{l-1})^T $

Note that, to simplify things, I've left off the bias altogether. While this will cause poor performance, it shouldn't affect the gradient check, as I'm only looking at the weight matrix. Additionally, I've tied the encoder and decoder matrices, so there is effectively a single weight matrix.

The code I'm using for the loss function is (edit: I've vectorized the loop I had and cleaned code up a little):

% loss function passed to minFunc
% (sigmoid(z) = 1./(1 + exp(-z)) is assumed to be defined elsewhere)
function [ loss, grad ] = calcLoss(theta, X, nHidden)
  [nInstances, nVars] = size(X);

  % theta arrives as a single vector, so roll it back into a weight matrix
  W = reshape(theta(1:nVars*nHidden), nVars, nHidden);
  Wp = W'; % tied decoder weights (nHidden x nVars)

  % encode each example (nInstances x nHidden)
  hidden = sigmoid(X*W);

  % decode each example (nInstances x nVars)
  output = sigmoid(hidden*Wp);

  % loss function: 0.5*sum((x - output).^2)
  % derivative of loss: -(x - output)*f'(o)
  % if f is sigmoid, then f'(o) = output.*(1-output)
  residual = X - output;                        % renamed to avoid shadowing built-in diff
  delta = -residual .* output .* (1 - output);  % renamed to avoid shadowing built-in error
  dW = delta'*hidden;                           % nVars x nHidden, same shape as W

  loss = 0.5*sum(residual(:).^2) ./ nInstances; % scalar

  % need to unroll gradient matrix back into a single vector
  grad = dW(:) ./ nInstances;
end

Below is the code I use to run the optimizer (for a single time, as the runtime is fairly long with all training samples):

examples = 5000;
fprintf('loading data..\n');
images = readMNIST('train-images-idx3-ubyte', examples) / 255.0;

data = images(:, :, 1:examples);

% each row is a different training sample
% (reshape fills column-major, so flatten each image into a column, then transpose)
X = reshape(data, 784, examples)';

% initialize weight matrix with random values
% W: (R^{784} -> R^{10}), W': (R^{10} -> R^{784})
numHidden = 10; % NOTE: this is extremely small to speed up DerivativeCheck
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
W = low + (high-low)*rand(numVisible, numHidden);

% run optimization
options = struct();
options.Display = 'iter';
options.GradObj = 'on';
options.MaxIter = 10;
options.MaxFunEvals = ceil(options.MaxIter * 2.5);
options.DerivativeCheck = 'on';
options.Method = 'lbfgs';
[ x, f, exitFlag, output] = minFunc(@calcLoss, W(:), options, X, numHidden);

The differences I get with DerivativeCheck on are generally less than 1, but greater than 0.1. I've tried similar code using batch gradient descent, and get slightly better results (some are < 0.0001, but certainly not all).

I'm not sure if I made either a mistake with my math or code. Any help would be greatly appreciated!

Update

I discovered a small typo in my code (which doesn't appear in the code above) causing exceptionally bad performance. Unfortunately, I'm still getting less-than-good results. For example, here is a comparison between the two gradients:

calculate     check
0.0379        0.0383
0.0413        0.0409
0.0339        0.0342
0.0281        0.0282
0.0322        0.0320

with differences of up to 0.0004 in these rows (roughly 1% relative error), which I'm assuming is still failing.
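Absolute differences are hard to judge without knowing the scale of the gradient, so a relative measure is often easier to read. The sketch below applies two common conventions to the five rows of the table above. The variable names and the choice of measures are mine, not minFunc's, and minFunc may report its check differently.

```matlab
% Values copied from the two columns of the comparison table.
calculated = [0.0379; 0.0413; 0.0339; 0.0281; 0.0322];
checked    = [0.0383; 0.0409; 0.0342; 0.0282; 0.0320];

% Element-wise absolute and relative differences.
absDiff = abs(calculated - checked);
relDiff = absDiff ./ max(abs(calculated), abs(checked));

% A single norm-based figure for the whole vector.
relNorm = norm(calculated - checked) / norm(calculated + checked);

fprintf('max abs diff: %.1e\n', max(absDiff));
fprintf('max rel diff: %.1e\n', max(relDiff));
fprintf('norm-based relative error: %.1e\n', relNorm);
```

For a correct gradient checked with central differences in double precision, the relative error is usually several orders of magnitude smaller than 1%, so numbers on this scale tend to point at a missing term rather than at round-off.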

Comments (1)

Okay, I think I might have solved the problem. Generally the differences in the gradients are < 1e-4, though I do have at least one which is 6e-4. Does anyone know if this is still acceptable?

To get this result, I rewrote the code without tying the weight matrices (I'm not sure if tying them will always cause the derivative check to fail). I've also included biases, as they didn't complicate things too badly.

Something else I realized when debugging is that it's really easy to make a mistake in the code. For example, it took me a while to catch:

grad_W1 = error_h*X';

instead of:

grad_W1  = X*error_h';

While the difference between these two lines is just the transpose of grad_W1, because of the requirement of packing/unpacking the parameters into a single vector, there's no way for Matlab to complain about grad_W1 being the wrong dimensions.
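One way to guard against this kind of silent transpose is to compare each gradient's shape with its parameter's shape before packing. This is a minimal sketch with toy sizes; `error_h` here is just random stand-in data and `sameShape` is a hypothetical helper, not part of the code in this post.

```matlab
nVisible = 6; nHidden = 3; nInstances = 4;   % toy sizes, for illustration only
W1      = randn(nVisible, nHidden);
X       = randn(nVisible, nInstances);
error_h = randn(nHidden, nInstances);        % stand-in for the hidden-layer delta

grad_ok  = X * error_h';     % nVisible x nHidden, same shape as W1
grad_bad = error_h * X';     % nHidden x nVisible, the transpose

% Both have the same number of elements, so (:) hides the mistake ...
assert(numel(grad_ok(:)) == numel(grad_bad(:)));

% ... but an explicit shape check catches it before packing.
sameShape = @(G, W) isequal(size(G), size(W));
assert(sameShape(grad_ok, W1));
assert(~sameShape(grad_bad, W1));
```

Placing a check like `assert(sameShape(grad_W1, W1))` just before the call to packParams turns the bug into an immediate error instead of a failed derivative check.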

I've also included my own derivative check, which gives slightly different answers than minFunc's (my derivative check gives differences that are all below 1e-4).

fwdprop.m:

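function [ hidden, output ] = fwdprop(W1, bias1, W2, bias2, X)
  % X holds one example per column (nVisible x nInstances)
  hidden = sigmoid(bsxfun(@plus, W1'*X, bias1));      % nHidden x nInstances
  output = sigmoid(bsxfun(@plus, W2'*hidden, bias2)); % nVisible x nInstances
end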

calcLoss.m:

function [ loss, grad ] = calcLoss(theta, X, nHidden)
  % X holds one example per column (nVars x nInstances)
  [nVars, nInstances] = size(X);
  [W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
  [hidden, output] = fwdprop(W1, bias1, W2, bias2, X);

  % backpropagate the squared error through both sigmoid layers
  err = output - X;
  delta_o = err .* output .* (1.0 - output);        % nVars x nInstances
  delta_h = W2*delta_o .* hidden .* (1.0 - hidden); % nHidden x nInstances

  % each gradient must have the same shape as its parameter
  grad_W1 = X*delta_h';          % nVars x nHidden
  grad_bias1 = sum(delta_h, 2);
  grad_W2 = hidden*delta_o';     % nHidden x nVars
  grad_bias2 = sum(delta_o, 2);

  loss = 0.5*sum(err(:).^2);
  grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
end

unpackParams.m:

function [ W1, bias1, W2, bias2 ] = unpackParams(params, nVisible, nHidden)
  mSize = nVisible*nHidden;

  W1 = reshape(params(1:mSize), nVisible, nHidden);
  offset = mSize;    

  bias1 = params(offset+1:offset+nHidden);
  offset = offset + nHidden;

  W2 = reshape(params(offset+1:offset+mSize), nHidden, nVisible);
  offset = offset + mSize;

  bias2 = params(offset+1:end);
end

packParams.m:

function [ params ] = packParams(W1, bias1, W2, bias2)
  % stack everything into one column vector, in the order unpackParams expects
  params = [W1(:); bias1(:); W2(:); bias2(:)];
end

checkDeriv.m:

function [check] = checkDeriv(X, theta, nHidden, epsilon)
  % analytic gradient, computed exactly as in calcLoss
  [~, grad] = calcLoss(theta, X, nHidden);

  % column 1: central-difference estimate, column 2: analytic gradient
  check = zeros(numel(theta), 2);
  for i = 1:numel(theta)
      Jplus = calcHalfDeriv(X, theta(:), i, nHidden, epsilon);
      Jminus = calcHalfDeriv(X, theta(:), i, nHidden, -epsilon);

      numGrad = (Jplus - Jminus)/(2*epsilon);
      check(i, :) = [numGrad grad(i)];
  end
end

calcHalfDeriv.m:

function [ loss ] = calcHalfDeriv(X, theta, i, nHidden, epsilon)
  theta(i) = theta(i) + epsilon;

  [nVisible, nInstances] = size(X);
  [W1, bias1, W2, bias2] = unpackParams(theta, nVisible, nHidden);
  [hidden, output] = fwdprop(W1, bias1, W2, bias2, X);

  err = output - X;
  loss = 0.5*sum(err(:).^2);
end
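
For reference, the quantity that checkDeriv compares against the analytic gradient is the standard central-difference estimate. Here $e_i$ denotes the $i$-th unit vector, notation I'm adding for this note:

```latex
\frac{\partial J}{\partial \theta_i}
  \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon},
\qquad \text{truncation error } O(\epsilon^2)
```

Because the truncation error shrinks with $\epsilon^2$ while round-off grows as $\epsilon$ gets very small, a moderate step is normally used. A value around 1e-4 is a common choice, though that is my assumption and not something stated in this post.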

Update

Okay, I've also figured out why tying the weights was causing issues. I wanted to go down to just [ W1; bias1; bias2 ] since W2 = W1'. This way I could simply recreate W2 by looking at W1. However, because the values of $\theta$ are changed by epsilon, this was in effect changing both matrices at the same time. The proper solution is to simply pass W1 as a separate parameter while at the same time reducing $\theta$.

Update 2

Okay, this is what I get for posting too late at night. While the first update does indeed cause things to pass correctly, it's not the correct solution.

I think the correct thing to do is to actually calculate the gradients for W1 and W2, and then set the final gradient of W1 to grad_W1 + grad_W2' (the transpose is needed because W2 = W1'). The hand-waving argument is that since the weight matrix is acting to both encode and decode, its weights must be affected by both gradients. I haven't thought through the actual theoretical ramifications of this yet, however.
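
The hand-waving argument can be made a little more precise with the chain rule. This is a sketch for a single example $x$ without biases; the symbols $h$, $\hat{x}$, $\delta_o$ and $\delta_h$ are my own notation, chosen to mirror delta_o and delta_h in the code above:

```latex
h = \sigma(W^{T} x), \qquad \hat{x} = \sigma(W h), \qquad
J(W) = \tfrac{1}{2}\,\lVert \hat{x} - x \rVert^{2}

\delta_o = (\hat{x} - x) \odot \hat{x} \odot (1 - \hat{x}), \qquad
\delta_h = (W^{T} \delta_o) \odot h \odot (1 - h)

\nabla_{W} J
  = \underbrace{x\,\delta_h^{T}}_{\text{encoder path}}
  + \underbrace{\delta_o\,h^{T}}_{\text{decoder path}}
```

Since $W$ appears twice in $J$, its total derivative is the sum of the derivatives through each appearance. In the code's layout the first term is grad_W1 and the second is the transpose of grad_W2.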

If I run this using my own derivative check, it passes the 10e-4 threshold. It does much better than before with minFunc's derivative check, though still worse than if I don't tie the weights.
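
To make the tied-weight case concrete, here is a small self-contained sketch. It uses no biases, one example per column, and toy deterministic data, all of which are my simplifications. It compares a central-difference gradient against the summed gradient and against the decoder-only gradient.

```matlab
nVisible = 5; nHidden = 3; nInstances = 4;   % toy sizes, for illustration only

% Deterministic toy data so the run is repeatable.
X = 0.5 + 0.4 * sin(reshape(1:nVisible*nInstances, nVisible, nInstances));
W = 0.5 * cos(reshape(1:nVisible*nHidden, nVisible, nHidden));

sigmoid = @(z) 1 ./ (1 + exp(-z));
unroll  = @(theta) reshape(theta, nVisible, nHidden);
decode  = @(W) sigmoid(W * sigmoid(W' * X));   % the same W encodes and decodes
lossFn  = @(theta) 0.5 * sum(sum((decode(unroll(theta)) - X).^2));

% Analytic gradient: backpropagate, then add both contributions.
hidden  = sigmoid(W' * X);
output  = sigmoid(W * hidden);
delta_o = (output - X) .* output .* (1 - output);
delta_h = (W' * delta_o) .* hidden .* (1 - hidden);
grad_W1 = X * delta_h';           % encoder path, nVisible x nHidden
grad_W2 = hidden * delta_o';      % decoder path, nHidden x nVisible
grad_tied = grad_W1 + grad_W2';   % total gradient with respect to the shared W

% Numerical gradient by central differences.
theta = W(:);
epsilon = 1e-5;
numGrad = zeros(size(theta));
for i = 1:numel(theta)
    step = zeros(size(theta));
    step(i) = epsilon;
    numGrad(i) = (lossFn(theta + step) - lossFn(theta - step)) / (2 * epsilon);
end

errTied        = max(abs(numGrad - grad_tied(:)));
errDecoderOnly = max(abs(numGrad - reshape(grad_W2', [], 1)));
fprintf('max error, summed gradient:       %.2e\n', errTied);
fprintf('max error, decoder-only gradient: %.2e\n', errDecoderOnly);
```

If the reasoning in Update 2 is right, the first error should sit near the noise floor of the finite-difference estimate, while the second should be far larger because it leaves out the encoder path.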
