Tuesday, March 8, 2016

How to devise an auto-encoder

I think the autoencoder is an important type of MLP: on one hand it is, to some extent, an extension of the traditional MLP-based classification approach; on the other hand it is the key to understanding some ingenious training ideas that carry a lot of weight in applications of deep learning.

In this post I will walk through building a very simple autoencoder from an implementation perspective. I should state up front that I do not intend to break Ng's rule of not sharing solution code, especially since most of the content of this post is borrowed from his famous website http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial. The problem is that, even though I honestly write the code myself, some time later I often have no idea what my original thinking was. So I write this down mainly for my own reference. The point is to become familiar with it, not to copy it directly.

The implementation is a Matlab script; essentially, all the work needs to be done in sparseAutoencoderCost.m. I believe I have implemented the autoencoder correctly, though with no guarantee:
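For reference, the objective the code below is meant to compute, following the notation of the UFLDL notes, is the average reconstruction error plus a weight-decay term plus a sparsity penalty on the hidden units:

\[
J_{\text{sparse}}(W,b) = \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\bigl\|h_{W,b}(x^{(i)}) - x^{(i)}\bigr\|^2
+ \frac{\lambda}{2}\Bigl(\|W^{(1)}\|_F^2 + \|W^{(2)}\|_F^2\Bigr)
+ \beta \sum_{j=1}^{\text{hiddenSize}} \mathrm{KL}\bigl(\rho \,\|\, \hat\rho_j\bigr),
\qquad
\mathrm{KL}(\rho \,\|\, \hat\rho_j) = \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j},
\]

where \(\rho\) is sparsityParam, \(\hat\rho_j\) is the average activation of hidden unit \(j\) over the training set, \(\lambda\) is lambda and \(\beta\) is beta.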

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example.
 
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%

% Since this is a single fixed architecture, a loop over layers is avoided here.
% First we calculate the activations of the hidden layer.

% W1 has dimensions hiddenSize x visibleSize, so the hidden pre-activations are W1*x.
% Since the bias term also needs to be taken into account, we augment the weight
% matrix with the bias column so everything fits into one matrix product.

% Keep in mind the intention is to carry out the computation in vectorized form as far as possible.

ss = size(data, 2);

% To account for the bias term, we augment the weights and inputs
% of the input layer
W1_e = [b1 W1];
data_e = [ones(1, ss); data];

% Calculation for the hidden layer
u1 = W1_e * data_e;
x2 = sigmoid(u1);


% The same manipulation applies to the hidden layer
W2_e = [b2 W2];
x2_e = [ones(1, ss); x2];


% Calculation for the output layer
u2 = W2_e * x2_e;
output = sigmoid(u2);

% Up to now, the forward propagation is completed

% Now it's time to build the cost function

% First, the average reconstruction cost (without weight decay) is calculated
diff_output = output - data;

cost = mean(sum(diff_output .^ 2)) / 2;

% Take the weight decay into consideration
cost = cost + lambda * (sum(sum(W1 .^ 2)) + sum(sum(W2 .^ 2))) / 2;

% The average activity of the hidden layer can be computed as follows
rho = mean(x2, 2);

sparsity_cost = sparsityParam * log(sparsityParam ./ rho) + ...
    (1 - sparsityParam) * log((1 - sparsityParam) ./ (1 - rho));

cost = cost + beta * sum(sparsity_cost);

% Now we compute the gradients via back-propagation.
% The computation is done not element by element but in a (mostly)
% vectorized, column-wise way
delta_output = (output - data) .* output .* (1 - output);

b2grad = mean(delta_output, 2);

for i = 1 : ss
    W2grad = W2grad + delta_output(:, i) * x2(:, i)';
end
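% (As a side note, this accumulation loop is equivalent to the single
%  vectorized statement W2grad = delta_output * x2'; the loop is kept for clarity.)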

W2grad = W2grad / ss + lambda * W2;

delta_sparsity = -1 * sparsityParam ./ rho + ...
    (1 - sparsityParam) ./ (1 - rho);

delta_hidden = W2' * delta_output + repmat(beta * delta_sparsity, 1, ss);

delta_hidden = delta_hidden .* x2 .* (1 - x2);

b1grad = mean(delta_hidden, 2);

for i = 1 : ss
    W1grad = W1grad + delta_hidden(:, i) * data(:, i)';
end

W1grad = W1grad / ss + lambda * W1;


%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)
  
    sigm = 1 ./ (1 + exp(-x));
end
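
Before trusting the analytic gradients, it is worth checking them numerically. Below is a minimal sketch of such a check (my own throwaway snippet, not part of the starter code); it perturbs each parameter by a small epsilon and compares the centered finite-difference estimate against the gradient returned by sparseAutoencoderCost:

% Quick numerical gradient check on a tiny random problem (illustrative only)
visibleSize = 8; hiddenSize = 5;
lambda = 1e-4; sparsityParam = 0.01; beta = 3;
data  = rand(visibleSize, 10);                  % tiny random "training set"
theta = 0.005 * randn(2*hiddenSize*visibleSize + hiddenSize + visibleSize, 1);

[~, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                  lambda, sparsityParam, beta, data);

epsilon = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta)); e(i) = epsilon;
    cp = sparseAutoencoderCost(theta + e, visibleSize, hiddenSize, ...
                               lambda, sparsityParam, beta, data);
    cm = sparseAutoencoderCost(theta - e, visibleSize, hiddenSize, ...
                               lambda, sparsityParam, beta, data);
    numgrad(i) = (cp - cm) / (2 * epsilon);
end

% This relative difference should be tiny (on the order of 1e-9) if the gradients are right
disp(norm(numgrad - grad) / norm(numgrad + grad));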


The hardest part is arguably the implementation of the autoencoder itself; once we have one, some interesting results can be observed. The two figures further below show, first, some of the samples to be learned from and, second, the learned weights displayed as small square images.
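For completeness, here is a rough sketch of how I would drive this cost function to obtain such results, assuming the remaining files from the UFLDL starter code (sampleIMAGES.m, initializeParameters.m, display_network.m and the minFunc package) are on the path; the parameter values follow the exercise:

visibleSize = 8*8;          % 8x8 image patches
hiddenSize  = 25;
sparsityParam = 0.01;
lambda = 0.0001;
beta = 3;

patches = sampleIMAGES();                              % 64 x 10000 training patches
theta   = initializeParameters(hiddenSize, visibleSize);

options.Method  = 'lbfgs';
options.maxIter = 400;
options.display = 'on';

[opttheta, cost] = minFunc(@(p) sparseAutoencoderCost(p, visibleSize, hiddenSize, ...
                                lambda, sparsityParam, beta, patches), theta, options);

% Visualize the learned features (rows of W1) as image tiles
W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
display_network(W1');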



Happy exploring, happy coding.
