ANN_FF — Algorithms for feedforward nets.
The purpose is to provide engines for feedforward ANN exploration, testing and rapid prototyping.
Some flexibility is provided (e.g. the possibility to change the activation or error functions).
(a) The network is visualized as follows: inputs at the left and data (signals) propagating to the right.
(b) N is a row vector containing the number of neurons per layer, input included.
(c) The first layer is the input layer. It does not process data, but counting it as a layer makes the implementation clearer.
Layer no.: 1 (input), 2 (first hidden), ..., size(N,'c') (output)
[Diagram: inputs enter at the left, each layer is fully connected to the next one, and outputs leave at the right.]
Note that connections do not jump over layers; they exist only between adjacent layers, which are fully interconnected.
(d) The dimension of N is size(N,'c'), so:
- the input layer has N(1) neurons;
- the first hidden layer has N(2) neurons, ...;
- the output layer L has N(size(N,'c')) neurons.
(e) The input vector/matrix is x, each pattern is represented by one column.
Only constant size input patterns are accepted.
NOTE: Internally, the patterns are processed one at a time as column vectors, i.e. each pattern vector is a column of the form x(:,p), p being the pattern order number.
(f) Each neuron on the first hidden layer has N(1) inputs, ...; in general, for each layer l in [2, ..., size(N,'c')], each neuron has N(l-1) inputs from the previous layer plus one simulating the bias (where applicable; most algorithms assume the existence of a bias).
(g) The network is fully connected, but a connection can be canceled by zeroing the corresponding weight.
Note that a training procedure may turn it back to a non-zero value; this is one reason why some "hooks" are provided (see the "ex" parameter below).
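As an illustration of points (b), (d) and (f), the following Python sketch derives the number of inputs per neuron and the number of weights per layer from N. It is not part of the toolbox (which is written in Scilab); the names inputs_per_neuron, weights_per_layer and with_bias are hypothetical, and Python lists are 0-based, so N[0] corresponds to N(1).

```python
def inputs_per_neuron(N, with_bias=True):
    """Return, for each layer l = 2 .. len(N), the inputs seen by one neuron.

    N lists the neurons per layer, input layer included (N[0] is N(1) in
    Scilab notation). A bias counts as one extra input with constant value 1.
    """
    extra = 1 if with_bias else 0
    return [N[l - 1] + extra for l in range(1, len(N))]


def weights_per_layer(N, with_bias=True):
    """Return the number of weights feeding each non-input layer."""
    fan_in = inputs_per_neuron(N, with_bias)
    return [N[l] * fan_in[l - 1] for l in range(1, len(N))]


N = [3, 4, 2]  # 3 inputs, 4 hidden neurons, 2 outputs
print(inputs_per_neuron(N))   # [4, 5]
print(weights_per_layer(N))   # [16, 10]
```

The input layer contributes no weights of its own, which is why both lists have len(N) - 1 entries.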
This subsection describes the parameters taken by the various functions defined within the toolbox; not all functions require all parameters.
In alphabetical order:
af: gives the activation function and, if required, its (total) derivative. It is either:
- a string giving the name of the activation function, or
- a two-element row vector of strings, where af(1) is the name of the activation function and af(2) is the name of its derivative.
NOTE: Given an activation function y = f(x), the derivative has to be expressed in terms of y, not x.
E.g. given the logistic activation function:

                 1
        y = -----------
            1 + exp(-x)

the derivative will be expressed as:

        dy
        -- = y (1 - y)
        dx

This form reduces the memory usage and, in the particular case of the logistic activation function, increases speed as well.
This parameter is optional, default value is either "ann_log_activ" or ["ann_log_activ","ann_d_log_activ"] (depending on the function using it), i.e. the activation function, described above, and its derivative (the default functions are already defined in this toolbox).
WARNING: Be very careful when defining a new activation function.
These functions accept a matrix whose columns are patterns, each element representing the total input to one neuron, and return a matrix of the same size representing the activation of the whole layer. E.g. the logistic is defined as:
y = 1 ./ (1 + %e^(-x))
(note the space between 1 and ./). The derivative of the activation function is defined similarly:
z = y .* (1 - y)
(note the ".*" operator).
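The convention of writing the derivative in terms of y can be checked numerically. The Python sketch below is only an illustration, not the toolbox's Scilab code: log_activ and d_log_activ are hypothetical stand-ins for "ann_log_activ" and "ann_d_log_activ", and matrices are stored as lists of rows with one pattern per column.

```python
import math


def log_activ(x):
    """Logistic activation applied element by element to a matrix (list of rows)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in x]


def d_log_activ(y):
    """Derivative of the logistic, written in terms of the output y, not x."""
    return [[v * (1.0 - v) for v in row] for row in y]


# Two patterns (columns) for a layer of two neurons (rows).
x = [[0.0, 1.0],
     [-2.0, 0.5]]
y = log_activ(x)
dy = d_log_activ(y)

# Compare with a central finite difference taken in x.
h = 1e-6
for i, row in enumerate(x):
    for j, v in enumerate(row):
        numeric = (log_activ([[v + h]])[0][0] - log_activ([[v - h]])[0][0]) / (2 * h)
        assert abs(numeric - dy[i][j]) < 1e-8
```

Because the derivative only needs y, a training routine can keep the layer outputs and discard the total inputs, which is where the memory saving mentioned above comes from.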
Delta_W_old: the quantity by which W was changed on the previous training pattern.
dW, dW2: the size of the variation applied to each element of W when calculating the error derivatives through a finite difference approach (see ann_FF_grad and ann_FF_Hess for more information).
ef: the error function. This parameter is optional; the default value is "ann_sum_of_sqr", i.e. the sum-of-squares (already defined within this toolbox).
err_deriv_y: the error derivative with respect to the network outputs. It returns a matrix in which each column contains the error derivative for the corresponding pattern. This parameter is optional; the default value is "ann_d_sum_of_sqr" (already defined within this toolbox), i.e. the derivative of the sum-of-squares error function.
ex: a Scilab program sequence, executed after the weight hypermatrix has been updated for each training pattern. Its main purpose is to provide hooks so that the learning function can be changed without having to rewrite it. Typical uses are checking a stop criterion and pruning. This parameter is optional; the default value is [" "] or [" "," "] (some functions have two hooks), i.e. an empty string, which does nothing.
l: the range of layers between which the network is run. It is a two-component row vector:
- l(1) is the layer into which a pattern is injected, presented as if it came from the previous layer, l(1)-1;
- l(2) is the layer from which the outputs are collected.
E.g.:
- l = [3,3] means the input is injected into the neurons of layer 3 and their outputs (l(2)=3) are collected to give the result;
- l = [2,3] means the input is injected into the first hidden layer (exactly as if it came from the input layer) and the output is collected from the neurons of layer 3.
This parameter is optional; the default value is [2,size(N,'c')] (whole network).
WARNING: l(1) = 1 does not make sense as it represents the input layer.
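The meaning of the range l can be illustrated with a small Python sketch. It is not the toolbox's ann_FF_run: the function run, the per-layer list W (one row per neuron, bias first) and the fixed logistic activation are assumptions made for this example.

```python
import math


def logistic(v):
    """Logistic activation of a single total input."""
    return 1.0 / (1.0 + math.exp(-v))


def run(z, W, l):
    """Propagate one pattern z from layer l[0] to layer l[1] (1-based numbers).

    W[k] holds the weights into layer k + 2, one row per neuron, bias first.
    z is presented as if it came from layer l[0] - 1.
    """
    for layer in range(l[0], l[1] + 1):
        rows = W[layer - 2]
        ext = [1.0] + list(z)  # bias simulated as a constant input of 1
        z = [logistic(sum(w * v for w, v in zip(row, ext))) for row in rows]
    return z


# Network N = [2, 2, 1]: W[0] feeds layer 2, W[1] feeds layer 3.
W = [
    [[0.0, 1.0, -1.0], [0.5, 0.0, 0.0]],
    [[0.0, 2.0, -2.0]],
]
x = [1.0, 1.0]
hidden = run(x, W, [2, 2])     # outputs of the first hidden layer only
full = run(x, W, [2, 3])       # whole network, the default range
tail = run(hidden, W, [3, 3])  # inject at layer 3, collect at layer 3
assert full == tail
```

Running [2,2] and then [3,3] gives the same result as running [2,3] in one go, which is what makes partial runs useful for inspecting intermediate layers.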
lp: the learning parameters, given as a row vector [lp(1), lp(2), ...].
The actual significance of each component varies with the algorithm; see the respective man pages for its meaning and typical values.
N: row vector defining the network, i.e. the number of neurons per layer. N(l) is the number of neurons on layer l. E.g. N(1) is the size of the input vector and N(size(N,'c')) is the size of the output vector.
r: the range of random numbers from which the connection weights (not the biases) are initialized. It is a two-component row vector:
- r(1) gives the lower limit;
- r(2) gives the upper limit.
This parameter is optional; the default value is [-1,1].
rb: the range of random numbers from which the biases (not the other weights) are initialized. It is a two-component row vector:
- rb(1) gives the lower limit;
- rb(2) gives the upper limit.
This parameter is optional; the default value is [0,0], i.e. biases are initialized with 0.
t: matrix of targets, one pattern per column. E.g. t(:,p) represents pattern no. p.
x: matrix of inputs, one pattern per column. E.g. x(:,p) represents pattern no. p.
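The roles of r and rb during initialization can be sketched in Python as follows. This is not ann_FF_init: the function init_weights, the seed argument and the use of a uniform distribution are assumptions of this example, and the weights are stored as one list of rows per layer rather than as a hypermatrix.

```python
import random


def init_weights(N, r=(-1.0, 1.0), rb=(0.0, 0.0), seed=None):
    """Build one weight matrix per non-input layer, bias in the first column.

    Connection weights are drawn uniformly from [r[0], r[1]] and biases from
    [rb[0], rb[1]]; the uniform distribution is an assumption of this sketch.
    """
    rng = random.Random(seed)
    W = []
    for l in range(1, len(N)):
        layer = []
        for _ in range(N[l]):
            bias = rng.uniform(rb[0], rb[1])
            weights = [rng.uniform(r[0], r[1]) for _ in range(N[l - 1])]
            layer.append([bias] + weights)
        W.append(layer)
    return W


W = init_weights([3, 4, 2], seed=0)
print(len(W), len(W[0]), len(W[0][0]))  # 2 4 4
```

With the default rb = (0, 0) every bias starts at exactly zero, while the other weights are spread over [-1, 1].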
The function names are built as follows:
- ann: prefix for all function names within this toolbox.
- _FF: prefix for all function names designed for feedforward nets.
- the following part of the name defines the type of algorithm: online uses one pattern at a time, batch uses all patterns at once.
- _nb: postfix for all function names within this toolbox designed for networks without bias.
The functions are:
- ann_FF_init: Build and initialize the weight hypermatrix.
- ann_FF_Std: Standard (vanilla, delta rule) backpropagation algorithm.
- ann_FF_Mom: Backpropagation with momentum.
- ann_FF_run: Runs the network.
- ann_FF_grad: Calculate the error gradient through a finite difference approach. It is provided for testing purposes only.
- ann_FF_Jacobian: Calculate the Jacobian through a finite difference approach. It is provided for testing purposes only.
- ann_FF_Jacobian_BP: Calculate the Jacobian through a backpropagation algorithm.
- ann_FF_Hess: Calculate the Hessian through a finite difference approach. It is provided for testing purposes only.
- ann_FF_VHess: Calculate the multiplication between a vector and the Hessian through an efficient finite difference approach.
- ann_FF_ConjugGrad: Conjugate gradients algorithm.
- ann_FF_SSAB: Backpropagation with SuperSAB algorithm.
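Several of the functions above rely on finite differences for testing. The general idea can be sketched in Python as follows. This is not the code of ann_FF_grad: the function numeric_gradient, the central difference scheme, the toy single-neuron error and the factor 0.5 in the sum of squares are all assumptions of this example; dw plays the role of the dW parameter.

```python
def numeric_gradient(f, w, dw=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += dw
        down[i] -= dw
        grad.append((f(up) - f(down)) / (2.0 * dw))
    return grad


# Toy error: squared difference between a linear neuron's output and its target.
x, t = [1.0, 2.0], 1.0


def error(w):
    y = w[0] + w[1] * x[0] + w[2] * x[1]  # w[0] plays the role of the bias
    return 0.5 * (y - t) ** 2


w = [0.1, -0.2, 0.3]
y = w[0] + w[1] * x[0] + w[2] * x[1]
analytic = [(y - t), (y - t) * x[0], (y - t) * x[1]]
numeric = numeric_gradient(error, w)
assert all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric))
```

Each gradient component costs two evaluations of the error, which is why such an approach is suitable for checking an analytic gradient but not for training.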
- Do not use the no-bias networks unless you know what you are doing.
- The most efficient (by far) algorithm is the "Conjugate Gradient"; however, it may require bootstrapping with another algorithm (see the examples).
- Reduce the number of loops and the number of function calls as much as possible; use the matrix manipulation capabilities of Scilab instead.
- You can shuffle the training patterns between two calls to the training procedure; use the "ex" hooks provided.
- Be very careful when defining new activation and error functions, and test them to make sure they do what they are supposed to do.
- Do not use sparse matrices unless they are really sparse (< 5%).
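The shuffling tip above requires that inputs and targets stay paired. A minimal Python sketch of the idea follows. It is not toolbox code and does not show how to wire the operation into an "ex" hook; the function shuffle_patterns and its seed argument are hypothetical, and matrices are stored as lists of rows with one pattern per column.

```python
import random


def shuffle_patterns(x, t, seed=None):
    """Apply one random permutation to the columns of both x and t."""
    P = len(x[0])  # number of patterns (columns)
    order = list(range(P))
    random.Random(seed).shuffle(order)
    x_new = [[row[p] for p in order] for row in x]
    t_new = [[row[p] for p in order] for row in t]
    return x_new, t_new, order


x = [[1, 2, 3],
     [10, 20, 30]]
t = [[100, 200, 300]]
xs, ts, order = shuffle_patterns(x, t, seed=1)
# Column p of xs still matches column p of ts.
assert all(ts[0][p] == 100 * xs[0][p] for p in range(3))
```

Using a single permutation for both matrices is the important part; shuffling x and t independently would silently destroy the training set.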
- Each layer has an associated matrix of weights, stored as a slice of the hypermatrix W.
NOTE: Most algorithms assume the existence of a bias by default. For each layer l, except l=1, the weight matrix associated with the connections from l-1 to l is:
W(1:N(l),1:N(l-1),l-1) for networks without biases;
W(1:N(l),1:N(l-1)+1,l-1) for networks with biases, where the biases are stored in the first column: W(1:N(l),1,l-1).
The total input to a layer l is:
W(1:N(l),1:N(l-1),l-1)*z(1:N(l-1)) for networks without biases;
W(1:N(l),1:N(l-1)+1,l-1)*[1; z(1:N(l-1))] for networks with biases,
where z(1:N(l-1)) is the output of the previous layer (a column vector), i.e. the bias is simulated as neuron no. 0 on each layer, with constant output 1.
W is initialized to hypermat(max(N),max(N),size(N,'c')-1) for networks without biases and to hypermat(max(N),max(N)+1,size(N,'c')-1) for networks with biases; the unused entries of W are initialized to zero and left untouched.
- Pattern vectors are passed as columns in a matrix representing a set (of patterns).
- No sanity checks are performed, as this would greatly hurt speed. It is assumed that (at least to some extent) you know what you are doing ;-) You can implement them yourself if you wish.
The following conditions have to be met:
+ targets have to have the same size as the output layer, i.e. size(target,'r') = N(size(N,'c'))
+ inputs have to have the same size as the input layer, i.e. size(input,'r') = N(1)
+ all N(i) have to be positive integers of course (am I paranoid here ? :-)
+ the lp parameter is a row vector of numbers (the actual dimension depends on the algorithm used).
+ af is a row vector of string(s) defining a valid activation function (and its derivative) as appropriate for the algorithm used.
+ ex is a string or a two-element row vector of strings, representing a valid set of Scilab instructions.
+ l(1) <= l(2) (see the definition of l above) and l(1) >= 2
+ r(1) <= r(2) and rb(1) <= rb(2) (see the definitions of r and rb above); if not, the program may run but you may not get what you expect when initializing the weights.
Warning: In some particular cases this may lead to very subtle errors (e.g. your program may run without generating any Scilab errors but the results may be meaningless).
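For readers who want such checks, the conditions above could be tested along the following lines. This Python sketch is only an illustration of the logic, not toolbox code: the function check_args is hypothetical, it covers only the size and range conditions (not lp, af or ex), and the upper bound l(2) <= size(N,'c') is an addition of this sketch that the list above does not state.

```python
def check_args(N, x, t, l=None, r=(-1.0, 1.0), rb=(0.0, 0.0)):
    """Raise ValueError if the documented preconditions are violated.

    x and t are matrices stored as lists of rows, one pattern per column.
    """
    if not all(isinstance(n, int) and n > 0 for n in N):
        raise ValueError("all N(i) must be positive integers")
    if len(x) != N[0]:
        raise ValueError("inputs must have N(1) rows")
    if len(t) != N[-1]:
        raise ValueError("targets must have N(size(N,'c')) rows")
    if l is None:
        l = (2, len(N))  # default range: whole network
    # The upper bound on l(2) is an assumption added by this sketch.
    if not (2 <= l[0] <= l[1] <= len(N)):
        raise ValueError("need 2 <= l(1) <= l(2) <= size(N,'c')")
    if r[0] > r[1] or rb[0] > rb[1]:
        raise ValueError("need r(1) <= r(2) and rb(1) <= rb(2)")
    return True


# Two patterns for a network with 2 inputs, 3 hidden neurons and 1 output.
print(check_args([2, 3, 1], [[0, 1], [1, 0]], [[1, 0]]))  # True
```

Running such a check once before training, rather than inside the training loop, keeps the speed cost negligible.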
- The algorithms themselves are not described here; there are many books which describe them (e.g. get mine, "Matrix ANN", wherever you may find it ;-).
- Hypermatrices are slow; however, there is no other reasonable way of doing things. Tests performed by myself show that using embedded matrices may increase speed, but the manipulation of submatrices "by hand" is very tedious and error prone. Of course you may rewrite the algorithms yourself using embedded matrices if you want to. If you really need speed then go directly to C++ or whatever.
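The weight layout described earlier (one slice of W per layer, biases in the first column, unused entries left at zero) can be pictured with nested Python lists. This is a sketch for illustration only, not the toolbox's Scilab hypermatrix code: make_padded and total_input are hypothetical names, and indices are shifted because Python is 0-based.

```python
def make_padded(N):
    """Zero-filled storage of size (len(N) - 1) x max(N) x (max(N) + 1)."""
    m = max(N)
    return [[[0.0] * (m + 1) for _ in range(m)] for _ in range(len(N) - 1)]


def total_input(W, N, l, z):
    """Total input to layer l (1-based, l >= 2) for previous-layer output z.

    Only the block N(l) x (N(l-1) + 1) of slice l - 1 is used; column 0
    holds the biases, matching the constant 1 prepended to z.
    """
    ext = [1.0] + list(z)
    block = W[l - 2]
    return [sum(block[i][j] * ext[j] for j in range(N[l - 2] + 1))
            for i in range(N[l - 1])]


N = [2, 3, 1]
W = make_padded(N)
W[0][0][:3] = [0.5, 1.0, -1.0]  # neuron 1 of layer 2: bias, then two weights
print(total_input(W, N, 2, [2.0, 1.0]))  # [1.5, 0.0, 0.0]
```

The padding wastes some memory when layer sizes differ a lot, but it lets every layer be addressed through the same three indices.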