Perceptron Learning and Adaline Learning
(adapted from Fundamentals of Neural Networks by L. Fausett)

Perceptron Learning
Things to note:
1) Perceptron learning can work for binary (0,1) as well as real-valued input. For discrete input, however, bipolar values (-1,1) are preferred: with binary coding, no learning occurs when the desired output is 0, and a 0 cannot be distinguished from missing data.
2) Perceptron learning does converge to a (non-unique) solution if the data are linearly separable.
We will assume all input and output is bipolar (-1,1). Training vectors will be denoted by S:T, that is, T is the desired output for vector S = (S0, S1, S2, ..., Sn). The bias b will be clamped to an input of 1 and will be assumed to be weight W0.

Initialize weights (you can make them all zero) and set the learning rate to 1 (a = 1).
while not terminating condition
{
    for each training vector
    {
        Xi = Si for i = 0,1,...,n ;      /* set activations of input units */
        Y_input = sum( Xi * Wi ) ;
        Y =  1 if Y_input > 0
          = -1 if Y_input < 0
        /* if Y_input == 0 we can't determine anything, so adjust weights */

        /* update weights if required */
        if Y <> T        /* predicted output doesn't equal desired output */
        {
            Wi(new) = Wi(old) + a*T*Xi ;
        }
    } /* end for */
} /* end while */
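The pseudocode above can be sketched in Python. This is a minimal illustration only (the function name and defaults are mine, not Fausett's), assuming bipolar training pairs with the bias input clamped to 1 as element 0 of each vector:

```python
# Minimal perceptron training sketch (hypothetical helper, not from the
# text). Each sample is (S, T) with S = (1, S1, ..., Sn) and T in {-1, 1};
# W[0] plays the role of the bias b.
def perceptron_train(samples, n_weights, a=1, max_epochs=100):
    W = [0] * n_weights                  # initialize all weights to zero
    for _ in range(max_epochs):
        changed = False
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))
            # threshold activation; y_input == 0 forces an update too
            Y = 1 if y_input > 0 else -1 if y_input < 0 else 0
            if Y != T:                   # update only on a miss
                W = [w + a * T * x for w, x in zip(W, S)]
                changed = True
        if not changed:                  # terminating condition: a whole
            return W                     # epoch with no weight changes
    return W
```

For example, training on the bipolar logical AND patterns used at the end of these notes yields W = (-1, 1, 1).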
Adaline Learning
Things to note:
Adaline learning differs from perceptron learning in several respects:
- The activation function is the identity function, as opposed to a threshold function (as in the perceptron).
- Adaline learning is based on gradient descent methods: the gradient is used to minimize the squared error. This learning rule goes by several names, including LMS (least mean squares) and Widrow-Hoff.
- Adaline learning can easily be turned into discrete classification learning (just like a perceptron). First train the net like any Adaline net, then define a threshold function that outputs 1 if the activation is >= 0, and -1 otherwise.
- Selection of the learning rate (a) is much more important than it is for perceptrons. If a is too large, the minimum of the error surface may be missed; if it is too small, learning can take a long time. The following rule of thumb has been suggested: if N is the number of training vectors, then 0.1 <= N*a <= 1.0 should be satisfied. My own experience disagrees with this estimate; I prefer much smaller values of a.
- Adaline learning will converge even for datasets that are not linearly separable, since it minimizes the squared error rather than seeking a perfect separation. The algorithm below is sometimes referred to as iterative or stochastic gradient descent. It differs from "traditional" (batch) gradient descent in that the weights are updated after each training vector; traditional gradient descent would minimize the error only after seeing all training vectors.
Training vectors will be denoted by S:T, that is, T is the desired output for vector S = (S0, S1, S2, ..., Sn). The bias b will be clamped to an input of 1 and will be assumed to be weight W0.

Initialize weights (you can make them all zero) and set the learning rate to a small value (for example a = 0.1).
while not terminating condition   /* e.g. when weight changes fall below a specified size */
{
    for each training vector
    {
        Xi = Si for i = 0,1,...,n ;      /* set activations of input units */
        Y_input = sum( Xi * Wi ) ;

        /* update weights */
        Wi(new) = Wi(old) + a*(T - Y_input)*Xi ;
    } /* end for */
} /* end while */
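A matching Python sketch of the Adaline rule (again a minimal illustration with names of my own choosing; the second helper implements the thresholding trick for discrete classification mentioned above):

```python
# Minimal Adaline (LMS / Widrow-Hoff) training sketch (hypothetical
# helpers, not from the text). The activation is the identity, and every
# presentation moves the weights by a*(T - Y_input)*Xi, whether or not
# the thresholded output would have been correct.
def adaline_train(samples, n_weights, a=0.1, tol=1e-4, max_epochs=1000):
    W = [0.0] * n_weights
    for _ in range(max_epochs):
        biggest_change = 0.0
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))  # identity activation
            step = [a * (T - y_input) * x for x in S]
            W = [w + s for w, s in zip(W, step)]
            biggest_change = max(biggest_change, max(abs(s) for s in step))
        if biggest_change < tol:         # stop when weight changes are small
            break
    return W

def adaline_classify(S, W):
    # threshold the trained net to get a discrete, perceptron-style output
    return 1 if sum(x * w for x, w in zip(S, W)) >= 0 else -1
```

Note that with a fixed learning rate the per-step weight changes need not fall all the way to zero (the minimum squared error is usually nonzero), so max_epochs acts as a backstop for the terminating condition.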
Perceptron Learning for logical AND

Each input vector is written (1, X1, X2), the leading 1 being the clamped bias input; the weights are listed in the same order (W0, W1, W2), with a = 1 and initial weights (0, 0, 0).

Input        | Net | Out | Target | Weights
Epoch 1      |     |     |        | (0, 0, 0)
(1, 1, 1)    |  0  |  0  |   1    | (1, 1, 1)
(1, 1, -1)   |  1  |  1  |  -1    | (0, 0, 2)
(1, -1, 1)   |  2  |  1  |  -1    | (-1, 1, 1)
(1, -1, -1)  | -3  | -1  |  -1    | (-1, 1, 1)
Epoch 2      |     |     |        |
(1, 1, 1)    |  1  |  1  |   1    | (-1, 1, 1)
(1, 1, -1)   | -1  | -1  |  -1    | (-1, 1, 1)
(1, -1, 1)   | -1  | -1  |  -1    | (-1, 1, 1)
(1, -1, -1)  | -3  | -1  |  -1    | (-1, 1, 1)

No weights change during epoch 2, so training terminates with W = (-1, 1, 1).
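The worked AND trace can be reproduced with a short script (a sketch with hypothetical names, following the same conventions: bipolar values, bias as element 0, a = 1):

```python
# Reproduce the worked AND trace (hypothetical helper, not from the text).
def perceptron_trace(samples, a=1, epochs=2):
    W = [0, 0, 0]                    # initial weights (W0 = bias, W1, W2)
    history = []
    for _ in range(epochs):
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))
            Y = 1 if y_input > 0 else -1 if y_input < 0 else 0
            if Y != T:               # update on a miss (or on net == 0)
                W = [w + a * T * x for w, x in zip(W, S)]
            history.append((y_input, W[:]))
    return history

# Bipolar AND patterns: (bias, X1, X2) : target
and_samples = [((1, 1, 1), 1), ((1, 1, -1), -1),
               ((1, -1, 1), -1), ((1, -1, -1), -1)]
for net, W in perceptron_trace(and_samples):
    print(net, W)                    # nets: 0, 1, 2, -3, 1, -1, -1, -3
```

The printed nets and weights follow the table row by row, ending with W = (-1, 1, 1).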