Perceptron Learning and Adaline Learning
(adapted from Fundamentals of Neural Networks by L. Fausett)

Perceptron Learning
Things to note:
1) Perceptron learning can work for binary (0,1) as well as real-valued input. For discrete input, however, bipolar values (-1,1) are preferred: with binary coding, no learning occurs when the desired output is 0, and a 0 cannot be distinguished from missing data.
2) Perceptron learning does converge to a (non-unique) solution if the data are linearly separable.
We will assume all input and output is bipolar (-1,1). Training vectors will be denoted by S:T, that is, T is the desired output for vector S = (S0, S1, S2, ..., Sn). The bias b will be clamped to an input of 1 and will be assumed to be weight W0.

Initialize weights (you can make them all zero) and set the learning rate to 1 (a = 1).
while not terminating condition
{
    for each training vector
    {
        Xi = Si for i = 0,1,...,n ;      /* set activations of input units */
        Y_input = sum( Xi * Wi ) ;
        Y =  1 if Y_input > 0
          = -1 if Y_input < 0
        /* if Y_input == 0 we can't determine anything, so adjust weights */

        /* update weights if required */
        if Y <> T        /* predicted output doesn't equal desired output */
        {
            Wi(new) = Wi(old) + a*T*Xi ;
        }
    } /* end for */
} /* end while */
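The pseudocode above can be sketched in Python. This is a minimal illustration only (the function name and defaults are mine, not Fausett's), assuming bipolar training pairs with the bias input clamped to 1 as element 0 of each vector:

```python
# Minimal perceptron training sketch (hypothetical helper, not from the
# text). Each sample is (S, T) with S = (1, S1, ..., Sn) and T in {-1, 1};
# W[0] plays the role of the bias b.
def perceptron_train(samples, n_weights, a=1, max_epochs=100):
    W = [0] * n_weights                  # initialize all weights to zero
    for _ in range(max_epochs):
        changed = False
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))
            # threshold activation; y_input == 0 forces an update too
            Y = 1 if y_input > 0 else -1 if y_input < 0 else 0
            if Y != T:                   # update only on a miss
                W = [w + a * T * x for w, x in zip(W, S)]
                changed = True
        if not changed:                  # terminating condition: a whole
            return W                     # epoch with no weight changes
    return W
```

For example, training on the bipolar logical AND patterns used at the end of these notes yields W = (-1, 1, 1).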
Adaline Learning
Things to note:
Adaline learning differs from perceptron learning in several respects:
- The activation function is the identity function, as opposed to a threshold function (as in the perceptron).
- Adaline learning is based on gradient descent methods: the gradient is used to minimize the squared error. This learning rule goes by several names, including LMS (least mean squares) and Widrow-Hoff.
- Adaline learning can easily be turned into discrete classification learning (just like a perceptron). First train the net like any Adaline net, then define a threshold function that outputs 1 if the activation is >= 0, and -1 otherwise.
- Selection of the learning rate (a) is much more important than it is for perceptrons. If a is too large, the minimum of the error surface may be missed; if it is too small, learning can take a long time. The following rule of thumb has been suggested: if N is the number of training vectors, then 0.1 <= N*a <= 1.0 should be satisfied. My own experience disagrees with this estimate; I prefer much smaller values of a.
- Adaline learning will converge even for datasets that are not linearly separable, since it minimizes the squared error rather than seeking a perfect separation. The algorithm below is sometimes referred to as iterative or stochastic gradient descent. It differs from "traditional" (batch) gradient descent in that the weights are updated after each training vector; traditional gradient descent would minimize the error only after seeing all training vectors.
Training vectors will be denoted by S:T, that is, T is the desired output for vector S = (S0, S1, S2, ..., Sn). The bias b will be clamped to an input of 1 and will be assumed to be weight W0.

Initialize weights (you can make them all zero) and set the learning rate to a small value (for example a = 0.1).
while not terminating condition   /* e.g. when weight changes fall below a specified size */
{
    for each training vector
    {
        Xi = Si for i = 0,1,...,n ;      /* set activations of input units */
        Y_input = sum( Xi * Wi ) ;

        /* update weights */
        Wi(new) = Wi(old) + a*(T - Y_input)*Xi ;
    } /* end for */
} /* end while */
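A matching Python sketch of the Adaline rule (again a minimal illustration with names of my own choosing; the second helper implements the thresholding trick for discrete classification mentioned above):

```python
# Minimal Adaline (LMS / Widrow-Hoff) training sketch (hypothetical
# helpers, not from the text). The activation is the identity, and every
# presentation moves the weights by a*(T - Y_input)*Xi, whether or not
# the thresholded output would have been correct.
def adaline_train(samples, n_weights, a=0.1, tol=1e-4, max_epochs=1000):
    W = [0.0] * n_weights
    for _ in range(max_epochs):
        biggest_change = 0.0
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))  # identity activation
            step = [a * (T - y_input) * x for x in S]
            W = [w + s for w, s in zip(W, step)]
            biggest_change = max(biggest_change, max(abs(s) for s in step))
        if biggest_change < tol:         # stop when weight changes are small
            break
    return W

def adaline_classify(S, W):
    # threshold the trained net to get a discrete, perceptron-style output
    return 1 if sum(x * w for x, w in zip(S, W)) >= 0 else -1
```

Note that with a fixed learning rate the per-step weight changes need not fall all the way to zero (the minimum squared error is usually nonzero), so max_epochs acts as a backstop for the terminating condition.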
Perceptron Learning for logical AND

Each input vector is written (1, X1, X2), the leading 1 being the clamped bias input; the weights are listed in the same order (W0, W1, W2), with a = 1 and initial weights (0, 0, 0).

Input        | Net | Out | Target | Weights
Epoch 1      |     |     |        | (0, 0, 0)
(1, 1, 1)    |  0  |  0  |   1    | (1, 1, 1)
(1, 1, -1)   |  1  |  1  |  -1    | (0, 0, 2)
(1, -1, 1)   |  2  |  1  |  -1    | (-1, 1, 1)
(1, -1, -1)  | -3  | -1  |  -1    | (-1, 1, 1)
Epoch 2      |     |     |        |
(1, 1, 1)    |  1  |  1  |   1    | (-1, 1, 1)
(1, 1, -1)   | -1  | -1  |  -1    | (-1, 1, 1)
(1, -1, 1)   | -1  | -1  |  -1    | (-1, 1, 1)
(1, -1, -1)  | -3  | -1  |  -1    | (-1, 1, 1)

No weights change during epoch 2, so training terminates with W = (-1, 1, 1).
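The worked AND trace can be reproduced with a short script (a sketch with hypothetical names, following the same conventions: bipolar values, bias as element 0, a = 1):

```python
# Reproduce the worked AND trace (hypothetical helper, not from the text).
def perceptron_trace(samples, a=1, epochs=2):
    W = [0, 0, 0]                    # initial weights (W0 = bias, W1, W2)
    history = []
    for _ in range(epochs):
        for S, T in samples:
            y_input = sum(x * w for x, w in zip(S, W))
            Y = 1 if y_input > 0 else -1 if y_input < 0 else 0
            if Y != T:               # update on a miss (or on net == 0)
                W = [w + a * T * x for w, x in zip(W, S)]
            history.append((y_input, W[:]))
    return history

# Bipolar AND patterns: (bias, X1, X2) : target
and_samples = [((1, 1, 1), 1), ((1, 1, -1), -1),
               ((1, -1, 1), -1), ((1, -1, -1), -1)]
for net, W in perceptron_trace(and_samples):
    print(net, W)                    # nets: 0, 1, 2, -3, 1, -1, -1, -3
```

The printed nets and weights follow the table row by row, ending with W = (-1, 1, 1).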