Scott Weaver
Leemon Baird
Marios Polycarpou
Wright-Patterson Air Force Base
WL/AAAT Bldg 635, 2185 Avionics Circle
WPAFB, OH 45433-7301
scott.weaver@uc.edu
For some applications, a neural network's ability to generalize may actually hinder its ability to learn. In neurocontrol, for example, the network is usually trained online using an incremental learning algorithm, with desired input/output data as a teaching signal. The data is obtained in real time from a continuous dynamical system, so consecutive training samples lie near one another in input space and are hence highly correlated. Training is therefore concentrated in one area of the input space while other areas may suffer from unlearning, which slows learning of the entire function.
Concentration of learning in one area would not be a problem if learning at a point $x$ in the input space affected the network mapping only at $x$, as is the case with lookup tables.
When using neural networks, however, learning at $x$ typically does affect the output at $x' \neq x$.
This allows networks to generalize better, but it is also responsible for interference [1].
When learning at $x$ causes correct learning at $x'$, we say the network is generalizing well, but when learning at $x$ causes incorrect learning at $x'$, we say there is interference between training at the two points.
The right balance between generalization and interference is surely different for different applications.
It is safe to say, however, that as consecutive training samples become closer to one another, interference becomes more of a problem, a problem that can be serious during neurocontrol.
Since we have little control over how the training data is presented (because of the application) and we would like some degree of generalization (because there is a continuum of inputs), reducing interference in the network is the next logical step.
Networks that are less likely to have problems with interference are said to be local. One approach for obtaining local networks is to choose a network architecture that has good localization properties, regardless of the weight configuration. Finding the balance between generalization and interference may be a difficult task for the network designer. An alternative method is to choose a learning algorithm that adjusts weights in such a way that local properties emerge from the network during learning. To this end, we need a better understanding of what makes local networks local.
Although the concept of local networks is found in the literature [2], what is lacking is a rigorous definition of network localization. Generalization, a related concept, is well defined in [3] for example, but the theoretical framework assumes a finite number of input-output pairs. We propose a measure of network localization based on the extent of interference during learning. This definition will provide a means for comparing the localization of various networks and possibly pave the way for faster learning algorithms.
Consider a network whose input/output characteristics are described by $y = f(x, w)$, where $x \in X$ is the input to the network, $w \in W$ is the weight vector, $y$ is the output of the network, $f : X \times W \to \mathbb{R}$ is a smooth mapping describing the function implemented by the network, $X$ is the input domain, and $W$ is the weight domain.
During supervised learning, the objective is to adjust $w$ such that the network approximates a desired function of $x$. The samples are given by $(x, y^*)$, where $y^*$ is the exemplar (desired) output.
Most descriptions of localization found in the literature ignore the role of the learning algorithm. However, since the goal of making local networks is to reduce interference and since interference is directly affected by the learning mechanism, we propose a measure of localization based on interference and a measure of interference based on learning.
To incorporate the learning mechanism into the definition of interference, consider what happens when the network learns for one time step.
The training exemplar $(x', y^{*\prime})$ is chosen to yield an error $f(x', w) - y^{*\prime}$ that is normalized to be one. This leads to a choice of $y^{*\prime} = f(x', w) - 1$.
After one step the new weight will be $w' = w + \alpha H$, where $\alpha$ is the learning rate, and $H$ specifies the direction for weight change.
Figure 1: Learning at $x'$ causes interference at $x$. The top (bottom) curve is the network mapping before (after) training at $x'$. The desired training exemplar is $(x', y^{*\prime})$.
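We define the interference at $x$ due to learning at $x'$ as the limiting ratio of the two output changes,
$$I(x, x') = \lim_{\alpha \to 0} \frac{f(x, w') - f(x, w)}{f(x', w') - f(x', w)}, \qquad (1)$$
where $w'$ denotes the weights after one learning step at $x'$ with learning rate $\alpha$.
The validity of this definition is taken up later in a simpler yet equivalent formulation.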
Interference, according to the above definition, is a sensitivity of the (network) output at $x$ with respect to the output at $x'$ due to learning at $x'$.
If training at $x'$ has little effect on the output at $x$, the magnitude of the ratio in (1) will be small; if interference occurs, training at $x'$ affects the output at $x$ significantly, causing the magnitude of the ratio in (1) to be large.
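Figure 1 graphically depicts the network's output before and after learning at $x'$ for a finite $\alpha$. To form the ratio in (1), divide the change in output at $x$, which is $f(x, w') - f(x, w)$, by the change in output at $x'$, which is $f(x', w') - f(x', w)$. If this ratio remains constant as $\alpha$ approaches zero, the interference in (1) equals that constant value.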
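A first-order expansion shows why the limit in (1) depends only on gradients. The sketch below assumes the one-step update has the form $w' = w + \alpha H$, with $H$ the weight-change direction and $\alpha$ the learning rate, and that $f$ is smooth in $w$ as stated earlier; the symbol names are notational assumptions.

```latex
\begin{align*}
f(x, w + \alpha H) - f(x, w)
  &= \alpha\, \nabla_w f(x, w) \cdot H + O(\alpha^2), \\[4pt]
\frac{f(x, w') - f(x, w)}{f(x', w') - f(x', w)}
  &= \frac{\nabla_w f(x, w) \cdot H + O(\alpha)}
          {\nabla_w f(x', w) \cdot H + O(\alpha)}
  \;\xrightarrow[\alpha \to 0]{}\;
  \frac{\nabla_w f(x, w) \cdot H}{\nabla_w f(x', w) \cdot H}.
\end{align*}
```

The limit exists whenever the denominator $\nabla_w f(x', w) \cdot H$ is non-zero, and the learning rate cancels, so the result does not depend on the scale or sign of $H$.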
Equation (1) is equivalent to a ratio of two dot products
$$I(x, x') = \frac{\nabla_w f(x, w) \cdot H}{\nabla_w f(x', w) \cdot H}, \qquad (2)$$
where $\nabla_w f(x, w)$ is the gradient vector of the scalar function $f$ with respect to $w$, and $H$ is the weight-change direction introduced above.
This equation is valid, and hence the above definition of interference is valid, when the quantity $\nabla_w f(x, w)$ exists and the denominator is non-zero for each $x'$ and $w$.
In the particular case of gradient descent ($H = -\nabla_w f(x', w)$ for a unit error) the latter condition is satisfied when the network has a bias weight at the output layer and, if the output layer has an activation function, that function is monotonic.
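The equivalence between the finite-step ratio and the ratio of dot products can be checked numerically. The following is a minimal sketch, not the authors' code: the tiny tanh network, its weight layout, the helper names, and the sign convention `H = -grad` are all assumptions made for illustration.

```python
import numpy as np

def f(x, w):
    """Hypothetical scalar network: 2 tanh hidden units plus an output bias."""
    a, b, c, d = w[0:2], w[2:4], w[4:6], w[6]
    return float(c @ np.tanh(a * x + b) + d)

def grad_w(x, w, eps=1e-6):
    """Central-difference estimate of the gradient of f with respect to w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(x, w + step) - f(x, w - step)) / (2 * eps)
    return g

def interference_ratio(x, x_prime, w, H, alpha):
    """Finite-alpha ratio of output changes after the step w -> w + alpha * H."""
    w_new = w + alpha * H
    return (f(x, w_new) - f(x, w)) / (f(x_prime, w_new) - f(x_prime, w))

def interference_gradient(x, x_prime, w, H):
    """Limiting value of the ratio: a quotient of two dot products."""
    return (grad_w(x, w) @ H) / (grad_w(x_prime, w) @ H)

rng = np.random.default_rng(0)
w = rng.normal(size=7)
x, x_prime = 0.3, 0.5
# Gradient descent direction for a unit error (assumed sign convention).
H = -grad_w(x_prime, w)

limit_value = interference_gradient(x, x_prime, w, H)
for alpha in (1e-1, 1e-2, 1e-3):
    print(alpha, interference_ratio(x, x_prime, w, H, alpha), limit_value)
```

Because this network has an output bias, the gradient at `x_prime` always has a component equal to one, so the denominator cannot vanish under gradient descent. As `alpha` shrinks, the printed finite-step ratio should approach `limit_value`.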
Now we are ready to define a measure of network localization.

The network's localization, denoted here by $L$, is a positive quantity.
The smaller the value of $L$, the more local the network.
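Example:
Let us compute the localization of the network $f$ using gradient descent on the quadratic cost function $J = \frac{1}{2}\,(f(x', w) - y^{*\prime})^2$.
The network $f$ is a piecewise constant function with domain $X = [0, 1)$.
Let $f(x, w) = \sum_{i=1}^{N} w_i \phi_i(x)$, where for $i = 1, \dots, N$ each basis function $\phi_i$ is defined as
$$\phi_i(x) = \begin{cases} 1 & \text{if } (i-1)/N \le x < i/N, \\ 0 & \text{otherwise.} \end{cases}$$
If $x$ is such that $\phi_j(x) = 1$ then $\nabla_w f(x, w) = e_j$, the $j$th unit vector. If $x'$ is such that $\phi_k(x') = 1$ then $\nabla_w f(x', w) = e_k$, and it follows from (2) and $H = -e_k$ that the interference is
$$I(x, x') = \frac{e_j \cdot e_k}{e_k \cdot e_k} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{if } j \neq k. \end{cases}$$
When $j$ and $k$ are equal, both $x$ and $x'$ fall within the influence of the same basis function, leading to interference.
If $x$ and $x'$ are chosen randomly from $X$ with a uniform distribution, then $P(j = k) = 1/N$, so the interference is non-zero for only a fraction $1/N$ of input pairs.
One can see that as $N \to \infty$ the localization measure tends to zero; equivalently, as the number of basis functions increases, for this example, the network becomes more local.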
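The piecewise constant example can be simulated directly. The sketch below assumes equal-width basis functions on [0, 1) and the gradient descent direction `H = -grad`; the function names are hypothetical. It estimates only the fraction of input pairs with non-zero interference, not the localization measure itself.

```python
import numpy as np

def basis_index(x, n_basis):
    """Index of the single basis function active at x in [0, 1) (equal-width bins)."""
    return min(int(np.floor(x * n_basis)), n_basis - 1)

def grad_w(x, n_basis):
    """Gradient of f(x, w) = sum_i w_i * phi_i(x) with respect to w: a one-hot vector."""
    g = np.zeros(n_basis)
    g[basis_index(x, n_basis)] = 1.0
    return g

def interference(x, x_prime, n_basis):
    """Ratio of dot products with the gradient descent direction H = -grad at x'."""
    H = -grad_w(x_prime, n_basis)
    return (grad_w(x, n_basis) @ H) / (grad_w(x_prime, n_basis) @ H)

def fraction_interfering(n_basis, n_pairs=20000, seed=0):
    """Monte Carlo estimate of the share of uniform (x, x') pairs that interfere."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 1.0, size=n_pairs)
    xps = rng.uniform(0.0, 1.0, size=n_pairs)
    hits = sum(interference(x, xp, n_basis) != 0.0 for x, xp in zip(xs, xps))
    return hits / n_pairs

for n in (2, 10, 50):
    print(n, fraction_interfering(n), 1.0 / n)
```

The estimated fraction should track 1/N, which illustrates why adding basis functions reduces interference in this example.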
The proposed measures of localization and interference are applicable to most network/algorithm combinations. They are valuable when comparing networks and evaluating a network's learning capability. The measure of localization allows the development of learning algorithms that find, in real time, a balance between generalization and localization.
Preliminary simulations performing gradient descent on a new cost function, one that penalizes poor localization as well as poor approximation, show that with highly correlated input exemplars the new learning mechanism learns faster than with the quadratic error cost function alone.
It is important to stress that localization of a network is not a goal in itself. It is merely a tool one can use to accomplish the true goal, which is to reduce the approximation error for all inputs and to do so quickly.