
Proceedings of the American Control Conference, Seattle, Washington, 21-23 June 1995

On the Localization of Feedforward Networks

Scott Weaver, Leemon Baird, Marios Polycarpou

Wright-Patterson Air Force Base
WL/AAAT Bldg 635, 2185 Avionics Circle
WPAFB, OH 45433-7301
scott.weaver@uc.edu

Abstract--

Interference in neural networks occurs when learning in one area of the input space causes unlearning in another area. Networks that are less susceptible to interference are called spatially local networks. These networks are often used in online neurocontrol applications, where, because of the real-time nature of the task, interference is often a problem. Although there are heuristics as to what makes a network local, there is no theoretical framework for measuring localization. This paper provides a formal definition of interference and localization that will allow us to measure a network's local properties. These definitions will be useful in developing learning algorithms that make networks more local. This may lead to faster learning over the entire input domain.

1. Introduction

For some applications, a neural network's ability to generalize may actually hinder its ability to learn. In neurocontrol, for example, the network is usually trained online using an incremental learning algorithm, with desired input/output data as a teaching signal. The data is obtained in real time from a continuous dynamical system, causing consecutive training samples to be near one another in input space, and hence highly correlated. Training is therefore concentrated in one area of the input space while other areas may suffer from unlearning, leading to slower learning of the entire function.

Actually, concentration of learning in one area would not be a problem if learning at a point $x$ in the input space only affected the network mapping at $x$, as is the case with lookup tables. When using neural networks, however, learning at $x$ typically does affect the output at $x' \neq x$. This allows networks to generalize better but it is also responsible for interference [1]. When learning at $x$ causes correct learning at $x'$ we say the network is generalizing well, but when learning at $x$ causes incorrect learning at $x'$ we say there is interference between training at the two points. The right balance between generalization and interference is surely different for different applications. It is safe to say, however, that as consecutive training samples become closer to one another, interference becomes more of a problem, a problem that can be serious during neurocontrol. Since we have little control over how the training data is presented (because of the application) and we would like some degree of generalization (because there is a continuum of inputs), making networks have less interference is the next logical step.

Networks that are less likely to have problems with interference are said to be local. One approach for obtaining local networks is to choose a network architecture that has good localization properties, regardless of the weight configuration. Finding the balance between generalization and interference may be a difficult task for the network designer. An alternative method is to choose a learning algorithm that adjusts weights in such a way that local properties emerge from the network during learning. To this end, we need a better understanding of what makes local networks local.

Although the concept of local networks is found in the literature [2], what is lacking is a rigorous definition of network localization. Generalization, a related concept, is well defined in [3] for example, but the theoretical framework assumes a finite number of input-output pairs. We propose a measure of network localization based on the extent of interference during learning. This definition will provide a means for comparing the localization of various networks and possibly pave the way for faster learning algorithms.

2. Localization

Consider a network whose input/output characteristics are described by $y = f(x, w)$, where $x \in X$ is the input to the network, $w \in W$ is the weight vector, $y$ is the output of the network, $f$ is a smooth mapping describing the function implemented by the network, $X$ is the input domain, and $W$ is the weight domain. During supervised learning, the objective is to adjust $w$ such that the network approximates a desired function of $x$. The training samples are given by $(x, y_d)$, where $y_d$ is the exemplar (desired) output.

Most descriptions of localization found in the literature ignore the role of the learning algorithm. However, since the goal of making local networks is to reduce interference and since interference is directly affected by the learning mechanism, we propose a measure of localization based on interference and a measure of interference based on learning.

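To incorporate the learning mechanism into the definition of interference, consider what happens when the network learns for one time step. The training exemplar ($x'$, $y'_d$) is chosen to yield an error that is normalized to be one. This leads to a choice of $y'_d = f(x', w) - 1$. After one step the new weight will be $w + \alpha H$, where $\alpha$ is the learning rate and $H$ specifies the direction for weight change. The interference at $x$ due to learning at $x'$ is then the limiting ratio

$$I(x, x') = \lim_{\alpha \to 0} \frac{f(x, w + \alpha H) - f(x, w)}{f(x', w + \alpha H) - f(x', w)} \qquad (1)$$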

 

  
Figure 1: Learning at $x'$ causes interference at $x$. The top (bottom) curve is the network mapping before (after) training at $x'$. The desired training exemplar is $(x', y'_d)$.

The validity of this definition is taken up later in a simpler yet equivalent formulation. Interference, according to the above definition, is a sensitivity of the (network) output at $x$ with respect to the output at $x'$ due to learning at $x'$. If training at $x'$ has little effect on the output at $x$, the magnitude of the ratio in (1) will be small; if interference occurs, training at $x'$ affects the output at $x$ significantly, causing the magnitude of the ratio in (1) to be large. Figure 1 graphically depicts the network's output before and after learning at $x'$ for a finite $\alpha$. To form the ratio in (1), divide the change in output at $x$, which is $f(x, w + \alpha H) - f(x, w)$, by the change in output at $x'$, which is $f(x', w + \alpha H) - f(x', w)$. If the ratio remains constant as $\alpha$ approaches zero, that constant is the interference $I(x, x')$.
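The ratio described above can be evaluated numerically for a finite learning rate. The following is a minimal sketch, not the authors' implementation: the network `f` (a bias plus one tanh unit), the weight values, and the helper names `grad_w` and `interference_ratio` are all assumptions made for illustration, and the weight-change direction is taken to be the gradient descent direction for a unit error.

```python
import math

def f(x, w):
    """Hypothetical scalar network: output bias plus one tanh hidden unit."""
    return w[0] + w[1] * math.tanh(w[2] * x + w[3])

def grad_w(x, w, eps=1e-6):
    """Central-difference estimate of the gradient of f with respect to w."""
    g = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        g.append((f(x, w_plus) - f(x, w_minus)) / (2 * eps))
    return g

def interference_ratio(x, x_train, w, alpha):
    """Change in output at x divided by the change at x_train after one step."""
    # Assumed direction: gradient descent with the error normalized to one.
    H = [-g for g in grad_w(x_train, w)]
    w_new = [wi + alpha * hi for wi, hi in zip(w, H)]
    change_at_x = f(x, w_new) - f(x, w)
    change_at_train = f(x_train, w_new) - f(x_train, w)
    return change_at_x / change_at_train

w = [0.1, 0.8, 2.0, -0.5]  # arbitrary illustrative weights
ratios = [interference_ratio(0.9, 0.2, w, a) for a in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

As the step size shrinks, the printed ratios should settle toward a single value, which is the quantity the limit in (1) refers to. Evaluating the ratio with both points equal gives exactly one, since the numerator and denominator coincide.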

Equation (1) is equivalent to a ratio of two dot products

$$I(x, x') = \frac{\nabla_w f(x, w) \cdot H}{\nabla_w f(x', w) \cdot H} \qquad (2)$$

where $\nabla_w f$ is the gradient vector of the scalar function $f$ with respect to $w$. This equation is valid, and hence the above definition of interference is valid, when the quantity $\nabla_w f$ exists and the denominator is non-zero for each $x'$ and $w$. In the particular case of gradient descent ($H = -\nabla_w f(x', w)$) the latter condition is satisfied when the network has a bias weight at the output layer and, if the output layer has an activation function, that function is monotonic. Now we are ready to define a measure of network localization.
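The dot-product form lends itself to direct computation when the gradient is available in closed form. The sketch below assumes a network that is linear in its weights, with Gaussian basis functions; the centers, widths, and function names are hypothetical choices for illustration and do not come from the paper. For such a network the gradient with respect to the weights is simply the vector of basis values, so the weights themselves drop out of the ratio.

```python
import math

CENTERS = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical basis-function centers

def features(x, width):
    """Gaussian basis values; for f = sum_i w_i * phi_i(x) this equals grad_w f."""
    return [math.exp(-((x - c) / width) ** 2) for c in CENTERS]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def interference(x, x_train, width):
    """Ratio of dot products with the gradient descent direction H."""
    g_x = features(x, width)
    g_train = features(x_train, width)
    H = [-g for g in g_train]  # gradient descent direction for unit error
    # The denominator is non-zero here because Gaussian features are positive.
    return dot(g_x, H) / dot(g_train, H)

wide = interference(0.9, 0.1, width=0.5)     # broad, overlapping basis functions
narrow = interference(0.9, 0.1, width=0.05)  # narrow, nearly disjoint ones
print(wide, narrow)
```

With broad basis functions the two distant points share active features and the ratio is appreciable; with narrow ones the gradients are nearly orthogonal and the ratio is close to zero, which matches the intuition that spatially narrow basis functions interfere less.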

The network's localization is a positive quantity. The smaller its value, the more local the network.

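Example: Let us compute the localization of the network using gradient descent on the quadratic cost function $\frac{1}{2}(f(x', w) - y'_d)^2$. The network is a piecewise constant function with domain $X$ = [0, 1). Let $f(x, w) = \sum_{i=1}^{n} w_i \phi_i(x)$, where for each $i$ the basis function $\phi_i$ is defined as

$$\phi_i(x) = \begin{cases} 1 & \text{if } (i-1)/n \le x < i/n \\ 0 & \text{otherwise} \end{cases}$$

If $x$ is such that $\phi_j(x) = 1$ then $\nabla_w f(x, w) = e_j$, the $j$th unit vector. If $x'$ is such that $\phi_k(x') = 1$ then $\nabla_w f(x', w) = e_k$, and it follows from (2) and $H = -\nabla_w f(x', w)$ that

$$I(x, x') = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases}$$

When $j$ and $k$ are equal, both $x$ and $x'$ fall within the influence of the same basis function, leading to interference. If $x$ and $x'$ are chosen randomly from $X$ with a uniform distribution, then $P(j = k) = 1/n$, so only a fraction $1/n$ of the pairs interfere. One can see that the localization measure tends to zero as $n \to \infty$; equivalently, as the number of basis functions increases, for this example, the network becomes more local.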
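The behavior in this example can be checked by simulation. The sketch below assumes n equal-width basis functions on [0, 1), which is an assumption about the example rather than something stated verbatim, and it estimates only the fraction of uniformly sampled pairs that interfere; the function names, sample count, and seed are illustrative.

```python
import random

def bin_index(x, n):
    """Index of the basis function active at x, for n equal-width bins on [0, 1)."""
    return min(int(x * n), n - 1)

def interference(x, x_train, n):
    """Gradients are unit vectors here, so the ratio is 1 for a shared bin, else 0."""
    return 1.0 if bin_index(x, n) == bin_index(x_train, n) else 0.0

def interfering_fraction(n, samples=20000, seed=0):
    """Monte Carlo estimate of the fraction of uniform random pairs that interfere."""
    rng = random.Random(seed)
    hits = sum(interference(rng.random(), rng.random(), n) for _ in range(samples))
    return hits / samples

fractions = {n: interfering_fraction(n) for n in (2, 10, 50)}
print(fractions)
```

The estimated fraction should fall roughly in proportion to 1/n, so adding basis functions makes interfering pairs rarer, in line with the conclusion that the network becomes more local.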

3. Discussion and Conclusion

The proposed measures of localization and interference are applicable to most network/algorithm combinations. They are valuable when comparing networks and evaluating a network's learning capability. The measure of localization allows the development of learning algorithms that find, in real time, a balance between generalization and localization. Preliminary simulations performing gradient descent on a new cost function, one that penalizes poor localization as well as poor approximation, show that with highly correlated input exemplars the new learning mechanism learns faster than with the quadratic error cost function alone. It is important to stress that localization of a network is not a goal in itself. It is merely a tool one can use to accomplish the true goal, which is to reduce the approximation error for all inputs and to do so quickly.







Leemon Baird
Mon Oct 30 12:48:31 MST 1995