There have traditionally been two approaches, or paradigms, for creating useful machine intelligence. The first, the symbolic processing paradigm, assumes that we can tell the computer all the relevant facts in a situation, using a higher-level language than Ada or C. The computer is then expected to logically reason out what it should do, using a large number of IF-THEN type rules. For example, we might tell the computer all the facts about navigation and threat avoidance, and then expect it to logically deduce the best route for an aircraft to fly. Expert systems are the best example of this approach.
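As a rough illustration of IF-THEN reasoning, the following minimal Python sketch applies rules repeatedly until no new facts can be deduced (forward chaining). The facts and rules are hypothetical and chosen only for illustration; a real expert system would hold far more rules and a richer rule language.

```python
# Hypothetical rules: IF all conditions are known facts THEN conclude a new fact.
RULES = [
    ({"threat_ahead", "fuel_ok"}, "reroute_north"),
    ({"radar_contact", "contact_hostile"}, "threat_ahead"),
]


def forward_chain(facts, rules):
    """Apply IF-THEN rules until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when every one of its conditions is already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


if __name__ == "__main__":
    start = {"radar_contact", "contact_hostile", "fuel_ok"}
    print(sorted(forward_chain(start, RULES)))
```

The order of the rules does not matter here, because the loop keeps passing over them until nothing changes.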
The second approach, the supervised learning paradigm, doesn't assume that we know as much. We need only know a set of questions with the right answers. For example, we might not know the best way to program a computer to recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we do know whether each picture contains a tank or not. The computer is expected to look at all the examples with answers, and learn on its own how to recognize tanks in general. This is usually done by simulating simple neuron-like equations connected together to form networks, which then learn to recognize patterns. Standard neural networks are the best example of this approach.
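To make the idea of learning from questions with known answers concrete, here is a minimal sketch of a single neuron-like unit (a perceptron) trained on labeled examples. The two input features and all the numbers are invented for illustration (imagine two summary measurements taken from each picture); this is not the tank recognizer described above, which would need a much larger network and real images.

```python
# Hypothetical labeled examples: (two made-up image measurements, 1 = tank, 0 = no tank).
EXAMPLES = [
    ([0.9, 0.8], 1),
    ([0.8, 0.6], 1),
    ([0.2, 0.1], 0),
    ([0.1, 0.3], 0),
]


def predict(weights, bias, features):
    """One neuron-like unit: weighted sum followed by a threshold."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Nudge the weights whenever the unit's answer differs from the known answer."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


if __name__ == "__main__":
    weights, bias = train_perceptron(EXAMPLES)
    # Try the trained unit on two measurements it has not seen before.
    print(predict(weights, bias, [0.7, 0.9]), predict(weights, bias, [0.15, 0.2]))
```

Note that the learning rule needs the correct label for every example, which is exactly the requirement that supervised learning imposes.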
Unfortunately, there are many situations where we don't know enough about the world to build an expert system, and we don't even know the correct answers that supervised learning requires. For example, in a flight control system, the question would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. Simple neural networks can't learn to fly the plane unless there is a set of known answers, so if we don't know how to build a controller in the first place, simple supervised learning won't help.
That is why there has been much interest recently in a different approach, the reinforcement learning paradigm. In this approach, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial and error. A reinforcement learning program works in real time, with continual feedback from its environment, to achieve a given goal. This real-time, closed-loop, goal-seeking behavior seems to be a crucial aspect of how brains work, and also seems to be important for creating the machine intelligence we need to solve some difficult problems. Therefore we and others have been pursuing this form of machine intelligence, and are excited about the possibility of coming to understand large networks that are reinforcement learning controllers. These may give machine intelligence many of the capabilities that only humans have now.
Learning by trial and error raises a difficult problem: when something bad finally happens, such as a crash, the controller must work out which of its many earlier actions were to blame. Surprisingly, there is a solution to this problem. It is based on a field of mathematics called dynamic programming, and it involves just two principles. First, if an action causes something bad to happen immediately, such as crashing the plane, then the controller learns not to do that action in that situation again. So whatever the controller did one millisecond before the crash, it will avoid doing in the future. But that principle doesn't help for all the earlier actions that didn't lead to immediate disaster.
The second principle is that if all the actions in a certain situation lead to bad results, then that situation itself should be considered bad. So if the controller has experienced a certain combination of altitude and airspeed many different times, trying a different action each time, and all actions led to something bad, then it will learn that this situation itself is bad. This is a powerful principle, because the controller can now learn without crashing. In the future, any time it chooses an action that leads to this particular situation, it will immediately learn that that action is bad, without having to wait for the crash.
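The two principles can be sketched on a toy problem. The Python below is a simplified illustration under stated assumptions, not the controller described in this article: the situations, actions, and transitions are hypothetical, the world is assumed to be small, deterministic, and fully known, and the results are kept in plain sets rather than in a neural network.

```python
# Hypothetical toy world: each situation maps each action to the situation it leads to.
TRANSITIONS = {
    "cruise": {"climb": "cruise", "dive": "low_slow"},
    "low_slow": {"pull_up": "stall", "dive": "crash"},
    "stall": {"pull_up": "crash", "dive": "crash"},
}


def find_bad(transitions, crash="crash"):
    """Propagate 'badness' backward from the crash using the two principles."""
    bad_states = {crash}
    bad_actions = set()
    changed = True
    while changed:
        changed = False
        for state, actions in transitions.items():
            for action, next_state in actions.items():
                # Principle 1: an action is bad if it leads straight to something bad
                # (the crash itself, or a situation already known to be bad).
                if next_state in bad_states and (state, action) not in bad_actions:
                    bad_actions.add((state, action))
                    changed = True
            # Principle 2: a situation is bad if every action available in it is bad.
            if state not in bad_states and all(
                (state, action) in bad_actions for action in actions
            ):
                bad_states.add(state)
                changed = True
    return bad_states, bad_actions


if __name__ == "__main__":
    states, actions = find_bad(TRANSITIONS)
    print(sorted(states))
    print(sorted(actions))
```

In this toy world, diving from "cruise" is marked bad even though it does not crash the plane directly, because it leads to a situation from which every action ends badly. Climbing from "cruise" stays acceptable.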
By using these two principles, a reinforcement learning system can actually learn to fly a plane, or control a robot, or do any number of tasks. It can first learn on a simulator, then fine-tune on the real thing. All of the facts that are learned about actions and situations are typically stored using a neural network, so reinforcement learning research profits from all of the accumulated experience of neural network researchers. Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more difficult question, and the two fields can be combined to yield powerful machine-learning systems.
We tested one such algorithm, advantage updating, on a problem in which a fast missile is targeted at a slower plane. The program was told only that the missile should try to hit the plane, and that the plane should try to avoid the missile. It was not told how missiles and planes work, and was given no hints about which strategies to use, so it was forced to learn all of the strategies on its own. For this particular problem, we were also able to calculate mathematically the best strategy for the missile and for the plane, and so could compare what the program learned to the best possible answer. The comparison showed that the advantage updating system converged to the optimal answers. Figure 1 shows the results of pitting a missile against a plane when both are following the policy that the computer learned. If either player were to deviate from the policy, its performance would be worse. Each of the figures shows the same players and strategies, but with different initial positions and velocities.
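The statement that neither player can do better by deviating describes what game theory calls a saddle point. The sketch below shows the idea on a tiny discrete game, which is only an analogy: the actual missile and plane problem is continuous, and the action names and payoff numbers here are invented purely for illustration and are not results from our experiments.

```python
# Hypothetical payoff table: final miss distance for each pair of choices.
# The plane (row player) wants a large distance; the missile (column player) wants a small one.
PLANE_ACTIONS = ["turn_toward", "turn_away"]
MISSILE_ACTIONS = ["aim_at_plane", "lead_plane"]
PAYOFF = [
    [5, 3],  # plane turns toward the missile
    [4, 1],  # plane turns away from the missile
]


def find_saddle_points(payoff):
    """Return (row, column) pairs where neither player gains by deviating alone."""
    points = []
    for i, row in enumerate(payoff):
        for j, value in enumerate(row):
            column = [other_row[j] for other_row in payoff]
            # The missile cannot lower the value by switching columns (min of the row),
            # and the plane cannot raise it by switching rows (max of the column).
            if value == min(row) and value == max(column):
                points.append((i, j))
    return points


if __name__ == "__main__":
    for i, j in find_saddle_points(PAYOFF):
        print(PLANE_ACTIONS[i], MISSILE_ACTIONS[j], PAYOFF[i][j])
```

With these made-up numbers the only saddle point is the pair in which the plane turns toward the missile and the missile leads the plane. If either player switches alone, that player's outcome gets worse.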
The advantage updating system learned that, in some cases, it is better for a plane to turn toward the missile, decreasing the distance in the short term, in order to increase the distance between the two in the long term (Figure 2), a tactic sometimes used by pilots. It also learned that a missile should sometimes lead the plane, aiming at a point in front of the plane rather than simply homing in on a heat source or radar signature (Figure 3). This demonstrates that a single advantage updating system can solve a game problem, finding the optimal actions for both players, even when it is given no initial information and must learn everything on its own. These results were presented at the Neural Information Processing Systems Conference in December 1994.





Figure 1: Progression from initial state to missile impact. The plane is represented by the thick triangle and leaves a connected path. The missile is represented by the thin triangle and leaves a dotted path. Each successive diagram displays a larger geographical area, so the triangles appear smaller.

Figure 2: An example of the plane turning toward the missile in the short term in order to increase the distance between the plane and missile in the long term, a tactic sometimes used by pilots. Included is a graph of distance versus time showing the effects of the plane's learned decisions.
Figure 3: The missile leads the plane, aiming at a point in front of the plane, rather than at the plane itself.
A Small Business Innovation Research (SBIR) contract is currently under way to apply advantage updating to a computer vision system. The reinforcement learning system will decide how to move a motorized camera so that objects in a scene can be recognized quickly. In addition, a one-million-dollar exploratory development contractual program is planned to start in Fiscal Year 1996 to apply reinforcement learning systems to avionics problems.
The field of reinforcement learning appears to have great potential for creating software that learns on its own to solve difficult problems. In the next few years, both the theory and the application may grow exponentially, with significant impact on both the Air Force and civilian industry.
Crites, R., & Barto, A. (1995). Improving elevator performance using reinforcement learning. Advances in Neural Information Processing Systems 8. D. Touretzky, M. Mozer, & M. Hasselmo, eds. MIT Press, Cambridge, MA.
Harmon, M. E., Baird, L. C., & Klopf, A. H. (1994). Advantage updating applied to a differential game. Advances in Neural Information Processing Systems 7. G. Tesauro, D. Touretzky, & T. Leen, eds. MIT Press, Cambridge, MA.
Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology, 16(2), 85-125.
Klopf, A. H., Morgan, J. S., & Weaver, S. E. (1993). A hierarchical network of control systems that learn: Modeling nervous system function during classical and instrumental conditioning. Adaptive Behavior, 1(3), 263-319.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277.
Watkins, C., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3/4), 279-292.
Zhang, W., & Dietterich, T. (1995). High-performance job-shop scheduling with a time-delay TD(lambda) network. Advances in Neural Information Processing Systems 8. D. Touretzky, M. Mozer, & M. Hasselmo, eds. MIT Press, Cambridge, MA.
A. Harry Klopf
Wright Laboratory
WL/AAAT, Bldg. 635
2185 Avionics Circle
Wright-Patterson Air Force Base, OH 45433-7301
Voice: 513-255-7649/5800 DSN 785-7649/5800
Fax: 513-476-4302 DSN 785-4302
klopfah@aa.wpafb.af.mil
Mance E. Harmon
Wright Laboratory
WL/AAAT, Bldg. 635
2185 Avionics Circle
Wright-Patterson Air Force Base, OH 45433-7301
harmonme@aa.wpafb.mil
http://ace.aa.wpafb.af.mil/~harmonme/index.html
Leemon C. Baird III
United States Air Force Academy
HQ USAFA/DFCS
2354 Fairchild Dr. Suite 6K41
USAFA, CO 80840-6234
baird@cs.usafa.af.mil
http://kirk.usafa.af.mil/~baird/
Dr. A. Harry Klopf is the senior scientist for machine intelligence, Avionics Directorate, Wright Laboratory, Aeronautical Systems Center, Wright-Patterson Air Force Base, Ohio. Dr. Klopf leads an intramural team of scientists and engineers at Wright Laboratory, with the objective of modeling the learning mechanisms and network architectures of natural intelligence for transitioning to machine intelligence.
Dr. Klopf began his affiliation with the Air Force in 1970, when he assumed a National Research Council postdoctoral associateship at the Air Force Cambridge Research Laboratories, Hanscom AFB, Massachusetts. From that postdoctoral position to the present day, Dr. Klopf's research has focused on reinforcement learning and adaptive behavior as approaches to the analysis of natural intelligence and the synthesis of machine intelligence. Since 1973, Dr. Klopf has been with the Avionics Directorate of Wright Laboratory.
Lt Mance E. Harmon graduated from Mississippi State University with a Bachelor of Science in Computer Science and started performing research at Wright Laboratory when he entered active duty in 1993. His interests include developing and applying reinforcement learning algorithms to problems with continuous time, state, and action sets. More information and downloadable papers are available at http://www.aa.wpafb.af.mil/~harmonme.
Capt Leemon Baird received a B.S. in Computer Science from the U.S. Air Force Academy in 1989 and an M.S. in Computer Science from Northeastern University in 1991. He has performed reinforcement learning research at Draper Laboratory, Boston, MA, and Wright Laboratory, Wright-Patterson AFB, and is currently on the faculty of the Department of Computer Science at the Air Force Academy. More information and downloadable papers are available at http://kirk.usafa.af.mil/~baird.