Reinforcement Learning:
An Alternative Approach To Machine Intelligence

Dr. A. Harry Klopf, Lt Mance E. Harmon, and Capt Leemon C. Baird III

Machine Intelligence Paradigms

There are many unsolved problems that computers could solve if the appropriate software existed. Flight control systems for new, nonlinear aircraft, automated manufacturing systems, and sophisticated avionics systems all involve difficult, nonlinear control problems. Many of these problems are currently unsolvable, not because current computers are too slow or have too little memory, but simply because it is too difficult to calculate what the program should do. If a computer could learn to solve the problems through trial and error, that would be of great practical value. Attempts to solve this problem are generally known as artificial intelligence or machine intelligence.

There have traditionally been two approaches, or paradigms, for creating useful machine intelligence. The first, the symbolic processing paradigm, assumes that we can tell the computer all the relevant facts in a situation, using a higher-level language than Ada or C. The computer is then expected to logically reason out what it should do, using a large number of IF-THEN type rules. For example, we might tell the computer all the facts about navigation and threat avoidance, and then expect it to logically deduce the best route for an aircraft to fly. Expert systems are the best example of this approach.

The second approach, the supervised learning paradigm, doesn't assume that we know as much. We need only know a set of questions with the right answers. For example, we might not know the best way to program a computer to recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we do know whether each picture contains a tank or not. The computer is expected to look at all the examples with answers, and learn on its own how to recognize tanks in general. This is usually done by simulating simple neuron-like equations connected together to form networks, which then learn to recognize patterns. Standard neural networks are the best example of this approach.
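As a minimal sketch of this idea, the following trains a single neuron-like unit on labeled examples using the classic perceptron rule. The two input features and the example data are hypothetical stand-ins for measurements taken from an image; a real recognizer would use a much larger network and real pictures.

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """Train one neuron-like unit on (features, label) pairs with labels 0 or 1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # zero when the answer is already right
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, features):
    """Answer 1 ("tank") or 0 ("no tank") for one set of features."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Hypothetical questions with known answers: two made-up feature scores per picture.
EXAMPLES = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.7], 1),
    ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.3, 0.3], 0),
]

weights, bias = train_perceptron(EXAMPLES)
```

The unit is never told a rule for recognizing tanks; it only adjusts its weights whenever its answer disagrees with the known answer.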

Unfortunately, there are many situations where we don't know enough about the world to build an expert system, and we don't even know the correct answers that supervised learning requires. For example, in a flight control system, the question would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. Simple neural networks can't learn to fly the plane unless there is a set of known answers, so if we don't know how to build a controller in the first place, simple supervised learning won't help.

That is why there has been much interest recently in a different approach, the reinforcement learning paradigm. In this approach, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial and error. A reinforcement learning program works in real time, with continual feedback from its environment, to achieve a given goal. This real-time, closed-loop, goal-seeking behavior seems to be a crucial aspect of how brains work, and also seems to be important for creating the machine intelligence we need to solve some difficult problems. Therefore we and others have been pursuing this form of machine intelligence, and are excited about the possibility of coming to understand large networks that are reinforcement learning controllers. These may give machine intelligence many of the capabilities that only humans have now.

Reinforcement Learning Defined

Reinforcement learning is a difficult problem because the controller may perform an action but not be told whether that action was good or bad. For example, a learning autopilot program might be given control of a simulator and told not to crash. It will have to make many decisions each second and then, after acting on thousands of decisions, the aircraft might crash. What should the controller learn from this experience? Which of its many actions were responsible for the crash? It is this problem of assigning blame to individual actions that makes reinforcement learning difficult.

Surprisingly, there is a solution to this problem. It is based on a field of mathematics called dynamic programming, and it involves just two principles. First, if an action causes something bad to happen immediately, such as crashing the plane, then the controller learns not to do that action in that situation again. So whatever the controller did one millisecond before the crash, it will avoid doing in the future. But that principle doesn't help for all the earlier actions that didn't lead to immediate disaster.

The second principle is that if all the actions in a certain situation lead to bad results, then that situation itself should be considered bad. So if the controller has experienced a certain combination of altitude and airspeed many different times, trying a different action each time, and all actions led to something bad, then it will learn that this situation itself is bad. This is a powerful principle, because the controller can now learn without crashing. In the future, any time it chooses an action that leads to this particular situation, it will immediately learn that that action is bad, without having to wait for the crash.
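The two principles can be sketched on a toy problem. The situations, actions, and transitions below are hypothetical and deterministic, and the scoring (-1 for a crash, 0 otherwise) is an assumption made for illustration. A situation is scored by its best available action, so it becomes bad only when every action from it leads somewhere bad.

```python
CRASH = "crash"

# Hypothetical situations; each action leads to exactly one next situation.
TRANSITIONS = {
    "cruise": {"hold": "cruise", "descend": "low"},
    "low": {"climb": "cruise", "descend": "stall"},
    "stall": {"pull_up": CRASH, "push_down": CRASH},
    CRASH: {},
}


def evaluate(transitions, sweeps=10):
    """Score each situation as the score of the best situation reachable in one action."""
    value = {s: (-1.0 if s == CRASH else 0.0) for s in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if actions:  # the crash itself has no actions and keeps its score
                value[state] = max(value[nxt] for nxt in actions.values())
    return value


def action_values(transitions, value):
    """Score each action by the situation it leads to."""
    return {
        (state, action): value[nxt]
        for state, actions in transitions.items()
        for action, nxt in actions.items()
    }


value = evaluate(TRANSITIONS)
q = action_values(TRANSITIONS, value)
```

Here both actions in "stall" lead directly to the crash (the first principle), so "stall" itself is scored as bad (the second principle). Descending from "low" is then marked bad because it leads to "stall", even though that action never causes a crash directly.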

By using these two principles, a reinforcement learning system can actually learn to fly a plane, or control a robot, or do any number of tasks. It can first learn on a simulator, then fine-tune on the real thing. All of the facts that are learned about actions and situations are typically stored using a neural network, so reinforcement learning research profits from all of the accumulated experience of neural network researchers. Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more difficult question, and the two fields can be combined to yield powerful machine-learning systems.
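The following is a minimal sketch of learning by trial and error, using tabular Q-learning (Watkins & Dayan, 1992) rather than advantage updating, and a lookup table rather than a neural network. The five-cell corridor, the rewards, and the parameter values are all hypothetical choices made for illustration.

```python
import random

ACTIONS = ("left", "right")
CRASH, GOAL, START = 0, 4, 2  # hypothetical corridor: cell 0 is a crash, cell 4 is the goal


def step(state, action):
    """Simulate one move and return (next_state, reward, episode_finished)."""
    nxt = state - 1 if action == "left" else state + 1
    if nxt == CRASH:
        return nxt, -1.0, True
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from randomly chosen trial actions."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(CRASH, GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            action = rng.choice(ACTIONS)  # pure exploration; Q-learning is off-policy
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Pick the highest-valued action in each non-terminal cell."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(CRASH + 1, GOAL)}


q_table = q_learning()
policy = greedy_policy(q_table)
```

The learner is told only the reward at each step, never which action was correct, yet the table it builds favors moving toward the goal from every cell.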

Advantage Updating

A reinforcement learning algorithm that we have developed, called advantage updating (Air Force patent pending, Baird 1994), is a particular type of reinforcement learning that is specifically designed for problems with real-time control and many actions per second. It can be applied to both optimization problems, such as autopilots, and game problems, such as dogfights. In game theory, a game is a problem where there are two players, and each must make decisions based on the behavior of the opponent. For example, if an antiaircraft missile is targeted at an aircraft, the best strategy for the aircraft depends on the behavior of the missile, and the best strategy for the missile depends on the behavior of the aircraft. It is usually extremely difficult to calculate the optimal strategies for both players in a game problem, but advantage updating can be used to learn those optimal strategies. An advantage updating system can simulate multiple games with a missile and plane starting in various configurations, and can then learn the optimal actions for both the plane and the missile. After it has learned, the program can be installed in a missile, and it will play perfectly against any plane, or it can be installed in a plane, and it will play perfectly against any missile. If both the missile and plane are controlled by the learning system, then both will play well, and neither will be able to improve its performance by using any other strategy. If there is an optimal strategy, the advantage updating algorithm will always learn it, given sufficient time and memory.
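The property that neither player can improve by changing strategy can be illustrated on a much simpler, one-shot game. The sketch below is not advantage updating; it only checks a hypothetical payoff table for a saddle point, assuming the plane picks a row to maximize the payoff (say, miss distance) and the missile picks a column to minimize it.

```python
def saddle_points(payoff):
    """Return (row, col, value) for entries that are the minimum of their row
    and the maximum of their column."""
    points = []
    for r, row in enumerate(payoff):
        for c, value in enumerate(row):
            column = [other_row[c] for other_row in payoff]
            if value == min(row) and value == max(column):
                points.append((r, c, value))
    return points


# Hypothetical payoffs: rows are plane maneuvers, columns are missile maneuvers.
PAYOFF = [
    [3, 1, 4],
    [5, 2, 6],
    [4, 0, 3],
]

equilibria = saddle_points(PAYOFF)
```

At the saddle point, the plane cannot raise the payoff by switching rows and the missile cannot lower it by switching columns. Not every payoff table has such an entry, which is consistent with the qualification above that an optimal strategy must exist for it to be learned.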

We tested this algorithm on a problem in which a fast missile is targeted at a slower plane. The program was told only that the missile should try to hit the plane, and that the plane should try to avoid the missile. The program was not told how missiles and planes work, and was given no hints about which strategies to use, so it was forced to learn all of the strategies on its own. For this particular problem, we were also able to calculate mathematically the best strategy for the missile and for the plane, and so were able to compare what the program learned to the best possible answer. The comparison showed that the advantage updating system converged to the optimal answers. Figure 1 shows the results of pitting a missile against a plane when both are following the policy that the computer learned. If either player were to deviate from the policy, its performance would be worse. Each of the figures shows the same players and strategies, but with different initial positions and velocities.

The advantage updating system learned that, in some cases, it is better for a plane to turn toward the missile, decreasing the distance in the short term, in order to increase the distance between the two in the long term (Figure 2), a tactic sometimes used by pilots. It also learned that a missile should sometimes lead the plane, aiming at a point in front of the plane rather than simply homing in on a heat source or radar signature (Figure 3). This demonstrates that a single advantage updating system can solve a game problem, finding the optimal actions for both players, even when it is given no initial information and must learn everything on its own. These results were presented at the Neural Information Processing Systems Conference in December 1994 (Harmon, Baird, & Klopf, 1994).
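To show what leading the plane means geometrically, the following sketch computes an aim point in closed form. It assumes a constant-speed missile and a plane flying in a straight line at constant velocity, which is a simplification and not how the learning system arrived at its behavior; the function name and the numbers are hypothetical.

```python
import math


def lead_intercept(missile_pos, missile_speed, plane_pos, plane_vel):
    """Return (time, aim_point) at which the missile can meet the plane,
    or None if the missile is not faster than the plane."""
    dx = plane_pos[0] - missile_pos[0]
    dy = plane_pos[1] - missile_pos[1]
    vx, vy = plane_vel
    # Require |d + v*t| = s*t, which gives (v.v - s^2) t^2 + 2 (d.v) t + d.d = 0.
    a = vx * vx + vy * vy - missile_speed ** 2
    b = 2.0 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy
    if a >= 0:  # this sketch only covers a missile faster than the plane
        return None
    disc = b * b - 4.0 * a * c  # non-negative because a < 0 and c >= 0
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # the non-negative root when a < 0
    return t, (plane_pos[0] + vx * t, plane_pos[1] + vy * t)


# Missile at the origin with speed 5; plane at (4, 0) flying "up" with speed 3.
time_to_hit, aim_point = lead_intercept((0.0, 0.0), 5.0, (4.0, 0.0), (0.0, 3.0))
```

In this example the aim point lies ahead of the plane's current position along its flight path, rather than at the plane itself.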

Figure 1: Progression from initial state to missile impact. The plane is represented by the thick triangle and leaves a connected path. The missile is represented by the thin triangle and leaves a dotted path. Each successive diagram displays a larger geographical area, so the triangles appear smaller.

Figure 2: An example of the plane turning toward the missile in the short term in order to increase the distance between the plane and missile in the long term, a tactic sometimes used by pilots. Included is a graph of distance versus time showing the effects of the plane's learned decisions.

Figure 3: The missile leads the plane, aiming at a point in front of the plane, rather than at the plane itself.

Transitioning the Technology

We have primarily worked in-house on the theoretical underpinnings of intelligent, reinforcement-learning controllers. We have been supported by the Air Force Office of Scientific Research in our work in this area, and have also initiated external contracts supported by Wright Laboratory. The application potential appears to be substantial for this new form of machine intelligence, so applications are currently being investigated for navigation, sensor resource management, and automatic target recognition, tracking, and pursuit.

A Small Business Innovation Research (SBIR) contract is currently under way to apply advantage updating to a computer vision system. The reinforcement learning system will decide how to move a motorized camera so that objects in a scene can be recognized quickly. In addition, a one million dollar exploratory development contractual program is planned to start in Fiscal Year 1996 to apply reinforcement learning systems to avionics problems.

The field of reinforcement learning appears to have great potential for creating software that learns on its own to solve difficult problems. In the next few years, both the theory and the applications may grow rapidly, with significant impact on both the Air Force and civilian industry.

References

Baird, L. C. (1993). Advantage updating (Wright Laboratory Technical Report WL-TR-93-1146, available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145). Wright-Patterson Air Force Base, OH.

Crites, R., & Barto, A. (1995). Improving elevator performance using reinforcement learning. Advances in Neural Information Processing Systems 8. D. Touretzky, M. Mozer, and M. Hasselmo, eds. MIT Press, Cambridge, MA.

Harmon, M. E., Baird, L. C., & Klopf, A. H. (1994). Advantage updating applied to a differential game. Advances in Neural Information Processing Systems 7. G. Tesauro, D. Touretzky, and T. Leen, eds. MIT Press, Cambridge, MA.

Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology, 16(2), 85-125.

Klopf, A. H., Morgan, J. S., & Weaver, S. E. (1993). A hierarchical network of control systems that learn: Modeling nervous system function during classical and instrumental conditioning. Adaptive Behavior, 1(3), 263-319.

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277.

Watkins, C., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3/4), 279-292.

Zhang, W., & Dietterich, T. (1995). High-performance job-shop scheduling with a time-delay TD(lambda) network. Advances in Neural Information Processing Systems 8. D. Touretzky, M. Mozer, and M. Hasselmo, eds. MIT Press, Cambridge, MA.

About the Authors

A. Harry Klopf

Wright Laboratory

WL/AAAT, Bldg. 635

2185 Avionics Circle

Wright-Patterson Air Force Base, OH 45433-7301

Voice: 513-255-7649/5800 DSN 785-7649/5800

Fax: 513-476-4302 DSN 785-4302

klopfah@aa.wpafb.af.mil

Mance E. Harmon

Wright Laboratory

WL/AAAT, Bldg. 635

2185 Avionics Circle

Wright-Patterson Air Force Base, OH 45433-7301

harmonme@aa.wpafb.mil

http://ace.aa.wpafb.af.mil/~harmonme/index.html

Leemon C. Baird III

United States Air Force Academy

HQ USAFA/DFCS

2354 Fairchild Dr. Suite 6K41

USAFA, CO 80840-6234

baird@cs.usafa.af.mil

http://kirk.usafa.af.mil/~baird/

Dr. A. Harry Klopf is the senior scientist for machine intelligence, Avionics Directorate, Wright Laboratory, Aeronautical Systems Center, Wright-Patterson Air Force Base, Ohio. Dr. Klopf leads an intramural team of scientists and engineers at Wright Laboratory, with the objective of modeling the learning mechanisms and network architectures of natural intelligence for transitioning to machine intelligence.

Dr. Klopf began his affiliation with the Air Force when he assumed a National Research Council postdoctoral associateship at the Air Force Cambridge Research Laboratories, Hanscom AFB, Massachusetts in 1970. In the postdoctoral position and continuing to the present day, Dr. Klopf's research has focused on reinforcement learning and adaptive behavior as approaches to the analysis of natural intelligence and the synthesis of machine intelligence. Since 1973, Dr. Klopf has been with the Avionics Directorate of Wright Laboratory.

Lt Mance E. Harmon graduated from Mississippi State University with a Bachelor of Science in Computer Science and started performing research at Wright Laboratory when he entered active duty in 1993. His interests include developing and applying reinforcement learning algorithms to problems with continuous time, state, and action sets. More information and downloadable papers are available at http://www.aa.wpafb.af.mil/~harmonme.

Capt Leemon Baird received a B.S. in Computer Science from the U.S. Air Force Academy in 1989 and an M.S. in Computer Science from Northeastern University in 1991. He has performed reinforcement learning research at Draper Laboratory, Boston MA, and Wright Laboratory, Wright-Patterson AFB, and is currently on the faculty of the Department of Computer Science at the Air Force Academy. More information and downloadable papers are available at http://kirk.usafa.af.mil/~baird.