
Class sim.mdp.MDP

java.lang.Object
   |
   +----sim.mdp.MDP

public abstract class MDP
extends Object
implements Watchable, Parsable
A Markov Decision Process or Markov Game that takes a state and action and returns a new state and a reinforcement. It can be either deterministic or nondeterministic. If the next state is fed back in as the state, it can run a simulation. If the state is repeatedly randomized, it can be used for learning with random transitions. If an MDP class is written for which an optimal policy and value function are known, then findAction() and findValue() will return them; otherwise they return null and zero respectively. Revision 1.01 added the state parameter to the findValAct method.

This code is (c) 1996 Leemon Baird and Mance Harmon <leemon@cs.cmu.edu>, http://www.cs.cmu.edu/~baird
The source and object code may be redistributed freely. If the code is modified, please state so in the comments.

Version:
1.12, 23 Aug 97
Author:
Leemon Baird, Mance Harmon

Variable Index

 o action
an action vector (created in parse())
 o nextState
the state vector resulting from doing action in state (created in parse())
 o state
a state vector (created in parse())
 o watchManager
the WatchManager that variables here may be registered with
 o wmName
the prefix string for the name of every watched variable (passed in to setWatchManager)

Constructor Index

 o MDP()

Method Index

 o actionSize()
Return the number of elements in the action vector.
 o BNF(int)
 o findValAct(Matrix, Matrix, FunApp, Matrix, PBoolean)
Find the value and best action of this state.
 o findValue(Matrix, Matrix, PDouble, FunApp, PDouble, Matrix, PDouble, PBoolean, NumExp, Random)
Find the optimum over actions of R + gamma*V(x'), where V(x') is the value of the successor state given state x, R is the reinforcement, and gamma is the discount factor.
 o getAction(Matrix, Matrix, Random)
Return the next action possible in a state given the last action performed.
 o getName()
Return the variable "name" that was passed into setWatchManager().
 o getParameters(int)
Return a parameter array if BNF(), parse(), and unparse() are to be automated, null otherwise.
 o getState(Matrix, PDouble, Random)
Return the next state to be used for training in an epoch-wise system.
 o getWatchManager()
Return the WatchManager set by setWatchManager().
 o initialAction(Matrix, Matrix, Random)
Return the initial action possible in a state.
 o initialize(int)
Initialize, either partially or completely.
 o initialState(Matrix, Random)
Return an initial state used for the start of an epoch (for epoch-wise training) or for the start of a trial (when training on trajectories).
 o nextState(Matrix, Matrix, Matrix, PDouble, PBoolean, Random)
Find a (possibly stochastic) next state given a state and action, and return the (possibly stochastic) reinforcement received.
 o numActions(Matrix)
Return the number of actions in a given state.
 o numPairs(PDouble)
Return the number of state/action pairs in the MDP for a given dt.
 o numStates(PDouble)
Return the number of states in the given MDP.
 o parse(Parser, int)
Parse the input file to get the parameters for this object.
 o randomAction(Matrix, Matrix, Random)
Generates a random action from those possible.
 o randomState(Matrix, Random)
Generates a random state from those possible and returns it in the vector passed in.
 o setWatchManager(WatchManager, String)
Register all variables with this WatchManager.
 o stateSize()
Return the number of elements in the state vector.
 o unparse(Unparser, int)
Output a description of this object that can be parsed with parse().

Variables

 o watchManager
 protected WatchManager watchManager
the WatchManager that variables here may be registered with

 o wmName
 protected String wmName
the prefix string for the name of every watched variable (passed in to setWatchManager)

 o state
 protected Matrix state
a state vector (created in parse())

 o action
 protected Matrix action
an action vector (created in parse())

 o nextState
 protected Matrix nextState
the state vector resulting from doing action in state (created in parse())

Constructors

 o MDP
 public MDP()

Methods

 o setWatchManager
 public void setWatchManager(WatchManager wm,
                             String name)
Register all variables with this WatchManager. Override this if there are internal variables that should be registered here.

 o getName
 public String getName()
Return the variable "name" that was passed into setWatchManager().

 o getWatchManager
 public WatchManager getWatchManager()
Return the WatchManager set by setWatchManager().

 o numStates
 public int numStates(PDouble dt)
Return the number of states in the given MDP. If the true number of states is infinite, then this defines the sample size of a pseudo-epoch. If the number of states is finite, then the number of states might be a function of the time step size dt. For this reason a step size dt is passed into this object. There is no need to override this if epoch-wise training will never be done.

 o stateSize
 public abstract int stateSize()
Return the number of elements in the state vector.

 o initialState
 public abstract void initialState(Matrix state,
                                   Random random) throws MatrixException
Return an initial state used for the start of an epoch (for epoch-wise training) or for the start of a trial (when training on trajectories). This might not always be the same state, but could randomly return one of a set of legal starting states.

Throws: MatrixException
Vector passed in was wrong length.
 o getState
 public void getState(Matrix state,
                      PDouble dt,
                      Random random) throws MatrixException
Return the next state to be used for training in an epoch-wise system. This method differs from nextState() in that nextState() returns the state transitioned to as a function of the dynamics of the system, whereas getState() simply returns another state to be trained upon when performing epoch-wise training. This method should incrementally return unique states until all states for the epoch have been used for training. For example, if the state space consists of 20 unique states, then this method will return a unique state on each call until all 20 states have been returned. The method would then start over with a new series of the same 20 states. The parameters are the last state used and a time step size. In short, this is an iterator over all states in the state space. If the state space is infinite, this method is not meaningful and should not be used, so there is no need to override it for infinite state spaces.

Throws: MatrixException
Vector passed in was wrong length.
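
The 20-state example above can be sketched as a wrap-around iterator. This is only an illustration, not the class's actual implementation: StateIteratorSketch, NUM_STATES, and visitOneEpoch are hypothetical names, a plain double[] stands in for Matrix, and the dt and random parameters are omitted for brevity.

```java
// Hypothetical sketch: a double[] of length 1 stands in for the Matrix state vector.
final class StateIteratorSketch {
    static final int NUM_STATES = 20;   // matches the 20-state example in the text

    /** Overwrites state[0] with the state that follows it, wrapping to 0 after the last one. */
    static void getState(double[] state) {
        state[0] = ((int) state[0] + 1) % NUM_STATES;
    }

    /** Calls getState() once per state in the epoch and records which states were returned. */
    static boolean[] visitOneEpoch(double[] state) {
        boolean[] seen = new boolean[NUM_STATES];
        for (int i = 0; i < NUM_STATES; i++) {
            getState(state);
            seen[(int) state[0]] = true;
        }
        return seen;
    }

    public static void main(String[] args) {
        double[] state = {0};
        boolean[] seen = visitOneEpoch(state);
        int count = 0;
        for (boolean b : seen) if (b) count++;
        System.out.println("distinct states in one epoch: " + count);
    }
}
```

After one full epoch the iterator is back at the state it started from, so the next epoch repeats the same series.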
 o actionSize
 public abstract int actionSize()
Return the number of elements in the action vector.

 o initialAction
 public void initialAction(Matrix state,
                           Matrix action,
                           Random random) throws MatrixException
Return the initial action possible in a state. This method is used when one has to iterate over all possible actions in a given state. Given a state, this method should return the initial action possible in that state. There is no need to override this if the action is a scalar ranging from 0 to some maximum value.

Throws: MatrixException
Vector passed in was wrong length.
 o getAction
 public void getAction(Matrix state,
                       Matrix action,
                       Random random) throws MatrixException
Return the next action possible in a state given the last action performed. This performs the same function as that of getState() in the sense that this serves as an iterator over actions instead of states. There is no need to override this if the legal actions are some range of contiguous integers.

Throws: MatrixException
Vector passed in was wrong length.
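
A caller might combine initialAction() and getAction() to visit every action in a state, roughly as sketched here for the scalar-action case. The names ActionIteratorSketch, NUM_ACTIONS, and allActions are hypothetical, a double[] stands in for Matrix, the random parameter is omitted, and the wrap-around after the last action is an assumption made by analogy with getState().

```java
import java.util.Arrays;

// Hypothetical sketch: scalar actions 0..NUM_ACTIONS-1 held in a length-1 double[].
final class ActionIteratorSketch {
    static final int NUM_ACTIONS = 3;   // assumed to be the same for every state

    /** Writes the first legal action for the given state into action[0]. */
    static void initialAction(double[] state, double[] action) {
        action[0] = 0;
    }

    /** Replaces action[0] with the action that follows it (assumed to wrap after the last). */
    static void getAction(double[] state, double[] action) {
        action[0] = ((int) action[0] + 1) % NUM_ACTIONS;
    }

    /** Visits every action in the state exactly once and returns them in order. */
    static int[] allActions(double[] state) {
        int[] seen = new int[NUM_ACTIONS];
        double[] action = new double[1];
        initialAction(state, action);
        for (int i = 0; i < NUM_ACTIONS; i++) {
            seen[i] = (int) action[0];
            getAction(state, action);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allActions(new double[] {0})));
    }
}
```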
 o numActions
 public abstract int numActions(Matrix state)
Return the number of actions in a given state. For simplicity this should be the same for all states. However, the state is being passed in to this method so that future code can take advantage of this parameter if necessary.

 o numPairs
 public int numPairs(PDouble dt)
Return the number of state/action pairs in the MDP for a given dt. This is used for epoch-wise training. An epoch would consist of all state/action pairs for a given MDP and may be a function of the step size dt. There is no need to override this if the state space is infinite, since no "epoch" is then defined.

 o randomAction
 public void randomAction(Matrix state,
                          Matrix action,
                          Random random) throws MatrixException
Generates a random action from those possible. Accepts a state and passes back an action. Each action variable should be on a separate row; action should be a vector (single-column matrix), Nx1. There is no need to override this if the legal actions are integers from 0 to numActions(state).

Throws: MatrixException
Vector passed in was wrong length.
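
For the integer-action case, the default behavior might look something like the following sketch. RandomActionSketch is a hypothetical name, a double[] stands in for the Nx1 Matrix, IllegalArgumentException stands in for MatrixException, and the upper bound is assumed to be exclusive (actions 0 to numActions-1), which the text above does not state explicitly.

```java
import java.util.Random;

// Hypothetical sketch of drawing a random scalar action.
final class RandomActionSketch {
    /** Writes a uniformly random integer action in [0, numActions) into action[0]. */
    static void randomAction(double[] action, int numActions, Random random) {
        if (action.length != 1) {
            // Stand-in for the MatrixException thrown when the vector has the wrong length.
            throw new IllegalArgumentException("action vector must have exactly one element");
        }
        action[0] = random.nextInt(numActions);
    }

    public static void main(String[] args) {
        double[] action = new double[1];
        randomAction(action, 4, new Random());
        System.out.println("random action: " + (int) action[0]);
    }
}
```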
 o randomState
 public void randomState(Matrix state,
                         Random random) throws MatrixException
Generates a random state from those possible and returns it in the vector passed in. This should NOT include terminal states where the value is known. There is no need to override this if the MDP is such that random states cannot be jumped to, and all training will be on trajectories starting from legal start states.

Throws: MatrixException
Vector passed in was wrong length.
 o nextState
 public abstract double nextState(Matrix state,
                                  Matrix action,
                                  Matrix newState,
                                  PDouble dt,
                                  PBoolean valueKnown,
                                  Random random) throws MatrixException
Find a (possibly stochastic) next state given a state and action, and return the (possibly stochastic) reinforcement received. The state, action, and new state should all be vectors (single-column matrices). The duration of the time step, dt, is also returned. Most MDPs will make this returned dt a constant, given in the parsed string, but a semi-Markov decision process could return a different dt every time. If the resulting state's value is perfectly known, then the flag valueKnown should be set to true.

Throws: MatrixException
if sizes aren't right.
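
The contract described above can be illustrated with a small chain world. This is a sketch under stated assumptions, not code from the sim.mdp package: ChainMdpSketch, Flag, GOAL, and simulate are hypothetical, double[] arrays stand in for Matrix and PDouble, Flag stands in for PBoolean, and the 10% slip probability is invented for the example. The simulate() method shows the simulation use mentioned in the class description, where each next state is fed back in as the state.

```java
import java.util.Random;

// Hypothetical sketch: states 0..GOAL on a line, actions 0 (left) and 1 (right).
final class ChainMdpSketch {
    static final int GOAL = 4;   // assumed terminal state whose value is known

    /** Mutable boolean holder, a stand-in for PBoolean. */
    static final class Flag { boolean val; }

    /** Writes the successor into newState[0], sets dt and valueKnown, returns the reinforcement. */
    static double nextState(double[] state, double[] action, double[] newState,
                            double[] dt, Flag valueKnown, Random random) {
        int x = (int) state[0];
        int move = action[0] == 1.0 ? 1 : -1;
        // Assumed stochastic element: with probability 0.1 the move slips and the state stays put.
        if (random.nextDouble() < 0.1) move = 0;
        int next = Math.max(0, Math.min(GOAL, x + move));
        newState[0] = next;
        dt[0] = 1.0;                       // constant time step
        valueKnown.val = (next == GOAL);   // terminal state: value is perfectly known
        return next == GOAL ? 1.0 : 0.0;   // reinforcement received on this transition
    }

    /** Runs one trajectory by feeding each next state back in as the state. */
    static double simulate(long seed, int maxSteps) {
        Random random = new Random(seed);
        double[] state = {0}, action = {1}, newState = {0}, dt = {0};
        Flag known = new Flag();
        double total = 0;
        for (int t = 0; t < maxSteps && !known.val; t++) {
            total += nextState(state, action, newState, dt, known, random);
            state[0] = newState[0];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("total reinforcement: " + simulate(42L, 100));
    }
}
```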
 o findValAct
 public abstract double findValAct(Matrix state,
                                   Matrix action,
                                   FunApp f,
                                   Matrix outputs,
                                   PBoolean valueKnown) throws MatrixException
Find the value and best action of this state. This returns the value of a given state as a double. This also destroys the action that is passed in by replacing it with the best action. This method always returns a value that is a function of state/action pairs. The value associated with these state/action pairs might be Q-values or advantages, but it is not important to know which learning algorithm is being used. This method should simply find the min or max value as a function of the state/action pairs in the given state. For example, if Q-learning is the learning algorithm, then one would find the max Q-value for the given state and return that value. The action associated with that Q-value would be passed back. The state/action pair with the max Q-value should be evaluated last so that findGradients() can be called from within the learning algorithm without having to call function.evaluate().

Throws: MatrixException
column vectors are wrong size or shape
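
For the Q-learning case described above, the search might be structured as in this sketch. FindValActSketch, QFunction, and lastEvaluatedAction are hypothetical; QFunction is a stand-in for FunApp (whose real interface is not shown on this page), a double[] stands in for Matrix, and the outputs and valueKnown parameters are omitted. The final re-evaluation is one simple way to make sure the best pair is the one evaluated last.

```java
// Hypothetical sketch of finding the max Q-value and best action in a state.
final class FindValActSketch {
    /** Stand-in for FunApp: maps a state/action pair to a Q-value. */
    interface QFunction { double evaluate(double[] state, double[] action); }

    /** Set by the example function below so the order of evaluation can be observed. */
    static double lastEvaluatedAction = Double.NaN;

    /** Returns the max Q-value in the state and overwrites action[0] with the best action. */
    static double findValAct(double[] state, double[] action, QFunction f, int numActions) {
        int bestA = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < numActions; a++) {
            action[0] = a;                       // the caller's action is destroyed here
            double q = f.evaluate(state, action);
            if (q > best) { best = q; bestA = a; }
        }
        // Re-evaluate the best pair so that it is the last one the function saw.
        action[0] = bestA;
        return f.evaluate(state, action);
    }

    public static void main(String[] args) {
        double[] table = {0.2, 0.9, 0.5};        // made-up Q-values for a single state
        QFunction f = (s, a) -> { lastEvaluatedAction = a[0]; return table[(int) a[0]]; };
        double[] action = new double[1];
        double value = findValAct(new double[] {0}, action, f, table.length);
        System.out.println("value=" + value + " best=" + (int) action[0]
                + " lastEvaluated=" + (int) lastEvaluatedAction);
    }
}
```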
 o findValue
 public abstract double findValue(Matrix state,
                                  Matrix action,
                                  PDouble gamma,
                                  FunApp f,
                                  PDouble dt,
                                  Matrix outputs,
                                  PDouble reinforcement,
                                  PBoolean valueKnown,
                                  NumExp explorationFactor,
                                  Random random) throws MatrixException
Find the optimum over actions of R + gamma*V(x'), where V(x') is the value of the successor state given state x, R is the reinforcement, and gamma is the discount factor. This method is used in the object ValIteration (value iteration). The max value over actions (of R + gamma*V(x')) is returned. The state reached after performing the optimal action should be returned 'explorationFactor' percent of the time in the parameter 'state'. The state resulting from a random action will be returned 1-explorationFactor percent of the time. The possibility of explorationFactor==null must be handled. The action parameter must also be checked for null before it is used, because the learning object 'ValueIteration' passes in null in place of 'action'.

Throws: MatrixException
column vectors are wrong size or shape
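
The backup and the exploration rule described above might be sketched as follows for a small deterministic chain. Everything here is illustrative: FindValueSketch, successor, and reward are hypothetical, a value table v stands in for the FunApp, double[] stands in for Matrix, explorationFactor is treated as a fraction in [0,1] rather than a NumExp, and treating a null explorationFactor as "always take the optimal action" is an assumption, since the text only says null must be handled.

```java
import java.util.Random;

// Hypothetical sketch of max over actions of R + gamma*V(x') with exploration.
final class FindValueSketch {
    static final int GOAL = 4;   // assumed terminal state

    /** Deterministic dynamics: action 1 moves right, action 0 moves left, clamped to [0, GOAL]. */
    static int successor(int x, int a) {
        return Math.max(0, Math.min(GOAL, x + (a == 1 ? 1 : -1)));
    }

    /** Reinforcement of 1 on arriving at the goal, 0 otherwise. */
    static double reward(int next) {
        return next == GOAL ? 1.0 : 0.0;
    }

    /** Returns the max backup; overwrites state[0] with the successor actually chosen. */
    static double findValue(double[] state, double[] action, double gamma, double[] v,
                            Double explorationFactor, Random random) {
        int x = (int) state[0];
        int bestA = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < 2; a++) {
            int next = successor(x, a);
            double backup = reward(next) + gamma * v[next];   // R + gamma*V(x')
            if (backup > best) { best = backup; bestA = a; }
        }
        // Assumption: a null explorationFactor means the optimal action is always followed.
        boolean followOptimal = explorationFactor == null
                || random.nextDouble() < explorationFactor;
        int chosen = followOptimal ? bestA : random.nextInt(2);
        if (action != null) action[0] = chosen;   // value iteration may pass null here
        state[0] = successor(x, chosen);
        return best;
    }

    public static void main(String[] args) {
        double[] v = new double[GOAL + 1];        // value estimates, all zero initially
        double[] state = {3};
        double value = findValue(state, null, 0.9, v, null, new Random());
        System.out.println("value=" + value + " nextState=" + (int) state[0]);
    }
}
```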
 o getParameters
 public Object[][] getParameters(int lang)
Return a parameter array if BNF(), parse(), and unparse() are to be automated, null otherwise.

See Also:
getParameters
 o initialize
 public void initialize(int level)
Initialize, either partially or completely.

See Also:
initialize
 o BNF
 public String BNF(int lang)
 o unparse
 public void unparse(Unparser u,
                     int lang)
Output a description of this object that can be parsed with parse().

 o parse
 public Object parse(Parser p,
                     int lang) throws ParserException
Parse the input file to get the parameters for this object.

Throws: ParserException
parser didn't find the required token
