Reinforcement learning provides a sound framework for credit assignment in unknown stochastic dynamic environments. For Markov environments a variety of reinforcement learning algorithms have been devised to predict and control the environment (e.g., the TD(λ) algorithm of Sutton, 1988, and the Q-learning
algorithm of Watkins, 1989). Ties to the theory of dynamic programming (DP) and
the theory of stochastic approximation have been exploited, providing tools that
have allowed these algorithms to be analyzed theoretically (Dayan, 1992; Tsitsiklis,
1994; Jaakkola, Jordan, & Singh, 1994; Watkins & Dayan, 1992).
Although current reinforcement learning algorithms are based on the assumption
that the learning problem can be cast as a Markov decision problem (MDP), many
practical problems resist being treated as an MDP. Unfortunately, when the Markov
assumption is removed, there are examples in which current algorithms cease to
perform well (Singh, Jaakkola, & Jordan, 1994); when, for instance, two states that
demand different actions produce the same observation, a deterministic memoryless
policy can be arbitrarily suboptimal. Moreover, the theoretical analyses rely heavily
on the Markov assumption.
The non-Markov nature of the environment can arise in many ways. The most direct
extension of MDP's is to deprive the learner of perfect information about the state
of the environment. Much as in the case of Hidden Markov Models (HMM's), the
underlying environment is assumed to be Markov, but the data do not appear to be
Markovian to the learner. This extension not only allows for a tractable theoretical
analysis, but is also appealing for practical purposes. The decision problems we
consider here are of this type.
The analog of the HMM for control problems is the partially observable Markov
decision process (POMDP; see, e.g., Monahan, 1982). Unlike HMM's, however,
no computationally tractable solution procedure is known for POMDP's. The problem
is that once the state estimates have been obtained, DP must be performed in
the continuous space of probabilities of state occupancies (the belief space), and this DP process is
computationally infeasible except for small state spaces. In this paper we describe
an alternative approach for POMDP's that avoids the state estimation problem and
works directly in the space of (stochastic) control policies. (See Singh et al., 1994,
for additional material on stochastic policies.)
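To make the difficulty concrete, the state-estimation approach can be sketched in standard belief-state notation (the symbols $b$, $T$, $O$, $R$, and $\gamma$ below are illustrative and not fixed by the references): after taking action $a$ and observing $o$, the belief state $b$ is updated by Bayes' rule, and DP must then solve a recursion over the resulting beliefs,

$$
b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)},
\qquad
V(b) \;=\; \max_{a}\Big[\sum_{s} b(s)\,R(s,a) \;+\; \gamma \sum_{o} P(o \mid b, a)\, V(b'_{a,o})\Big],
$$

where $P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)$ and $\gamma$ is a discount factor. Because $V$ is a function over the continuous simplex of beliefs rather than over a finite state set, exact DP of this form is feasible only for very small state spaces.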