Real-time dynamic programming¹ (RTDP) uses a Markov Decision Process (MDP) as its model of the environment. It samples paths through the state space by following the current greedy policy and updates the value of each state along the way. This makes it an efficient form of real-time planning, since the whole state space is not necessarily visited, and it works well in stochastic environments. With some additional work, RTDP can therefore be applied to stochastic gym environments such as FrozenLake. I tried just that: given a 20x20 map and full observability, RTDP reaches a decent reward after a few minutes of training on a single core. The source code can be found at the end of the post.
Here's the algorithm:
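As a rough Python sketch of one RTDP trial (the transition format `P[s][a] = [(prob, next_state, cost), ...]` is my assumption, modeled after gym's `env.P` but with rewards replaced by costs):

```python
import random

def rtdp_trial(P, V, start, goal, max_steps=100):
    """One RTDP trial: follow the greedy policy from start to goal,
    doing a Bellman backup on every state visited along the way.
    P[s][a] is a list of (prob, next_state, cost) tuples; V maps
    states to current value estimates (cost-to-go)."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        # Bellman backup: Q(s,a) = sum over s' of P(s'|s,a) * (cost + V(s'))
        q = {a: sum(p * (c + V[s2]) for p, s2, c in P[s][a]) for a in P[s]}
        a_greedy = min(q, key=q.get)  # costs, so we minimize
        V[s] = q[a_greedy]
        # sample the successor state according to the transition probabilities
        r, acc = random.random(), 0.0
        for p, s2, c in P[s][a_greedy]:
            acc += p
            if r <= acc:
                s = s2
                break
    return V
```

Repeated trials from the start state converge toward the optimal cost-to-go values along the relevant states, without ever touching unreachable parts of the state space.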
And this is the environment:
```
SFFFFFFFHFFFFFFHFHFF
FFFFFFFFFFFHFFFFFHFF
FFHFFHFHFFFFFFHFFFFH
FFHFFHFFFFFFFFHFFHFF
FFFHFFFFFFFFFFFFFFHF
FFFFHFFFFFHFFFFHFFFH
FFFFFFFHFHFFHFFFFFFF
HFHFFFFFFFFFFHFFFFFF
HFFFFFFFFHHFHFFHHFFF
FFFFFFFFFHFHFFFFFFFF
FFFFFFFFFFFFHFFFFFFH
FFFFFFFHFFFFFFFFFFFH
FFFFFFHFFFFFFFFFHHFF
HFFHFFFHHFHFFFHHFFFF
FFFFFFFFFHFHFFHHHFFF
HFFFFFHFFFFFHFHFFFFF
HFFFFFFFFFFFFFFFHFFH
FHFFFFFFFHFFFFFFFFFF
FFHFFFFFFFHFFFFHFHFF
FFHFHFFFFFFFHHFFFFFG
```
For the MDP we need four components: states, actions, a transition matrix and costs. Fortunately, since we're using a gym environment, most of this is already given: there are four actions, the states are known, and the environment exposes the transition matrix (including the probabilities). Only the costs are missing. Out of the box, FrozenLake gives a single positive reward at the goal and zero in all other cases, e.g. when falling into a hole. One possible custom cost mapping: 0.1 for each action, 1 for a hole, and -1 for the goal (a negative cost). A keyword related to negative costs is self-absorbing goals, i.e. goals that have just a single action leading back to the goal itself at zero or even negative cost. A self-absorbing goal is a requirement of stochastic shortest path MDPs (SSP MDPs)⁴, which is pretty much what we have here: we want the path with the lowest cost (the same as a shortest path when cost equals time). Note that gamma needs to be set to 1.
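A sketch of such a cost mapping could look like the following (`to_ssp_costs` is a made-up helper, not the post's actual code; the tuple layout `(prob, next_state, reward, done)` follows gym's `env.P` convention, which also includes terminal states):

```python
def to_ssp_costs(env_P, goal_state):
    """Replace gym rewards with SSP costs: 0.1 per step, 1 for falling
    into a hole, -1 for reaching the goal. env_P follows gym's env.P
    layout: env_P[s][a] = [(prob, next_state, reward, done), ...]."""
    P = {}
    for s, actions in env_P.items():
        P[s] = {}
        for a, transitions in actions.items():
            P[s][a] = []
            for prob, s2, _reward, done in transitions:
                if s2 == goal_state:
                    cost = -1.0      # negative cost for reaching the goal
                elif done and s2 != s:
                    cost = 1.0       # entering a hole
                elif done:
                    cost = 0.0       # terminal self-loop, already absorbed
                else:
                    cost = 0.1       # regular step
                P[s][a].append((prob, s2, cost))
    # self-absorbing goal: every action loops back to the goal at zero cost
    P[goal_state] = {a: [(1.0, goal_state, 0.0)] for a in P[goal_state]}
    return P
```

Keeping the terminal self-loops at zero cost ensures the cost-to-go values stay finite, while the cost of 1 is still incurred on the transition into the hole.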
With this cost mapping, RTDP achieves an average reward of 0.48 (over 10,000 evaluation episodes). This is a good result, since this is not an easy stochastic environment: there is only a 33% chance of actually ending up where you intend to go! But this is also where RTDP excels. A faster alternative to RTDP would be to discretize the transition probabilities and apply D* Lite.
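At evaluation time the agent simply acts greedily with respect to the learned values; a minimal sketch, assuming transitions are stored as `(prob, next_state, cost)` tuples (the helper name is hypothetical):

```python
def greedy_action(P, V, s):
    """Pick the action with the lowest expected cost-to-go.
    P[s][a] is a list of (prob, next_state, cost) tuples and
    V maps states to cost-to-go estimates."""
    return min(P[s], key=lambda a: sum(p * (c + V[s2]) for p, s2, c in P[s][a]))
```

Running this policy in the environment for many episodes and averaging the returned rewards gives the evaluation number above.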
1. Barto et al., Learning to Act Using Real-Time Dynamic Programming, 1994
2. McMahan et al., Bounded Real-Time Dynamic Programming: RTDP with Monotone Upper Bounds and Performance Guarantees, 2005
3. Mausam and Kolobov, Planning with Markov Decision Processes: An AI Perspective, 2012
4. Bonet and Geffner, Solving Stochastic Shortest-Path Problems with RTDP, 2002