# finite_mdp

## SYNOPSIS

Solve small, finite Markov Decision Process (MDP) models.

This library provides several ways of describing an MDP model (see FiniteMDP::Model) and some reasonably efficient implementations of policy iteration and value iteration to solve it (see FiniteMDP::Solver).

### Usage

#### Example 1: Recycling Robot

The following shows how to solve the recycling robot model (Example 3.7) from <cite>Sutton and Barto (1998). Reinforcement Learning: An Introduction</cite>.

<blockquote> At each time step, the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low. </blockquote>

The transition model is described in Table 3.1, which can be fed directly into FiniteMDP using the FiniteMDP::TableModel, as follows. Each row has the form [state, action, next state, probability, reward].

```
require 'finite_mdp'
alpha = 0.1 # Pr(stay at high charge if searching | now have high charge)
beta = 0.1 # Pr(stay at low charge if searching | now have low charge)
r_search = 2 # reward for searching
r_wait = 1 # reward for waiting
r_rescue = -3 # reward (actually penalty) for running out of charge
model = FiniteMDP::TableModel.new [
  [:high, :search,   :high, alpha,   r_search],
  [:high, :search,   :low,  1-alpha, r_search],
  [:low,  :search,   :high, 1-beta,  r_rescue],
  [:low,  :search,   :low,  beta,    r_search],
  [:high, :wait,     :high, 1,       r_wait],
  [:high, :wait,     :low,  0,       r_wait],
  [:low,  :wait,     :high, 0,       r_wait],
  [:low,  :wait,     :low,  1,       r_wait],
  [:low,  :recharge, :high, 1,       0],
  [:low,  :recharge, :low,  0,       0]]
solver = FiniteMDP::Solver.new(model, 0.95) # discount factor 0.95
solver.policy_iteration 1e-4
solver.policy #=> {:high=>:search, :low=>:recharge}
```
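
To make the solver's job more concrete, here is a minimal plain-Ruby sketch of value iteration on the same table. It is an illustration only and is not the gem's implementation (the example above calls policy_iteration rather than value iteration). The names rows, table, q_value and value_iteration are assumptions introduced for this sketch, and the zero-probability rows are left out because they do not affect the result.

```ruby
# Plain-Ruby sketch of value iteration; it does not use the finite_mdp gem.
# Rows use the same layout as the TableModel example:
# [state, action, next_state, probability, reward]
alpha = 0.1
beta = 0.1
r_search = 2
r_wait = 1
r_rescue = -3
rows = [
  [:high, :search,   :high, alpha,     r_search],
  [:high, :search,   :low,  1 - alpha, r_search],
  [:low,  :search,   :high, 1 - beta,  r_rescue],
  [:low,  :search,   :low,  beta,      r_search],
  [:high, :wait,     :high, 1,         r_wait],
  [:low,  :wait,     :low,  1,         r_wait],
  [:low,  :recharge, :high, 1,         0]
]

# Index the rows as table[state][action] => [[next_state, pr, reward], ...]
table = Hash.new { |h, s| h[s] = Hash.new { |g, a| g[a] = [] } }
rows.each { |s, a, s2, pr, r| table[s][a] << [s2, pr, r] }

# Expected discounted return of taking `action` in `state`, given the
# current value estimates.
def q_value(table, value, discount, state, action)
  table[state][action].inject(0.0) do |sum, (next_state, pr, reward)|
    sum + pr * (reward + discount * value[next_state])
  end
end

# Repeatedly replace each state's value with its best action value until
# the largest change in a sweep falls below `tolerance`. Values are
# updated in place, so later states in a sweep see the newer estimates.
def value_iteration(table, discount, tolerance, max_iters = 10_000)
  value = {}
  table.each_key { |s| value[s] = 0.0 }
  max_iters.times do
    delta = 0.0
    table.each_key do |s|
      best = table[s].keys.map { |a| q_value(table, value, discount, s, a) }.max
      delta = [delta, (best - value[s]).abs].max
      value[s] = best
    end
    break if delta < tolerance
  end
  # The greedy policy picks the action with the highest value in each state.
  policy = {}
  table.each do |s, actions|
    policy[s] = actions.keys.max_by { |a| q_value(table, value, discount, s, a) }
  end
  [value, policy]
end

value, policy = value_iteration(table, 0.95, 1e-8)
```

For this model the greedy policy comes out as search at high charge and recharge at low charge, which agrees with the solver.policy result above.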

#### Example 2: Grid Worlds

A more complicated example: the grid world from <cite>Russell and Norvig (2003). Artificial Intelligence: A Modern Approach</cite>, Chapter 17.

Here we describe the model as a class that implements the FiniteMDP::Model interface. The model contains terminal states, which we represent with a special absorbing state with zero reward, called :stop.

```
require 'finite_mdp'
class AIMAGridModel
  include FiniteMDP::Model

  #
  # @param [Array<Array<Float, nil>>] grid rewards at each point, or nil if a
  #        grid square is an obstacle
  #
  # @param [Array<[i, j]>] terminii coordinates of the terminal states
  #
  def initialize grid, terminii
    @grid, @terminii = grid, terminii
  end

  attr_reader :grid, :terminii

  # every position on the grid is a state, except for obstacles, which are
  # indicated by a nil in the grid
  def states
    is, js = (0...grid.size).to_a, (0...grid.first.size).to_a
    is.product(js).select {|i, j| grid[i][j]} + [:stop]
  end

  # can move north, east, south or west on the grid
  MOVES = {
    '^' => [-1,  0],
    '>' => [ 0,  1],
    'v' => [ 1,  0],
    '<' => [ 0, -1]}

  # agent can move north, south, east or west (unless it's in the :stop
  # state); if it tries to move off the grid or into an obstacle, it stays
  # where it is
  def actions state
    if state == :stop || terminii.member?(state)
      [:stop]
    else
      MOVES.keys
    end
  end

  # define the transition model
  def transition_probability state, action, next_state
    if state == :stop || terminii.member?(state)
      (action == :stop && next_state == :stop) ? 1 : 0
    else
      # agent usually succeeds in moving forward, but sometimes it ends up
      # moving left or right
      move = case action
             when '^' then [['^', 0.8], ['<', 0.1], ['>', 0.1]]
             when '>' then [['>', 0.8], ['^', 0.1], ['v', 0.1]]
             when 'v' then [['v', 0.8], ['<', 0.1], ['>', 0.1]]
             when '<' then [['<', 0.8], ['^', 0.1], ['v', 0.1]]
             end
      move.map {|m, pr|
        m_state = [state[0] + MOVES[m][0], state[1] + MOVES[m][1]]
        m_state = state unless states.member?(m_state) # stay in bounds
        pr if m_state == next_state
      }.compact.inject(:+) || 0
    end
  end

  # reward is given by the grid cells; zero reward for the :stop state
  def reward state, action, next_state
    state == :stop ? 0 : grid[state[0]][state[1]]
  end

  # helper for functions below
  def hash_to_grid hash
    0.upto(grid.size-1).map{|i| 0.upto(grid[i].size-1).map{|j| hash[[i,j]]}}
  end

  # print the values in a grid
  def pretty_value value
    hash_to_grid(Hash[value.map {|s, v| [s, "%+.3f" % v]}]).map{|row|
      row.map{|cell| cell || ' '}.join(' ')}
  end

  # print the policy using ASCII arrows
  def pretty_policy policy
    hash_to_grid(policy).map{|row| row.map{|cell|
      (cell.nil? || cell == :stop) ? ' ' : cell}.join(' ')}
  end
end
# the grid from Figures 17.1, 17.2(a) and 17.3
model = AIMAGridModel.new(
  [[-0.04, -0.04, -0.04,    +1],
   [-0.04,   nil, -0.04,    -1],
   [-0.04, -0.04, -0.04, -0.04]],
  [[0, 3], [1, 3]]) # terminals (the +1 and -1 states)
# sanity check: successor state probabilities must sum to 1
model.check_transition_probabilities_sum
solver = FiniteMDP::Solver.new(model, 1) # discount factor 1
solver.value_iteration(1e-5, 100) #=> true if converged
puts model.pretty_policy(solver.policy)
# output: (matches Figure 17.2(a))
# > > >
# ^   ^
# ^ < < <
puts model.pretty_value(solver.value)
# output: (matches Figure 17.3)
#  0.812  0.868  0.918  1.000
#  0.762         0.660 -1.000
#  0.705  0.655  0.611  0.388
FiniteMDP::TableModel.from_model(model)
#=> [[0, 0], "v", [0, 0], 0.1, -0.04]
# [[0, 0], "v", [0, 1], 0.1, -0.04]
# [[0, 0], "v", [1, 0], 0.8, -0.04]
# [[0, 0], "<", [0, 0], 0.9, -0.04]
# [[0, 0], "<", [1, 0], 0.1, -0.04]
# [[0, 0], ">", [0, 0], 0.1, -0.04]
# [[0, 0], ">", [0, 1], 0.8, -0.04]
# [[0, 0], ">", [1, 0], 0.1, -0.04]
# ...
# [:stop, :stop, :stop, 1, 0]
```

Python code for this model is also available from the book's authors at aima.cs.berkeley.edu/python/mdp.html
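
The sanity check in Example 2 requires that, for every state and action, the probabilities of the successor states sum to 1. The following plain-Ruby sketch shows that check applied to the table rows printed above. It does not use the gem; probability_sums is a hypothetical helper introduced here, and the 1e-9 tolerance is an arbitrary choice to absorb floating-point rounding.

```ruby
# Hypothetical helper, not part of the finite_mdp API: sum the transition
# probabilities for every (state, action) pair in a list of table rows of
# the form [state, action, next_state, probability, reward].
def probability_sums(rows)
  sums = Hash.new(0.0)
  rows.each do |state, action, _next_state, pr, _reward|
    sums[[state, action]] += pr
  end
  sums
end

# The rows shown in the Example 2 output (the elided rows are not included).
rows = [
  [[0, 0], 'v', [0, 0], 0.1, -0.04],
  [[0, 0], 'v', [0, 1], 0.1, -0.04],
  [[0, 0], 'v', [1, 0], 0.8, -0.04],
  [[0, 0], '<', [0, 0], 0.9, -0.04],
  [[0, 0], '<', [1, 0], 0.1, -0.04],
  [[0, 0], '>', [0, 0], 0.1, -0.04],
  [[0, 0], '>', [0, 1], 0.8, -0.04],
  [[0, 0], '>', [1, 0], 0.1, -0.04],
  [:stop, :stop, :stop, 1, 0]
]

sums = probability_sums(rows)

# Collect any (state, action) pairs whose probabilities do not sum to 1.
bad = sums.reject { |_pair, total| (total - 1.0).abs < 1e-9 }
puts bad.empty? ? 'all rows sum to 1' : "bad rows: #{bad.inspect}"
```

A check like this is cheap to run before solving and catches the most common mistake in hand-written transition tables, which is a missing or mistyped successor row.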

## REQUIREMENTS

This gem requires Ruby 2.2 or higher.

## INSTALLATION

```
gem install finite_mdp
```

## LICENSE

(The MIT License)

Copyright © 2016 John Lees-Miller

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.