# finite_mdp

## SYNOPSIS

Solve small, finite Markov Decision Process (MDP) models.

This library provides several ways of describing an MDP model (see FiniteMDP::Model) and some reasonably efficient implementations of policy iteration and value iteration to solve it (see FiniteMDP::Solver).
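
Both solvers compute an optimal policy by finding the optimal value function. As standard background (the notation below is not specific to this library), the optimal value function satisfies the Bellman optimality equation:

```latex
% Bellman optimality equation (standard background, not library-specific
% notation): P gives the transition probabilities, R the rewards, and
% \gamma is the discount factor passed to the solver
V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \, \bigl[ R(s, a, s') + \gamma \, V^*(s') \bigr]
```

Value iteration applies this equation repeatedly as an update rule until the values stop changing (to within a tolerance); policy iteration alternates between evaluating the current policy and greedily improving it.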

### Usage

#### Example 1: Recycling Robot

The following shows how to solve the recycling robot model (Example 3.7) from <cite>Sutton and Barto (1998). Reinforcement Learning: An Introduction</cite>.

<blockquote> At each time step, the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low. </blockquote>

The transition model is given in Table 3.1, and it can be fed directly into a FiniteMDP::TableModel, as follows. Each row gives a state, an action, a successor state, the transition probability, and the reward.

```ruby
require 'finite_mdp'

alpha    = 0.1 # Pr(stay at high charge if searching | now have high charge)
beta     = 0.1 # Pr(stay at low charge if searching | now have low charge)
r_search = 2   # reward for searching
r_wait   = 1   # reward for waiting
r_rescue = -3  # reward (actually penalty) for running out of charge

model = FiniteMDP::TableModel.new [
  [:high, :search,   :high, alpha,   r_search],
  [:high, :search,   :low,  1-alpha, r_search],
  [:low,  :search,   :high, 1-beta,  r_rescue],
  [:low,  :search,   :low,  beta,    r_search],
  [:high, :wait,     :high, 1,       r_wait],
  [:high, :wait,     :low,  0,       r_wait],
  [:low,  :wait,     :high, 0,       r_wait],
  [:low,  :wait,     :low,  1,       r_wait],
  [:low,  :recharge, :high, 1,       0],
  [:low,  :recharge, :low,  0,       0]]

solver = FiniteMDP::Solver.new(model, 0.95) # discount factor 0.95
solver.policy_iteration 1e-4 # solve; 1e-4 is the convergence tolerance
solver.policy #=> {:high=>:search, :low=>:recharge}
```
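
The same model can also be solved by value iteration, which is used in Example 2 below. A minimal sketch, reusing the `model` defined above; the tolerance and iteration cap here are illustrative choices, not requirements:

```ruby
# solve the recycling robot model by value iteration instead; the tolerance
# (1e-6) and iteration cap (200) are illustrative, not prescribed values
solver = FiniteMDP::Solver.new(model, 0.95)
solver.value_iteration(1e-6, 200) # iterate until the largest value change is below 1e-6
solver.policy # expected to match the policy from policy iteration above
solver.value  # hash from each state (:high, :low) to its expected
              # discounted reward under the computed policy
```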

#### Example 2: Grid Worlds

A more complicated example: the grid world from <cite>Russell and Norvig (2003). Artificial Intelligence: A Modern Approach</cite>, Chapter 17.

Here we describe the model as a class that implements the FiniteMDP::Model interface. The model has terminal states, which we represent with a special zero-reward absorbing state called :stop.
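
For reference, a model class must implement the four methods that the solver queries. The skeleton below just names them, as exercised by the full grid world implementation that follows:

```ruby
# skeleton of the FiniteMDP::Model interface as used in this example; the
# method bodies are omitted here and filled in by the class below
class SkeletonModel
  include FiniteMDP::Model

  def states                                            # all states in the model
  end

  def actions state                                     # actions allowed in state
  end

  def transition_probability state, action, next_state  # Pr(next_state | state, action)
  end

  def reward state, action, next_state                  # reward for this transition
  end
end
```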

```ruby
require 'finite_mdp'

class AIMAGridModel
  include FiniteMDP::Model

  #
  # @param [Array<Array<Float, nil>>] grid rewards at each point, or nil if a
  #        grid square is an obstacle
  #
  # @param [Array<[i, j]>] terminii coordinates of the terminal states
  #
  def initialize grid, terminii
    @grid, @terminii = grid, terminii
  end

  attr_reader :grid, :terminii

  # every position on the grid is a state, except for obstacles, which are
  # indicated by a nil in the grid
  def states
    is, js = (0...grid.size).to_a, (0...grid.first.size).to_a
    is.product(js).select {|i, j| grid[i][j]} + [:stop]
  end

  # can move north, east, south or west on the grid
  MOVES = {
    '^' => [-1,  0],
    '>' => [ 0,  1],
    'v' => [ 1,  0],
    '<' => [ 0, -1]}

  # agent can move north, south, east or west (unless it's in the :stop
  # state); if it tries to move off the grid or into an obstacle, it stays
  # where it is
  def actions state
    if state == :stop || terminii.member?(state)
      [:stop]
    else
      MOVES.keys
    end
  end

  # define the transition model
  def transition_probability state, action, next_state
    if state == :stop || terminii.member?(state)
      (action == :stop && next_state == :stop) ? 1 : 0
    else
      # agent usually succeeds in moving forward, but sometimes it ends up
      # moving left or right
      move = case action
             when '^' then [['^', 0.8], ['<', 0.1], ['>', 0.1]]
             when '>' then [['>', 0.8], ['^', 0.1], ['v', 0.1]]
             when 'v' then [['v', 0.8], ['<', 0.1], ['>', 0.1]]
             when '<' then [['<', 0.8], ['^', 0.1], ['v', 0.1]]
             end
      move.map {|m, pr|
        m_state = [state[0] + MOVES[m][0], state[1] + MOVES[m][1]]
        m_state = state unless states.member?(m_state) # stay in bounds
        pr if m_state == next_state
      }.compact.inject(:+) || 0
    end
  end

  # reward is given by the grid cell of the current state; zero reward for
  # the :stop state
  def reward state, action, next_state
    state == :stop ? 0 : grid[state[0]][state[1]]
  end

  # helper for functions below
  def hash_to_grid hash
    0.upto(grid.size-1).map{|i| 0.upto(grid[i].size-1).map{|j| hash[[i,j]]}}
  end

  # print the values in a grid
  def pretty_value value
    hash_to_grid(Hash[value.map {|s, v| [s, "%+.3f" % v]}]).map{|row|
      row.map{|cell| cell || '      '}.join(' ')}
  end

  # print the policy using ASCII arrows
  def pretty_policy policy
    hash_to_grid(policy).map{|row| row.map{|cell|
      (cell.nil? || cell == :stop) ? ' ' : cell}.join(' ')}
  end
end

# the grid from Figures 17.1, 17.2(a) and 17.3
model = AIMAGridModel.new(
  [[-0.04, -0.04, -0.04,    +1],
   [-0.04,   nil, -0.04,    -1],
   [-0.04, -0.04, -0.04, -0.04]],
  [[0, 3], [1, 3]]) # terminals (the +1 and -1 states)

# sanity check: successor state probabilities must sum to 1
model.check_transition_probabilities_sum

solver = FiniteMDP::Solver.new(model, 1) # discount factor 1
solver.value_iteration(1e-5, 100) #=> true if converged

puts model.pretty_policy(solver.policy)
# output: (matches Figure 17.2(a))
# > > >
# ^   ^
# ^ < < <

puts model.pretty_value(solver.value)
# output: (matches Figure 17.3)
# +0.812 +0.868 +0.918 +1.000
# +0.762        +0.660 -1.000
# +0.705 +0.655 +0.611 +0.388

FiniteMDP::TableModel.from_model(model)
#=> [[0, 0], "v", [0, 0], 0.1, -0.04]
#   [[0, 0], "v", [0, 1], 0.1, -0.04]
#   [[0, 0], "v", [1, 0], 0.8, -0.04]
#   [[0, 0], "<", [0, 0], 0.9, -0.04]
#   [[0, 0], "<", [1, 0], 0.1, -0.04]
#   [[0, 0], ">", [0, 0], 0.1, -0.04]
#   [[0, 0], ">", [0, 1], 0.8, -0.04]
#   [[0, 0], ">", [1, 0], 0.1, -0.04]
#   ...
#   [:stop, :stop, :stop, 1, 0]
```
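
Since `solver.policy` and `solver.value` are plain hashes keyed by state (here, the `[i, j]` grid coordinates), individual entries can be read off directly. For example, using the results printed above:

```ruby
# read individual entries from the solution hashes; the states here are the
# [i, j] coordinates used by AIMAGridModel
solver.policy[[0, 0]] #=> ">" (the top-left cell of the printed policy)
solver.value[[0, 0]]  #=> approximately 0.812 (the top-left printed value)
```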

Note that Python code for this model is also available from the book's authors at aima.cs.berkeley.edu/python/mdp.html.

## REQUIREMENTS

Tested on

* ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]
* ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]
* ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

## INSTALLATION

```
gem install finite_mdp
```