# Class: FiniteMDP::Solver

Inherits: Object

Defined in: lib/finite_mdp/solver.rb

## Overview

Find optimal values and policies using policy iteration and/or value iteration. The methods here are suitable for finding deterministic policies for infinite-horizon problems.

The computations are carried out on an intermediate form of the given model, which is stored using nested arrays:

```
model[state_num][action_num] = [[next_state_num, probability, reward], ...]
```

The solver assigns numbers to each state and each action automatically. Note that the successor state data are stored in sparse format, and any transitions that are in the given model but have zero probability are not stored.

TODO implement backward induction for finite horizon problems

TODO maybe implement a 'dense' storage format for models with many successor states, probably as a different solver class
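For illustration, the one-step backup under this sparse storage scheme can be sketched in plain Ruby (a standalone toy example; the `backup` helper here is hypothetical, not the solver's private method):

```ruby
# Sparse intermediate form for a toy 2-state, 2-action MDP:
# model[state_num][action_num] = [[next_state_num, probability, reward], ...]
model = [
  [ # state 0
    [[0, 0.9, 1.0], [1, 0.1, 0.0]], # action 0
    [[1, 1.0, 0.0]]                 # action 1 (zero-probability successors omitted)
  ],
  [ # state 1
    [[0, 0.5, 0.0], [1, 0.5, 2.0]], # action 0
    [[1, 1.0, 1.0]]                 # action 1
  ]
]

value = [0.0, 0.0]
discount = 0.9

# One-step backup: expected reward plus discounted value of the successors.
def backup(next_state_ns, value, discount)
  next_state_ns.map do |next_state_n, pr, reward|
    pr * (reward + discount * value[next_state_n])
  end.sum
end

puts backup(model[0][0], value, discount) # prints 0.9
```

Because only nonzero-probability successors are stored, each backup touches just the entries that matter, which is what makes the sparse format efficient.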

## Instance Attribute Summary

• `#discount` ⇒ Float: Discount factor, in (0, 1].

• `#model`: The model being solved; read only; do not change the model while it is being solved.

## Instance Method Summary

• `#evaluate_policy` ⇒ Float: Refine the estimate of the value function for the current policy.

• `#evaluate_policy_exact` ⇒ nil: Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.

• `#improve_policy` ⇒ Integer: Make the policy greedy with respect to the current value function.

• `#initialize` ⇒ Solver (constructor): A new instance of Solver.

• `#policy` ⇒ Hash<state, action>: Current estimate of the optimal action for each state.

• `#policy_iteration` ⇒ Boolean: Solve with policy iteration using approximate (iterative) policy evaluation.

• `#policy_iteration_exact` ⇒ Boolean: Solve with policy iteration using exact policy evaluation.

• `#state_action_value` ⇒ Hash<[state, action], Float>: Current state-action value estimates; whereas `#value` returns $V(s)$, this returns $Q(s,a)$, in the usual notation.

• `#value` ⇒ Hash<state, Float>: Current value estimate for each state.

• `#value_iteration` ⇒ Boolean: Value iteration; call `#value_iteration_single` up to `max_iters` times until the largest change in the value function (`delta`) is less than `tolerance`.

• `#value_iteration_single` ⇒ Float: A single iteration of value iteration.

## Constructor Details

### #initialize(model, discount, policy: nil, value: Hash.new(0)) ⇒ Solver

Returns a new instance of Solver.

Parameters:

• model (Model)
• discount (Float)

in (0, 1]

• policy (Hash<state, action>, nil) (defaults to: nil)

initial policy; if nil, an arbitrary action is selected for each state

• value (Hash<state, Float>) (defaults to: Hash.new(0))

initial value for each state; defaults to zero for every state

```ruby
# File 'lib/finite_mdp/solver.rb', line 38

def initialize(model, discount, policy: nil, value: Hash.new(0))
  @discount = discount

  # get the model data into a more compact form for calculation; this means
  # that we number the states and actions for faster lookups (avoid most of
  # the hashing)
  @model = if model.is_a?(FiniteMDP::ArrayModel)
             model
           else
             FiniteMDP::ArrayModel.from_model(model)
           end

  # convert initial values and policies to compact form
  @array_value = @model.states.map { |state| value[state] }
  @array_policy = if policy
                    @model.states.map do |state|
                      @model.actions(state).index(policy[state])
                    end
                  else
                    [0] * @model.num_states
                  end

  raise 'some initial values are missing' if @array_value.any?(&:nil?)
  raise 'some initial policy actions are missing' if
    @array_policy.any?(&:nil?)

  @policy_A = nil
end
```

## Instance Attribute Details

### #discount ⇒ Float

Returns the discount factor, in (0, 1].

Returns:

• (Float)

discount factor, in (0, 1]

```ruby
# File 'lib/finite_mdp/solver.rb', line 79

def discount
  @discount
end
```

### #model

Returns the model being solved; read only; do not change the model while it is being solved.

Returns:

• the model being solved; read only; do not change the model while it is being solved

```ruby
# File 'lib/finite_mdp/solver.rb', line 74

def model
  @model
end
```

## Instance Method Details

### #evaluate_policy ⇒ Float

Refine the estimate of the value function for the current policy. This is done by iterating the Bellman equations; see also #evaluate_policy_exact for a different approach.

This is the 'policy evaluation' step in Figure 4.3 of Sutton and Barto (1998).

Returns:

• (Float)

largest absolute change (over all states) in the value function

```ruby
# File 'lib/finite_mdp/solver.rb', line 136

def evaluate_policy
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    next_state_ns = actions[@array_policy[state_n]]
    new_value = backup(next_state_ns)
    delta = [delta, (@array_value[state_n] - new_value).abs].max
    @array_value[state_n] = new_value
  end
  delta
end
```
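The sweep above can be mimicked for a toy two-state chain in plain Ruby (a standalone illustration with assumed numbers, not the gem's API), iterating the Bellman backup for a fixed policy until `delta` is small:

```ruby
# Successor lists for each state under a fixed policy, in the solver's
# sparse form: [[next_state_num, probability, reward], ...]
policy_rows = [
  [[0, 0.5, 1.0], [1, 0.5, 0.0]], # successors of state 0 under its policy action
  [[1, 1.0, 0.0]]                 # successors of state 1 under its policy action
]
discount = 0.9
value = [0.0, 0.0]

# Repeat the backup, updating values in place, until the largest change
# (delta) over all states is below a small tolerance.
delta = Float::MAX
while delta >= 1e-9
  delta = 0.0
  policy_rows.each_with_index do |successors, state_n|
    new_value = successors.map do |next_n, pr, r|
      pr * (r + discount * value[next_n])
    end.sum
    delta = [delta, (value[state_n] - new_value).abs].max
    value[state_n] = new_value
  end
end
```

Here the fixed point satisfies v(1) = 0 and v(0) = 0.5 + 0.45 v(0), so the loop converges to v(0) = 10/11 ≈ 0.909.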

### #evaluate_policy_exact ⇒ nil

Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.

This routine currently uses dense linear algebra, so it requires that the full n-by-n matrix be stored in memory. This may be a problem for moderately large n.

All of the coefficients (A and b in Ax = b) are computed on the first call; subsequent calls recompute only those rows for which the policy has changed since the last call.

Returns:

• (nil)
• (nil)

```ruby
# File 'lib/finite_mdp/solver.rb', line 162

def evaluate_policy_exact
  if @policy_A
    # update only those rows for which the policy has changed
    @policy_A_action.zip(@array_policy)
                    .each_with_index do |(old_action_n, new_action_n), state_n|
      next if old_action_n == new_action_n
      update_policy_Ab state_n, new_action_n
    end
  else
    # initialise the A and the b for Ax = b
    num_states = model.num_states
    @policy_A = NMatrix.float(num_states, num_states)
    @policy_A_action = [-1] * num_states
    @policy_b = NVector.float(num_states)
    @array_policy.each_with_index do |action_n, state_n|
      update_policy_Ab state_n, action_n
    end
  end

  value = @policy_b / @policy_A # solve linear system
  @array_value = value.to_a
  nil
end
```
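The same fixed point can be computed exactly by solving the linear system (I - discount * P) v = r, where P and r are the transition matrix and expected immediate rewards under the fixed policy. The solver itself uses NMatrix internally; this sketch uses Ruby's stdlib `matrix` with assumed toy numbers:

```ruby
require 'matrix'

# Exact policy evaluation for a toy two-state chain:
# solve (I - discount * P) v = r.
discount = 0.9
p_matrix = Matrix[[0.5, 0.5], [0.0, 1.0]] # transitions under the policy
r = Vector[0.5, 0.0]                      # expected immediate rewards

a = Matrix.identity(2) - discount * p_matrix
v = a.inverse * r # v[0] ≈ 0.909 (= 10/11), v[1] = 0.0
```

Unlike iterative evaluation, one linear solve gives the value function to machine precision, at the cost of storing the dense n-by-n matrix.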

### #improve_policy(tolerance: Float::EPSILON) ⇒ Integer

Make policy greedy with respect to the current value function.

This is the 'policy improvement' step in Figure 4.3 of Sutton and Barto (1998).

Parameters:

• tolerance (Float) (defaults to: Float::EPSILON)

non-negative tolerance; for the policy to change, the action must be at least this much better than the current action

Returns:

• (Integer)

number of states that changed

```ruby
# File 'lib/finite_mdp/solver.rb', line 199

def improve_policy(tolerance: Float::EPSILON)
  changed = 0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max + tolerance
        a_max = action_n
        v_max = v
      end
    end
    raise "no feasible actions in state #{state_n}" unless a_max
    changed += 1 if @array_policy[state_n] != a_max
    @array_policy[state_n] = a_max
  end
  changed
end
```

### #policy ⇒ Hash<state, action>

Current estimate of the optimal action for each state.

Returns:

• (Hash<state, action>)

from states to actions; read only; any changes made to the returned object will not affect the solver

```ruby
# File 'lib/finite_mdp/solver.rb', line 119

def policy
  Hash[model.states.zip(@array_policy).map do |state, action_n|
    [state, model.actions(state)[action_n]]
  end]
end
```

### #policy_iteration(value_tolerance:, policy_tolerance: value_tolerance / 2.0, max_value_iters: nil, max_policy_iters: nil) {|num_policy_iters, num_value_iters, delta| ... } ⇒ Boolean

Solve with policy iteration using approximate (iterative) policy evaluation.

Parameters:

• value_tolerance (Float)

small positive number; the policy evaluation phase ends if the largest change in the value function (`delta`) is below this tolerance

• policy_tolerance (Float) (defaults to: value_tolerance / 2.0)

small positive number; when comparing actions during policy improvement, ignore value function differences smaller than this tolerance; this helps with convergence when there are several equivalent or extremely similar actions

• max_value_iters (Integer, nil) (defaults to: nil)

terminate the policy evaluation phase after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations in each policy evaluation phase

• max_policy_iters (Integer, nil) (defaults to: nil)

terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

• (num_policy_iters, actions_changed, num_value_iters, delta)

at the end of each policy evaluation iteration

Yield Parameters:

• num_policy_iters (Integer)

policy improvement iterations done so far

• actions_changed (Integer?)

number of actions that changed in the policy improvement phase, if any

• num_value_iters (Integer)

policy evaluation iterations done so far for the current policy improvement iteration

• delta (Float)

largest change in the value function in the last policy evaluation iteration

Returns:

• (Boolean)

true iff a stable policy was obtained

```ruby
# File 'lib/finite_mdp/solver.rb', line 319

def policy_iteration(value_tolerance:,
                     policy_tolerance: value_tolerance / 2.0,
                     max_value_iters: nil, max_policy_iters: nil)
  num_actions_changed = nil
  num_policy_iters = 0
  loop do
    # policy evaluation
    num_value_iters = 0
    loop do
      value_delta = evaluate_policy
      num_value_iters += 1

      if block_given?
        yield(num_policy_iters, num_actions_changed,
              num_value_iters, value_delta)
      end
      break if value_delta < value_tolerance
      break if max_value_iters && num_value_iters >= max_value_iters
    end

    # policy improvement
    num_actions_changed = improve_policy(tolerance: policy_tolerance)
    num_policy_iters += 1
    return true if num_actions_changed == 0
    return false if max_policy_iters && num_policy_iters >= max_policy_iters
  end
end
```
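The evaluate/improve loop can be sketched end-to-end in plain Ruby on a toy two-state, two-action MDP (a standalone illustration; the model and tolerances are assumptions, not the gem's API):

```ruby
# Tiny 2-state, 2-action MDP in the sparse nested-array form.
# Action 0 stays put with reward 0; action 1 moves to the other state,
# earning reward 1 only on the 0 -> 1 transition.
model = [
  [[[0, 1.0, 0.0]], [[1, 1.0, 1.0]]], # state 0: [action 0, action 1]
  [[[1, 1.0, 0.0]], [[0, 1.0, 0.0]]]  # state 1: [action 0, action 1]
]
discount = 0.9
value = [0.0, 0.0]
policy = [0, 0]

backup = lambda do |successors|
  successors.map { |n, pr, r| pr * (r + discount * value[n]) }.sum
end

loop do
  # policy evaluation (iterative, as in #evaluate_policy)
  loop do
    delta = 0.0
    model.each_with_index do |actions, s|
      new_v = backup.call(actions[policy[s]])
      delta = [delta, (value[s] - new_v).abs].max
      value[s] = new_v
    end
    break if delta < 1e-9
  end

  # policy improvement (greedy, as in #improve_policy)
  changed = 0
  model.each_with_index do |actions, s|
    best = (0...actions.size).max_by { |a| backup.call(actions[a]) }
    changed += 1 if policy[s] != best
    policy[s] = best
  end
  break if changed.zero?
end

puts policy.inspect # prints [1, 1]: both states prefer to keep moving
```

The loop terminates when an improvement pass changes no actions, which is exactly the `num_actions_changed == 0` condition above.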

### #policy_iteration_exact(max_iters: nil) {|num_iters| ... } ⇒ Boolean

Solve with policy iteration using exact policy evaluation.

Parameters:

• max_iters (Integer, nil) (defaults to: nil)

terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

• (num_iters)

at the end of each iteration

Yield Parameters:

• num_iters (Integer)

policy improvement iterations done so far

• num_actions_changed (Integer)

number of actions that changed in the last policy improvement phase

Returns:

• (Boolean)

true iff a stable policy was obtained

```ruby
# File 'lib/finite_mdp/solver.rb', line 364

def policy_iteration_exact(max_iters: nil)
  num_iters = 0
  loop do
    evaluate_policy_exact
    num_actions_changed = improve_policy
    num_iters += 1
    yield num_iters, num_actions_changed if block_given?
    return true if num_actions_changed == 0
    return false if max_iters && num_iters >= max_iters
  end
end
```

### #state_action_value ⇒ Hash<[state, action], Float>

Current state-action value estimates; whereas `#value` returns $V(s)$, this returns $Q(s,a)$, in the usual notation.

Returns:

• (Hash<[state, action], Float>)
```ruby
# File 'lib/finite_mdp/solver.rb', line 100

def state_action_value
  q = {}
  model.states.each_with_index do |state, state_n|
    model.actions(state).each_with_index do |action, action_n|
      q_sa = model.array[state_n][action_n].map do |next_state_n, pr, r|
        pr * (r + @discount * @array_value[next_state_n])
      end.inject(:+)
      q[[state, action]] = q_sa
    end
  end
  q
end
```

### #value ⇒ Hash<state, Float>

Current value estimate for each state.

The result is converted from the solver's internal representation, so you cannot affect the solver by changing the result.

Returns:

• (Hash<state, Float>)

from states to values; read only; any changes made to the returned object will not affect the solver

```ruby
# File 'lib/finite_mdp/solver.rb', line 90

def value
  Hash[model.states.zip(@array_value)]
end
```

### #value_iteration(tolerance:, max_iters: nil) {|num_iters, delta| ... } ⇒ Boolean

Value iteration; call #value_iteration_single up to `max_iters` times until the largest change in the value function (`delta`) is less than `tolerance`.

Parameters:

• tolerance (Float)

small positive number

• max_iters (Integer, nil) (defaults to: nil)

terminate after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations

Yields:

• (num_iters, delta)

at the end of each iteration

Yield Parameters:

• num_iters (Integer)

iterations done so far

• delta (Float)

largest change in the value function in the last iteration

Returns:

• (Boolean)

true iff iteration converged to within tolerance

```ruby
# File 'lib/finite_mdp/solver.rb', line 267

def value_iteration(tolerance:, max_iters: nil)
  delta = Float::MAX
  num_iters = 0
  loop do
    delta = value_iteration_single
    num_iters += 1
    break if delta < tolerance
    break if max_iters && num_iters >= max_iters
    yield num_iters, delta if block_given?
  end
  delta < tolerance
end
```
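A minimal sketch of value iteration in plain Ruby, on a toy two-state MDP in the sparse nested-array form (numbers assumed for illustration): each sweep backs up the best action in every state and tracks the largest change.

```ruby
# Tiny 2-state, 2-action MDP: action 0 stays put (reward 0); action 1 moves
# to the other state, earning reward 1 only on the 0 -> 1 transition.
model = [
  [[[0, 1.0, 0.0]], [[1, 1.0, 1.0]]], # state 0: [action 0, action 1]
  [[[1, 1.0, 0.0]], [[0, 1.0, 0.0]]]  # state 1: [action 0, action 1]
]
discount = 0.9
value = [0.0, 0.0]
policy = [0, 0]

# Sweep until the largest change in the value function is below tolerance.
delta = Float::MAX
while delta >= 1e-9
  delta = 0.0
  model.each_with_index do |actions, s|
    backups = actions.map do |successors|
      successors.map { |n, pr, r| pr * (r + discount * value[n]) }.sum
    end
    v_max = backups.max
    delta = [delta, (value[s] - v_max).abs].max
    value[s] = v_max
    policy[s] = backups.index(v_max) # greedy action for this state
  end
end
```

Both states end up preferring action 1, matching the greedy policy that policy iteration finds on the same model.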

### #value_iteration_single ⇒ Float

A single iteration of value iteration.

This is the algorithm from Figure 4.5 of Sutton and Barto (1998). It is mostly equivalent to calling #evaluate_policy and then #improve_policy, but it is somewhat more efficient.

Returns:

• (Float)

largest absolute change (over all states) in the value function

```ruby
# File 'lib/finite_mdp/solver.rb', line 228

def value_iteration_single
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max
        a_max = action_n
        v_max = v
      end
    end
    delta = [delta, (@array_value[state_n] - v_max).abs].max
    @array_value[state_n] = v_max
    @array_policy[state_n] = a_max
  end
  delta
end
```