Class: FiniteMDP::Solver

Inherits: Object

Defined in: lib/finite_mdp/solver.rb

Overview

Find optimal values and policies using policy iteration and/or value iteration. The methods here are suitable for finding deterministic policies for infinite-horizon problems.

The computations are carried out on an intermediate form of the given model, which is stored using nested arrays:

model[state_num][action_num] = [[next_state_num, probability, reward], ...]

The solver assigns numbers to each state and each action automatically. Note that the successor state data are stored in sparse format, and any transitions that are in the given model but have zero probability are not stored.
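For illustration (the numbers below are invented), a state with two actions, one of which has two possible successors and one of which is deterministic, would be stored as:

  model[0][0]  #=> [[0, 0.75, 1.0], [1, 0.25, -1.0]]   # two successors, stored sparsely
  model[0][1]  #=> [[1, 1.0, 0.0]]                      # single deterministic successor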

TODO implement backward induction for finite horizon problems

TODO maybe implement a 'dense' storage format for models with many successor states, probably as a different solver class
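A minimal usage sketch follows. The toy model, its state and action names, and the require line are invented for illustration; only the five model methods that the constructor below actually calls (states, actions, next_states, transition_probability, reward) are assumed.

  require 'finite_mdp'   # gem/require name assumed from the file path above

  # A toy two-state, two-action MDP given as a plain Ruby object that
  # responds to the five methods the solver uses.
  class ToyModel
    def states
      [:sunny, :rainy]
    end

    def actions(state)
      [:walk, :drive]
    end

    def next_states(state, action)
      states
    end

    def transition_probability(state, action, next_state)
      next_state == :sunny ? 0.7 : 0.3
    end

    def reward(state, action, next_state)
      action == :walk && next_state == :rainy ? -1.0 : 1.0
    end
  end

  solver = FiniteMDP::Solver.new(ToyModel.new, 0.95)
  solver.value_iteration(1e-6, 1000)
  solver.policy   #=> hash from each state to its chosen action
  solver.value    #=> hash from each state to its estimated value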


Constructor Details

#initialize(model, discount, policy = nil, value = Hash.new(0)) ⇒ Solver

Returns a new instance of Solver.

Parameters:

  • model (Model)
  • discount (Float)

    in (0, 1]

  • policy (Hash<state, action>, nil) (defaults to: nil)

    initial policy; if nil, an arbitrary action is selected for each state

  • value (Hash<state, Float>) (defaults to: Hash.new(0))

    initial value for each state; defaults to zero for every state
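
For example, to seed the solver with an explicit starting policy and optimistic initial values (here model stands for any object implementing the interface described above):

  initial_policy = Hash[model.states.map { |s| [s, model.actions(s).first] }]
  initial_value  = Hash.new(10.0)   # optimistic start for every state
  solver = FiniteMDP::Solver.new(model, 0.9, initial_policy, initial_value)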


# File 'lib/finite_mdp/solver.rb', line 32

def initialize model, discount, policy=nil, value=Hash.new(0)
  @model = model
  @discount = discount

  # get the model data into a more compact form for calculation; this means
  # that we number the states and actions for faster lookups (avoid most of
  # the hashing); the 'next states' map is still stored in sparse format
  # (that is, as a hash)
  model_states = model.states
  state_to_num = Hash[model_states.zip((0...model_states.size).to_a)]
  @array_model = model_states.map {|state|
    model.actions(state).map {|action|
      model.next_states(state, action).map {|next_state|
        pr = model.transition_probability(state, action, next_state)
        [state_to_num[next_state], pr, 
          model.reward(state, action, next_state)] if pr > 0
      }.compact
    }
  }

  # convert initial values and policies to compact form
  @array_value = model_states.map {|state| value[state]}
  if policy
    action_to_num = model_states.map{|state|
      actions = model.actions(state)
      Hash[actions.zip((0...actions.size).to_a)]
    }
    @array_policy = action_to_num.zip(model_states).
                          map {|a_to_n, state| a_to_n[policy[state]]}
  else
    # default to the first action, arbitrarily
    @array_policy = [0]*model_states.size
  end

  raise 'some initial values are missing' if
    @array_value.any? {|v| v.nil?}
  raise 'some initial policy actions are missing' if
    @array_policy.any? {|a| a.nil?}

  @policy_A = nil
end

Instance Attribute Details

#model ⇒ Model (readonly)

Returns the model being solved; read only; do not change the model while it is being solved.

Returns:

  • (Model)

    the model being solved; read only; do not change the model while it is being solved


# File 'lib/finite_mdp/solver.rb', line 78

def model
  @model
end

Instance Method Details

#evaluate_policy ⇒ Float

Refine the estimate of the value function for the current policy. This is done by iterating the Bellman equations; see also #evaluate_policy_exact for a different approach.

This is the 'policy evaluation' step in Figure 4.3 of Sutton and Barto (1998).
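
In this notation, each sweep applies the Bellman backup for the current policy $\pi$ with discount factor $\gamma$ (the discount argument to the constructor) to every state:

  $V(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s)) \, [ R(s, \pi(s), s') + \gamma V(s') ]$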

Returns:

  • (Float)

    largest absolute change (over all states) in the value function


# File 'lib/finite_mdp/solver.rb', line 136

def evaluate_policy
  delta = 0.0
  @array_model.each_with_index do |actions, state_n|
    next_state_ns = actions[@array_policy[state_n]]
    new_value = backup(next_state_ns)
    delta = [delta, (@array_value[state_n] - new_value).abs].max
    @array_value[state_n] = new_value
  end
  delta
end

#evaluate_policy_exact ⇒ nil

Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.

This routine currently uses dense linear algebra, so it requires that the full n-by-n matrix be stored in memory. This may be a problem for moderately large n.

All of the coefficients (A and b in Ax = b) are computed on the first call; subsequent calls recompute only the rows for which the policy has changed since the last call.
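
In matrix form, with $P_\pi$ the transition matrix under the current policy and $r_\pi$ the vector of expected one-step rewards, the system being solved is

  $(I - \gamma P_\pi) \, v = r_\pi$

so $A = I - \gamma P_\pi$ and $b = r_\pi$ in the $Ax = b$ notation above.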

Returns:

  • (nil)

# File 'lib/finite_mdp/solver.rb', line 162

def evaluate_policy_exact
  if @policy_A    # update only those rows for which the policy has changed

    @policy_A_action.zip(@array_policy).
      each_with_index do |(old_action_n, new_action_n), state_n|
      next if old_action_n == new_action_n
      update_policy_Ab state_n, new_action_n
    end
  else
    # initialise the A and the b for Ax = b
    num_states = @array_model.size
    @policy_A = NMatrix.float(num_states, num_states)
    @policy_A_action = [-1]*num_states
    @policy_b = NVector.float(num_states)

    @array_policy.each_with_index do |action_n, state_n|
      update_policy_Ab state_n, action_n
    end
  end

  value = @policy_b / @policy_A # solve linear system
  @array_value = value.to_a
  nil
end

#improve_policy ⇒ Boolean

Make policy greedy with respect to the current value function.

This is the 'policy improvement' step in Figure 4.3 of Sutton and Barto (1998).
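
That is, each state's action is set to

  $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$

with ties broken in favour of the first maximising action found.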

Returns:

  • (Boolean)

    false iff the policy changed for any state


# File 'lib/finite_mdp/solver.rb', line 195

def improve_policy
  stable = true
  @array_model.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max
        a_max = action_n
        v_max = v
      end
    end
    raise "no feasible actions in state #{state_n}" unless a_max
    stable = false if @array_policy[state_n] != a_max
    @array_policy[state_n] = a_max
  end
  stable
end

#policy ⇒ Hash<state, action>

Current estimate of the optimal action for each state.

Returns:

  • (Hash<state, action>)

    from states to actions; read only; any changes made to the returned object will not affect the solver


# File 'lib/finite_mdp/solver.rb', line 120

def policy
  Hash[model.states.zip(@array_policy).map{|state, action_n|
    [state, model.actions(state)[action_n]]}]
end

#policy_iteration(value_tolerance, max_value_iters = nil, max_policy_iters = nil) {|num_policy_iters, num_value_iters, delta| ... } ⇒ Boolean

Solve with policy iteration using approximate (iterative) policy evaluation.

Parameters:

  • value_tolerance (Float)

    small positive number; the policy evaluation phase ends if the largest change in the value function (delta) is below this tolerance

  • max_value_iters (Integer, nil) (defaults to: nil)

    terminate the policy evaluation phase after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations in each policy evaluation phase

  • max_policy_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

  • (num_policy_iters, num_value_iters, delta)

    at the end of each policy evaluation iteration

Yield Parameters:

  • num_policy_iters (Integer)

    policy improvement iterations done so far

  • num_value_iters (Integer)

    policy evaluation iterations done so far for the current policy improvement iteration

  • delta (Float)

    largest change in the value function in the last policy evaluation iteration

Returns:

  • (Boolean)

    true iff a stable policy was obtained
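
For example, with a progress block (the tolerance and iteration limits here are arbitrary):

  found = solver.policy_iteration(1e-4, 20, 50) do |policy_iters, value_iters, delta|
    puts "improvement #{policy_iters}, evaluation #{value_iters}: delta=#{delta}"
  end
  solver.policy if found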


# File 'lib/finite_mdp/solver.rb', line 307

def policy_iteration value_tolerance, max_value_iters=nil,
  max_policy_iters=nil

  stable = false
  num_policy_iters = 0
  loop do
    # policy evaluation
    num_value_iters = 0
    loop do
      value_delta = evaluate_policy
      num_value_iters += 1

      break if value_delta < value_tolerance
      break if max_value_iters && num_value_iters >= max_value_iters
      yield num_policy_iters, num_value_iters, value_delta if block_given?
    end

    # policy improvement
    stable = improve_policy
    num_policy_iters += 1
    break if stable
    break if max_policy_iters && num_policy_iters >= max_policy_iters
  end
  stable
end

#policy_iteration_exact(max_iters = nil) {|num_iters| ... } ⇒ Boolean

Solve with policy iteration using exact policy evaluation.

Parameters:

  • max_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

  • (num_iters)

    at the end of each iteration

Yield Parameters:

  • num_iters (Integer)

    policy improvement iterations done so far

Returns:

  • (Boolean)

    true iff a stable policy was obtained
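
For example (this relies on the NMatrix/NVector support used in the source below):

  stable = solver.policy_iteration_exact(30) do |n|
    puts "exact policy iteration: #{n} improvement step(s) done"
  end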


# File 'lib/finite_mdp/solver.rb', line 346

def policy_iteration_exact max_iters=nil
  stable = false
  num_iters = 0
  loop do
    evaluate_policy_exact
    stable = improve_policy
    num_iters += 1
    break if stable
    break if max_iters && num_iters >= max_iters
    yield num_iters if block_given?
  end
  stable
end

#state_action_value ⇒ Hash<[state, action], Float>

Current state-action value estimates; whereas #value returns $V(s)$, this returns $Q(s,a)$, in the usual notation.
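
Each entry is computed from the current value estimates as

  $Q(s, a) = \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$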

Returns:

  • (Hash<[state, action], Float>)

# File 'lib/finite_mdp/solver.rb', line 99

def state_action_value
  q = {}
  states = model.states
  @array_model.each_with_index do |actions, state_n|
    state = states[state_n]
    state_actions = model.actions(state)
    actions.each_with_index do |next_state_ns, action_n|
      q_sa = next_state_ns.map {|next_state_n, pr, r|
        pr * (r + @discount * @array_value[next_state_n])}.inject(:+)
      q[[state, state_actions[action_n]]] = q_sa
    end
  end
  q
end

#value ⇒ Hash<state, Float>

Current value estimate for each state.

The result is converted from the solver's internal representation, so you cannot affect the solver by changing the result.

Returns:

  • (Hash<state, Float>)

    from states to values; read only; any changes made to the returned object will not affect the solver


# File 'lib/finite_mdp/solver.rb', line 89

def value
  Hash[model.states.zip(@array_value)]
end

#value_iteration(tolerance, max_iters = nil) {|num_iters, delta| ... } ⇒ Boolean

Value iteration; call #value_iteration_single up to max_iters times until the largest change in the value function (delta) is less than tolerance.

Parameters:

  • tolerance (Float)

    small positive number

  • max_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations

Yields:

  • (num_iters, delta)

    at the end of each iteration

Yield Parameters:

  • num_iters (Integer)

    iterations done so far

  • delta (Float)

    largest change in the value function in the last iteration

Returns:

  • (Boolean)

    true iff iteration converged to within tolerance
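
For example (the tolerance and iteration limit here are arbitrary):

  converged = solver.value_iteration(1e-6, 1000) do |n, delta|
    puts "iteration #{n}: delta=#{delta}"
  end
  solver.value if converged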


# File 'lib/finite_mdp/solver.rb', line 263

def value_iteration tolerance, max_iters=nil
  delta = Float::MAX
  num_iters = 0
  loop do
    delta = value_iteration_single
    num_iters += 1

    break if delta < tolerance
    break if max_iters && num_iters >= max_iters
    yield num_iters, delta if block_given?
  end
  delta < tolerance
end

#value_iteration_single ⇒ Float

A single iteration of value iteration.

This is the algorithm from Figure 4.5 of Sutton and Barto (1998). It is mostly equivalent to calling #evaluate_policy and then #improve_policy, but it is somewhat more efficient.
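
Each state's value is replaced with the Bellman optimality backup

  $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$

and the greedy action is recorded in the policy at the same time.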

Returns:

  • (Float)

    largest absolute change (over all states) in the value function


# File 'lib/finite_mdp/solver.rb', line 224

def value_iteration_single
  delta = 0.0
  @array_model.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max
        a_max = action_n
        v_max = v
      end
    end
    delta = [delta, (@array_value[state_n] - v_max).abs].max
    @array_value[state_n] = v_max
    @array_policy[state_n] = a_max
  end
  delta
end