Class: FiniteMDP::Solver

Inherits: Object
Defined in:
lib/finite_mdp/solver.rb

Overview

Find optimal values and policies using policy iteration and/or value iteration. The methods here are suitable for finding deterministic policies for infinite-horizon problems.

The computations are carried out on an intermediate form of the given model, which is stored using nested arrays:

model[state_num][action_num] = [[next_state_num, probability, reward], ...]

The solver assigns numbers to each state and each action automatically. Note that the successor state data are stored in sparse format, and any transitions that are in the given model but have zero probability are not stored.
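
For concreteness, a hypothetical two-state model with one action per state might be stored as follows (the state and action numbering shown is only illustrative; the solver assigns its own):

model[0][0] = [[0, 0.5, 1.0], [1, 0.5, 0.0]]  # from state 0, action 0
model[1][0] = [[1, 1.0, 0.0]]                 # from state 1, action 0; zero-probability successors omitted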

TODO implement backward induction for finite horizon problems

TODO maybe implement a 'dense' storage format for models with many successor states, probably as a different solver class

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(model, discount, policy: nil, value: Hash.new(0)) ⇒ Solver

Returns a new instance of Solver.

Parameters:

  • model (Model)
  • discount (Float)

    in (0, 1]

  • policy (Hash<state, action>, nil) (defaults to: nil)

    initial policy; if nil, an arbitrary action is selected for each state

  • value (Hash<state, Float>) (defaults to: Hash.new(0))

    initial value for each state; defaults to zero for every state


# File 'lib/finite_mdp/solver.rb', line 38

def initialize(model, discount, policy: nil, value: Hash.new(0))
  @discount = discount

  # get the model data into a more compact form for calculation; this means
  # that we number the states and actions for faster lookups (avoid most of
  # the hashing)
  @model =
    if model.is_a?(FiniteMDP::ArrayModel)
      model
    else
      FiniteMDP::ArrayModel.from_model(model)
    end

  # convert initial values and policies to compact form
  @array_value = @model.states.map { |state| value[state] }
  @array_policy =
    if policy
      @model.states.map do |state|
        @model.actions(state).index(policy[state])
      end
    else
      [0] * @model.num_states
    end

  raise 'some initial values are missing' if
    @array_value.any?(&:nil?)
  raise 'some initial policy actions are missing' if
    @array_policy.any?(&:nil?)

  @policy_A = nil
end
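
A hedged construction sketch; my_model stands for any object implementing the FiniteMDP::Model interface, and my_initial_policy is a hypothetical Hash from states to actions:

solver = FiniteMDP::Solver.new(my_model, 0.95)

# or, with an explicit starting point for the iterative methods:
solver = FiniteMDP::Solver.new(my_model, 0.95,
  policy: my_initial_policy,
  value: Hash.new(0.0))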

Instance Attribute Details

#discount ⇒ Float (readonly)

Returns discount factor, in (0, 1].

Returns:

  • (Float)

    discount factor, in (0, 1]


# File 'lib/finite_mdp/solver.rb', line 79

def discount
  @discount
end

#model ⇒ ArrayModel (readonly)

Returns the model being solved; read only; do not change the model while it is being solved.

Returns:

  • (ArrayModel)

    the model being solved; read only; do not change the model while it is being solved


# File 'lib/finite_mdp/solver.rb', line 74

def model
  @model
end

Instance Method Details

#evaluate_policy ⇒ Float

Refine the estimate of the value function for the current policy. This is done by iterating the Bellman equations; see also #evaluate_policy_exact for a different approach.

This is the 'policy evaluation' step in Figure 4.3 of Sutton and Barto (1998).
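
In the usual MDP notation, with discount factor $\gamma$ and current policy $\pi$, each call sweeps all states and applies the backup

$V(s) \leftarrow \sum_{s'} p(s' \mid s, \pi(s)) \, [r(s, \pi(s), s') + \gamma V(s')]$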

Returns:

  • (Float)

    largest absolute change (over all states) in the value function


# File 'lib/finite_mdp/solver.rb', line 136

def evaluate_policy
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    next_state_ns = actions[@array_policy[state_n]]
    new_value = backup(next_state_ns)
    delta = [delta, (@array_value[state_n] - new_value).abs].max
    @array_value[state_n] = new_value
  end
  delta
end

#evaluate_policy_exact ⇒ nil

Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.

This routine currently uses dense linear algebra, so it requires that the full n-by-n matrix be stored in memory. This may be a problem for moderately large n.

All of the coefficients (A and b in Ax = b) are computed on the first call; subsequent calls recompute only those rows for which the policy has changed since the last call.
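
In matrix form, with $P_\pi$ the transition matrix and $r_\pi$ the expected one-step rewards under the current policy, the system has the standard policy-evaluation form

$(I - \gamma P_\pi) \, v = r_\pi$

that is, $A = I - \gamma P_\pi$ and $b = r_\pi$, up to the row-by-row construction in the private update_policy_Ab helper (not shown here).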

Returns:

  • (nil)

# File 'lib/finite_mdp/solver.rb', line 162

def evaluate_policy_exact
  if @policy_A
    # update only those rows for which the policy has changed
    @policy_A_action.zip(@array_policy)
      .each_with_index do |(old_action_n, new_action_n), state_n|
      next if old_action_n == new_action_n
      update_policy_Ab state_n, new_action_n
    end
  else
    # initialise the A and the b for Ax = b
    num_states = model.num_states
    @policy_A = NMatrix.float(num_states, num_states)
    @policy_A_action = [-1] * num_states
    @policy_b = NVector.float(num_states)

    @array_policy.each_with_index do |action_n, state_n|
      update_policy_Ab state_n, action_n
    end
  end

  value = @policy_b / @policy_A # solve linear system
  @array_value = value.to_a
  nil
end

#improve_policy(tolerance: Float::EPSILON) ⇒ Integer

Make policy greedy with respect to the current value function.

This is the 'policy improvement' step in Figure 4.3 of Sutton and Barto (1998).
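
Concretely, for each state the solver selects

$\pi(s) \leftarrow \arg\max_a \sum_{s'} p(s' \mid s, a) \, [r(s, a, s') + \gamma V(s')]$

subject to the tolerance described below.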

Parameters:

  • tolerance (Float) (defaults to: Float::EPSILON)

    non-negative tolerance; for the policy to change, the action must be at least this much better than the current action

Returns:

  • (Integer)

    number of states that changed


# File 'lib/finite_mdp/solver.rb', line 199

def improve_policy(tolerance: Float::EPSILON)
  changed = 0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max + tolerance
        a_max = action_n
        v_max = v
      end
    end
    raise "no feasible actions in state #{state_n}" unless a_max
    changed += 1 if @array_policy[state_n] != a_max
    @array_policy[state_n] = a_max
  end
  changed
end

#policy ⇒ Hash<state, action>

Current estimate of the optimal action for each state.

Returns:

  • (Hash<state, action>)

    from states to actions; read only; any changes made to the returned object will not affect the solver


# File 'lib/finite_mdp/solver.rb', line 119

def policy
  Hash[model.states.zip(@array_policy).map do |state, action_n|
    [state, model.actions(state)[action_n]]
  end]
end

#policy_iteration(value_tolerance:, policy_tolerance: value_tolerance / 2.0, max_value_iters: nil, max_policy_iters: nil) {|num_policy_iters, num_actions_changed, num_value_iters, delta| ... } ⇒ Boolean

Solve with policy iteration using approximate (iterative) policy evaluation.

Parameters:

  • value_tolerance (Float)

    small positive number; the policy evaluation phase ends if the largest change in the value function (delta) is below this tolerance

  • policy_tolerance (Float) (defaults to: value_tolerance / 2.0)

    small positive number; when comparing actions during policy improvement, ignore value function differences smaller than this tolerance; this helps with convergence when there are several equivalent or extremely similar actions

  • max_value_iters (Integer, nil) (defaults to: nil)

    terminate the policy evaluation phase after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations in each policy evaluation phase

  • max_policy_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

  • (num_policy_iters, num_actions_changed, num_value_iters, delta)

    at the end of each policy evaluation iteration

Yield Parameters:

  • num_policy_iters (Integer)

    policy improvement iterations done so far

  • num_actions_changed (Integer, nil)

    number of actions that changed in the last policy improvement phase; nil before the first improvement phase

  • num_value_iters (Integer)

    policy evaluation iterations done so far for the current policy improvement iteration

  • delta (Float)

    largest change in the value function in the last policy evaluation iteration

Returns:

  • (Boolean)

    true iff a stable policy was obtained


# File 'lib/finite_mdp/solver.rb', line 319

def policy_iteration(value_tolerance:,
  policy_tolerance: value_tolerance / 2.0, max_value_iters: nil,
  max_policy_iters: nil)

  num_actions_changed = nil
  num_policy_iters = 0
  loop do
    # policy evaluation
    num_value_iters = 0
    loop do
      value_delta = evaluate_policy
      num_value_iters += 1
      if block_given?
        yield(num_policy_iters, num_actions_changed, num_value_iters,
          value_delta)
      end

      break if value_delta < value_tolerance
      break if max_value_iters && num_value_iters >= max_value_iters
    end

    # policy improvement
    num_actions_changed = improve_policy(tolerance: policy_tolerance)
    num_policy_iters += 1
    return true if num_actions_changed == 0
    return false if max_policy_iters && num_policy_iters >= max_policy_iters
  end
end
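
A hedged usage sketch (my_model and the tolerance are hypothetical; the block only reports progress):

solver = FiniteMDP::Solver.new(my_model, 0.95)
stable = solver.policy_iteration(value_tolerance: 1e-6) do |policy_iters, actions_changed, value_iters, delta|
  puts "sweep #{policy_iters}: #{actions_changed.inspect} actions changed, delta #{delta}"
end
solver.policy if stable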

#policy_iteration_exact(max_iters: nil) {|num_iters| ... } ⇒ Boolean

Solve with policy iteration using exact policy evaluation.

Parameters:

  • max_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if a stable policy has not been obtained; nil means that there is no limit on the number of iterations

Yields:

  • (num_iters)

    at the end of each iteration

Yield Parameters:

  • num_iters (Integer)

    policy improvement iterations done so far

  • num_actions_changed (Integer)

    number of actions that changed in the last policy improvement phase

Returns:

  • (Boolean)

    true iff a stable policy was obtained


# File 'lib/finite_mdp/solver.rb', line 364

def policy_iteration_exact(max_iters: nil)
  num_iters = 0
  loop do
    evaluate_policy_exact
    num_actions_changed = improve_policy
    num_iters += 1
    yield num_iters, num_actions_changed if block_given?
    return true if num_actions_changed == 0
    return false if max_iters && num_iters >= max_iters
  end
end
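
A hedged usage sketch (my_model is hypothetical; this method relies on #evaluate_policy_exact and therefore on NMatrix):

solver = FiniteMDP::Solver.new(my_model, 0.95)
stable = solver.policy_iteration_exact(max_iters: 100)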

#state_action_value ⇒ Hash<[state, action], Float>

Current state-action value estimates; whereas #value returns $V(s)$, this returns $Q(s,a)$, in the usual notation.
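
Each entry is the one-step backup, in the usual MDP notation,

$Q(s, a) = \sum_{s'} p(s' \mid s, a) \, [r(s, a, s') + \gamma V(s')]$

evaluated with the solver's current value estimates.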

Returns:

  • (Hash<[state, action], Float>)

# File 'lib/finite_mdp/solver.rb', line 100

def state_action_value
  q = {}
  model.states.each_with_index do |state, state_n|
    model.actions(state).each_with_index do |action, action_n|
      q_sa = model.array[state_n][action_n].map do |next_state_n, pr, r|
        pr * (r + @discount * @array_value[next_state_n])
      end.inject(:+)
      q[[state, action]] = q_sa
    end
  end
  q
end

#value ⇒ Hash<state, Float>

Current value estimate for each state.

The result is converted from the solver's internal representation, so you cannot affect the solver by changing the result.

Returns:

  • (Hash<state, Float>)

    from states to values; read only; any changes made to the returned object will not affect the solver


# File 'lib/finite_mdp/solver.rb', line 90

def value
  Hash[model.states.zip(@array_value)]
end

#value_iteration(tolerance:, max_iters: nil) {|num_iters, delta| ... } ⇒ Boolean

Value iteration; call #value_iteration_single up to max_iters times until the largest change in the value function (delta) is less than tolerance.

Parameters:

  • tolerance (Float)

    small positive number

  • max_iters (Integer, nil) (defaults to: nil)

    terminate after this many iterations, even if the value function has not converged; nil means that there is no limit on the number of iterations

Yields:

  • (num_iters, delta)

    at the end of each iteration

Yield Parameters:

  • num_iters (Integer)

    iterations done so far

  • delta (Float)

    largest change in the value function in the last iteration

Returns:

  • (Boolean)

    true iff iteration converged to within tolerance


# File 'lib/finite_mdp/solver.rb', line 267

def value_iteration(tolerance:, max_iters: nil)
  delta = Float::MAX
  num_iters = 0
  loop do
    delta = value_iteration_single
    num_iters += 1

    break if delta < tolerance
    break if max_iters && num_iters >= max_iters
    yield num_iters, delta if block_given?
  end
  delta < tolerance
end
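
A hedged usage sketch (my_model and the tolerance are hypothetical):

solver = FiniteMDP::Solver.new(my_model, 0.95)
converged = solver.value_iteration(tolerance: 1e-6, max_iters: 10_000)
solver.value if converged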

#value_iteration_single ⇒ Float

A single iteration of value iteration.

This is the algorithm from Figure 4.5 of Sutton and Barto (1998). It is mostly equivalent to calling #evaluate_policy and then #improve_policy, but it is somewhat more efficient.
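
Each sweep applies the greedy backup, in the usual MDP notation,

$V(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a) \, [r(s, a, s') + \gamma V(s')]$

and stores a maximising action as the current policy action for $s$.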

Returns:

  • (Float)

    largest absolute change (over all states) in the value function


# File 'lib/finite_mdp/solver.rb', line 228

def value_iteration_single
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max
        a_max = action_n
        v_max = v
      end
    end
    delta = [delta, (@array_value[state_n] - v_max).abs].max
    @array_value[state_n] = v_max
    @array_policy[state_n] = a_max
  end
  delta
end