
mdp_value_iteration

Solves a discounted MDP with the value iteration algorithm.

Calling Sequence

[policy, iter, cpu_time] = mdp_value_iteration (P, R, discount)
[policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon)
[policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon, max_iter)
[policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon, max_iter, V0)

Description

mdp_value_iteration applies the value iteration algorithm to solve a discounted MDP. The algorithm consists of solving Bellman's equation iteratively.
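
For reference, the iterative update can be written as follows; this is a standard statement of value iteration rather than a quote of the toolbox source, shown here for the (SxA) reward form, with V_n the value function at iteration n and gamma the discount factor:

    V_{n+1}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, V_n(s') \Big]

The returned policy is greedy with respect to the last computed value function: for each state it selects an action attaining this maximum.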

Iteration stops when an epsilon-optimal policy is found or after a specified number (max_iter) of iterations.

This function supports verbose and silent modes. In verbose mode, the function displays the variation of V (the value function) at each iteration and the condition that stopped the iterations: epsilon-optimal policy found or maximum number of iterations reached.
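
A minimal sketch of switching between the two modes is given below, with P and R defined as in the Examples section; mdp_verbose() appears in that example, while mdp_silent() is assumed here to be the companion function that restores the default silent mode.

    -> mdp_verbose()                                // display V variation at each iteration
    -> [policy] = mdp_value_iteration(P, R, 0.9);
    -> mdp_silent()                                 // assumption: restores the default silent mode
    -> [policy] = mdp_value_iteration(P, R, 0.9);   // no per-iteration output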

Arguments

P

transition probability array.

P can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS); see the sketch at the end of this section for the array form.

R

reward array.

R can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS) or a 2D array (SxA), possibly sparse.

discount

discount factor.

discount is a real number in [0; 1[.

epsilon (optional)

search for an epsilon-optimal policy.

epsilon is a real number in ]0; 1].

By default, epsilon = 0.01.

max_iter (optional)

maximum number of iterations to be done.

max_iter is an integer greater than 0. If the value given as argument is greater than a computed bound, a warning informs the user that the computed bound will be used instead.

By default, if discount is not equal to 1, a bound for max_iter is computed; otherwise max_iter is equal to 1000.

V0 (optional)

starting value function.

V0 is a (Sx1) vector.

By default, V0 is a vector of zeros.
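
As an illustration of the accepted argument shapes, here is a sketch reusing the S = 2, A = 2 matrices from the Examples section, but storing P as a 3-dimensional array instead of a list:

    // P as a (SxSxA) hypermatrix: one SxS transition matrix per action
    -> P = zeros(2, 2, 2);
    -> P(:, :, 1) = [ 0.5 0.5; 0.8 0.2 ];
    -> P(:, :, 2) = [ 0 1; 0.1 0.9 ];
    // R as a (SxA) array: one reward per state-action pair
    -> R = [ 5 10; -1 2 ];
    -> [policy, iter, cpu_time] = mdp_value_iteration(P, R, 0.9);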

Evaluation

policy

optimal policy.

policy is a (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.

iter

number of iterations performed.

cpu_time

CPU time used to run the program.

Examples

-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [policy, iter, cpu_time] = mdp_value_iteration(P, R, 0.9)
 cpu_time =
    0.1200
 iter =
    26
 policy =
    2
    1

-> mdp_verbose()    // set verbose mode
-> [policy] = mdp_value_iteration(P, R, 0.9)
 Iteration   V_variation
      1       8
      2       2.76
      3       1.9872
      4       1.430784
      5       1.0301645
      6       0.7417184
      7       0.5340373
      8       0.3845068
      9       0.2768449
     10       0.1993283
     11       0.1435164
     12       0.1033318
     13       0.0743989
     14       0.0535672
     15       0.0385684
     16       0.0277692
     17       0.0199939
     18       0.0143956
     19       0.0103648
     20       0.0074627
     21       0.0053731
     22       0.0038686
     23       0.0027854
     24       0.0020055
     25       0.0014440
     26       0.0010397
 MDP Toolbox : iterations stopped, epsilon-optimal policy found
 policy =
    2
    1

In the above example, P can be a list containing sparse matrices:

-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);

The function is unchanged.
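
A further sketch showing the full calling sequence with the optional arguments; the values of epsilon, max_iter and V0 below are arbitrary and chosen only for illustration.

    -> V0 = zeros(2, 1);    // start from the zero value function (the default)
    -> [policy, iter] = mdp_value_iteration(P, R, 0.9, 0.001, 500, V0);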

Authors

