Solves discounted MDP with Gauss-Seidel's value iteration algorithm.
[policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount)
[policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon)
[policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon, max_iter)
[policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon, max_iter, V0)
mdp_value_iterationGS applies Gauss-Seidel's value iteration algorithm to solve a discounted MDP. Like value iteration, the algorithm solves Bellman's equation iteratively, but the computation of Vn+1(s) is modified: Vn+1(s') is used instead of Vn(s') whenever this value has already been computed. In this way, the convergence speed is improved.
Iterating is stopped when an epsilon-optimal policy is found or after a specified number (max_iter) of iterations.
This function uses verbose and silent modes. In verbose mode, the function displays the variation of V (the value function) for each iteration and the condition that stopped the iterations: epsilon-optimal policy found or maximum number of iterations reached.
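As a rough illustration of this update rule (not part of the toolbox), the sketch below performs one Gauss-Seidel sweep for the case where P is given as a (SxSxA) array and R as a (SxA) matrix; the names gs_sweep and Q are purely illustrative.

function Vnew = gs_sweep(P, R, discount, V)
    // one Gauss-Seidel sweep of the value iteration update over all states
    S = size(R, 1);    // number of states
    A = size(R, 2);    // number of actions
    Vnew = V;
    for s = 1:S
        Q = zeros(A, 1);
        for a = 1:A
            // Vnew already holds the updated values of the states treated
            // earlier in this sweep: reusing them is the Gauss-Seidel idea
            Q(a) = R(s, a) + discount * matrix(P(s, :, a), 1, S) * Vnew;
        end
        Vnew(s) = max(Q);
    end
endfunction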
transition probability array.
P can be a 3 dimensions array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS).
reward array.
R can be a 3 dimensions array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS) or a 2D array (SxA) possibly sparse.
discount factor.
discount is a real which belongs to [0; 1[.
When discount equals 1, a warning reminds the user to check the conditions of convergence.
search for an epsilon-optimal policy.
epsilon is a real in ]0; 1].
By default, epsilon = 0.01.
maximum number of iterations to be done.
max_iter is an integer greater than 0. If the value given in argument is greater than a computed bound, a warning informs the user that the computed bound will be used instead.
By default, if discount is not equal to 1, a bound for max_iter is computed; otherwise max_iter is set to 1000.
starting value function.
V0 is a (Sx1) vector.
By default, V0 is a vector of zeros.
epsilon-optimal policy.
policy is a (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.
number of iterations performed.
CPU time used to run the program.
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [policy, iter, cpu_time] = mdp_value_iterationGS(P, R, 0.9)
 cpu_time =
    0.1900
 iter =
    28
 policy =
    2
    1

-> mdp_verbose() // set verbose mode
-> [policy] = mdp_value_iterationGS(P, R, 0.9)
 Iteration   V_variation
     1        3.8
     2        0.4464
     3        0.3696192
     4        0.3060447
     5        0.2534050
     6        0.2098193
     7        0.1737304
     8        0.1438488
     9        0.1191068
    10        0.0986204
    11        0.0816577
    12        0.0676126
    13        0.0559832
    14        0.0463541
    15        0.0383812
    16        0.0317796
    17        0.0263135
    18        0.021788
    19        0.0180401
    20        0.0149372
    21        0.0123680
    22        0.0102407
    23        0.0084793
    24        0.0070209
    25        0.0058133
    26        0.0048134
    27        0.0039855
    28        0.0033000
 MDP Toolbox : iterations stopped by maximum number of iteration condition
 policy =
    2
    1

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);
The function is unchanged.
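In the same spirit, P can also be given as a 3-dimensional (SxSxA) array, and the optional epsilon argument of the calling sequence can be supplied. The lines below are an illustrative variant of the example above (output omitted), not part of the original session:
-> P = zeros(2, 2, 2);
-> P(:, :, 1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(:, :, 2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [policy, iter, cpu_time] = mdp_value_iterationGS(P, R, 0.9, 0.001);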