Solves a discounted MDP with the modified policy iteration algorithm.
[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount)
[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount, epsilon)
[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount, epsilon, max_iter)
mdp_policy_iteration_mod applies the modified policy iteration algorithm to solve a discounted MDP. Like policy iteration, the algorithm improves the policy iteratively, but its policy evaluation step performs only a few value function updates (max_iter of them) rather than evaluating the policy exactly.
Iteration stops when an epsilon-optimal policy is found.
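The scheme described above can be sketched in a few lines of Scilab. This is only an illustration under stated assumptions, not the toolbox source: the name mpi_sketch is hypothetical, P is assumed to be a list of dense (SxS) matrices, R a dense (SxA) matrix, and the stopping rule is a simplified test on the variation of V that may differ from the test the toolbox actually applies.

```scilab
// Sketch only: mpi_sketch is a hypothetical name, not a toolbox function.
// Assumes P is a list of A dense (SxS) matrices and R is a dense (SxA) matrix.
function [V, policy, iter] = mpi_sketch(P, R, discount, epsilon, max_iter)
    [S, A] = size(R);
    V = zeros(S, 1);
    iter = 0;
    done = %f;
    while ~done
        iter = iter + 1;
        // Policy improvement: greedy one-step lookahead over all actions.
        Q = zeros(S, A);
        for a = 1:A
            Q(:, a) = R(:, a) + discount * P(a) * V;
        end
        [Vnext, policy] = max(Q, 'c');   // row-wise maximum and its action index
        // Assumed, simplified stopping rule: V barely changes any more.
        if max(abs(Vnext - V)) < epsilon then
            done = %t;
        end
        V = Vnext;
        if ~done then
            // Partial policy evaluation: max_iter backups under the fixed policy.
            Ppol = zeros(S, S);
            Rpol = zeros(S, 1);
            for s = 1:S
                Pa = P(policy(s));
                Ppol(s, :) = Pa(s, :);
                Rpol(s) = R(s, policy(s));
            end
            for k = 1:max_iter
                V = Rpol + discount * Ppol * V;
            end
        end
    end
endfunction
```

With a small max_iter each outer iteration is cheap and the method behaves much like value iteration; with a large max_iter the evaluation step approaches the exact evaluation used by standard policy iteration.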
This function supports verbose and silent modes. In verbose mode, it displays the variation of V at each iteration.
transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA) in which each element is a sparse matrix (SxS).
reward array.
R can be a 3-dimensional array (SxSxA), a list (1xA) in which each element is a sparse matrix (SxS), or a 2D array (SxA), possibly sparse.
discount factor.
discount is a real number in [0, 1).
search for an epsilon-optimal policy.
epsilon is a real number in (0, 1].
By default, epsilon = 0.01.
maximum number of iterations to be done.
max_iter is an integer greater than 0.
By default, max_iter is set to 1000.
optimal value function.
V is a (Sx1) vector.
optimal policy.
policy is a (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.
number of iterations.
CPU time used to run the program.
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [V, policy, iter, cpu_time] = mdp_policy_iteration_mod(P, R, 0.9)
 cpu_time =
    0.0500
 iter =
    5
 policy =
    2
    1
 V =
    41.865642
    35.47028

-> mdp_verbose() // set verbose mode
-> [V, policy] = mdp_policy_iteration_mod(P, R, 0.9)
   Iteration   V_variation
       1       8
       2       1.6238532
       3       0.0437728
       4       0.0011799
       5       0.0000318
 policy =
    2
    1
 V =
    41.865642
    35.47028

In the above example, P can be a list containing sparse matrices:

-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);

The function is unchanged.