Solves an MDP with average reward using the relative value iteration algorithm.
[policy, average_reward, cpu_time] = mdp_relative_value_itera (P, R)
[policy, average_reward, cpu_time] = mdp_relative_value_itera (P, R, epsilon)
[policy, average_reward, cpu_time] = mdp_relative_value_itera (P, R, epsilon, max_iter)
mdp_relative_value_itera applies the relative value iteration algorithm to solve an MDP with average reward. The algorithm consists of solving the optimality equations iteratively.
Iteration stops when an epsilon-optimal policy is found or after the specified maximum number of iterations (max_iter).
This function supports verbose and silent modes. In verbose mode, it displays the span of (Un+1 - Un) at each iteration.
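As an illustration of the iteration described above, here is a minimal Scilab sketch, assuming P is given as a full SxSxA array and R as an SxA matrix; the function name rvi_sketch and the choice of the last state as reference state are purely illustrative and do not reproduce the toolbox implementation.

```scilab
// Minimal sketch of relative value iteration (illustrative only).
// Assumes: P is a full SxSxA array, R is an SxA matrix.
function [policy, g] = rvi_sketch(P, R, epsilon, max_iter)
    S = size(R, 1);
    A = size(R, 2);
    U = zeros(S, 1);                   // relative value function, U_0 = 0
    for n = 1:max_iter
        Q = zeros(S, A);
        for a = 1:A
            Q(:, a) = R(:, a) + P(:, :, a) * U;   // one-step lookahead
        end
        [Unew, policy] = max(Q, "c");  // maximize over actions, row-wise
        g = Unew(S);                   // value at the reference state
        Unew = Unew - g;               // keep values relative to it
        variation = max(Unew - U) - min(Unew - U);  // span of (Un+1 - Un)
        U = Unew;
        if variation < epsilon then
            break                      // epsilon-optimal policy found
        end
    end
endfunction
```

The quantity g (the value of the reference state before re-centering) converges to the optimal average reward, and the span of (Un+1 - Un) is the stopping criterion mentioned above.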
P : transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA), where each list element contains a sparse matrix (SxS).
R : reward array.
R can be a 3-dimensional array (SxSxA), a list (1xA) where each list element contains a sparse matrix (SxS), or a 2D array (SxA), possibly sparse.
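For illustration, the transition data from the example at the end of this page can be stored either way; the 3-D array form below is an assumed equivalent of the list form used in that example.

```scilab
// List form (1xA), each element an SxS matrix (may be sparse):
P = list();
P(1) = [ 0.5 0.5; 0.8 0.2 ];
P(2) = [ 0   1  ; 0.1 0.9 ];

// Equivalent 3-D array form (SxSxA):
P3 = zeros(2, 2, 2);
P3(:, :, 1) = [ 0.5 0.5; 0.8 0.2 ];
P3(:, :, 2) = [ 0   1  ; 0.1 0.9 ];

// Reward as a 2D array (SxA): one row per state, one column per action.
R = [ 5 10; -1 2 ];
```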
epsilon : tolerance used to search for an epsilon-optimal policy.
epsilon is a real number in [0; 1].
By default, epsilon is set to 0.01.
max_iter : maximum number of iterations to perform.
max_iter is an integer greater than 0.
By default, max_iter is set to 1000.
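Both optional arguments can be passed explicitly, as in the call below; the tolerance and iteration cap shown are illustrative values, not defaults recommended by the toolbox.

```scilab
// Ask for a tighter tolerance and a smaller iteration budget
// (illustrative values only).
[policy, average_reward, cpu_time] = mdp_relative_value_itera(P, R, 0.001, 500);
```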
policy : optimal policy.
policy is an (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.
average_reward : average reward of the optimal policy.
average_reward is a real number.
cpu_time : CPU time used to run the program.
```scilab
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [policy, average_reward, cpu_time] = mdp_relative_value_itera(P, R)
 cpu_time  =
    0.1200
 average_reward  =
    3.8852352
 policy  =
    2
    1

-> mdp_verbose()   // set verbose mode
-> [policy, average_reward] = mdp_relative_value_itera(P, R)
  Iteration   U_variation
      1       8
      2       3.4
      3       2.72
      4       2.176
      5       1.7408
      6       1.39264
      7       1.114112
      8       0.8912896
      9       0.7130317
     10       0.5704253
     11       0.4563403
     12       0.3650722
     13       0.2920578
     14       0.2336462
     15       0.1869170
     16       0.1495336
     17       0.1196269
     18       0.0957015
     19       0.0765612
     20       0.0612490
     21       0.0489929
     22       0.0391993
     23       0.0313595
     24       0.0250876
     25       0.0200701
     26       0.0160560
     27       0.0128448
     28       0.0102759
     29       0.0082207
MDP Toolbox : iterations stopped, epsilon-optimal policy found
 average_reward  =
    3.8852
 policy  =
    2
    1
```