
mdp_policy_iteration_mod

Solves a discounted MDP with the modified policy iteration algorithm.

Calling Sequence

[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount)
[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount, epsilon)
[V, policy, iter, cpu_time] = mdp_policy_iteration_mod (P, R, discount, epsilon, max_iter)

Description

mdp_policy_iteration_mod applies the modified policy iteration algorithm to solve a discounted MDP. Like the policy iteration algorithm, it improves the policy iteratively, but in the policy evaluation step only a few iterations (max_iter) of value function updates are performed.

Iterating is stopped when an epsilon-optimal policy is found.

The function supports verbose and silent modes. In verbose mode, it displays the variation of V at each iteration.
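
The loop structure can be summarized by the following sketch (a simplified Scilab sketch, assuming P is an (SxSxA) array and R an (SxA) array; the name mpi_sketch, the n_eval argument and the plain stopping test are illustrative and do not reproduce the toolbox's exact epsilon-optimality criterion):

// Minimal sketch of modified policy iteration, not the toolbox internals.
function [V, policy] = mpi_sketch(P, R, discount, epsilon, n_eval)
    S = size(R, 1);
    A = size(R, 2);
    V = zeros(S, 1);
    variation = %inf;
    while variation > epsilon
        // Policy improvement: greedy action with respect to the current V
        Q = zeros(S, A);
        for a = 1:A
            Pa = matrix(P(:, :, a), S, S);
            Q(:, a) = R(:, a) + discount * Pa * V;
        end
        [Vnew, policy] = max(Q, 'c');       // row-wise max and argmax
        variation = max(abs(Vnew - V));
        V = Vnew;
        // Partial policy evaluation: a few value updates under the fixed
        // policy instead of solving the linear system exactly
        Ppi = zeros(S, S);
        Rpi = zeros(S, 1);
        for s = 1:S
            Ppi(s, :) = matrix(P(s, :, policy(s)), 1, S);
            Rpi(s) = R(s, policy(s));
        end
        for k = 1:n_eval
            V = Rpi + discount * Ppi * V;
        end
    end
endfunction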

Arguments

P

transition probability array.

P can be a 3-dimensional array (SxSxA) or a list (1xA), where each list element contains a sparse matrix (SxS).
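
For instance, the two-state, two-action model used in the Examples section below can be written as a 3-dimensional array instead of a list (an illustrative construction):

-> S = 2; A = 2;
-> P = zeros(S, S, A);                    // (SxSxA) transition array
-> P(:, :, 1) = [ 0.5 0.5;  0.8 0.2 ];    // transitions under action 1
-> P(:, :, 2) = [ 0 1;  0.1 0.9 ];        // transitions under action 2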

R

reward array.

R can be a 3-dimensional array (SxSxA) or a list (1xA), where each list element contains a sparse matrix (SxS) or a 2-dimensional array (SxA), possibly sparse.
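
For example, with the same model, R can be given as an (SxA) array where the entry in row s and column a is the reward for taking action a in state s:

-> R = [ 5 10;  -1 2 ];    // rows = states, columns = actions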

discount

discount factor.

discount is a real number in [0; 1[.

epsilon (optional)

tolerance used to search for an epsilon-optimal policy.

epsilon is a real number in ]0; 1].

By default, epsilon = 0.01.

max_iter (optional)

maximum number of iterations to be done.

max_iter is an integer greater than 0.

By default, max_iter is set to 1000.
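
For instance, a call that requests a tighter tolerance and a smaller iteration cap could look as follows (illustrative only; the returned values depend on the model):

-> [V, policy, iter, cpu_time] = mdp_policy_iteration_mod(P, R, 0.9, 0.0001, 500);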

Evaluation

V

optimal value function.

V is a (Sx1) vector.

policy

optimal policy.

policy is a (Sx1) vector. Each element is an integer corresponding to the action that maximizes the value function in that state.

iter

number of iterations.

cpu_time

CPU time used to run the program.

Examples

-> P = list();
-> P(1) = [ 0.5 0.5;   0.8 0.2 ];
-> P(2) = [ 0 1;   0.1 0.9 ];
-> R = [ 5 10;   -1 2 ];

-> [V, policy, iter, cpu_time] = mdp_policy_iteration_mod(P, R, 0.9)
cpu_time =
    0.0500
iter =
    5
policy =
    2
    1
V =
    41.865642
    35.47028

-> mdp_verbose()   // set verbose mode

-> [V, policy] = mdp_policy_iteration_mod(P, R, 0.9)
    Iteration   V_variation
          1                 8
          2                 1.6238532
          3                 0.0437728
          4                 0.0011799
          5                 0.0000318
policy =
    2
    1
V =
    41.865642
    35.47028

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5;  0.8 0.2 ]);
-> P(2) = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.

Authors

