
mdp_policy_iteration

Solves a discounted MDP with the policy iteration algorithm.

Calling Sequence

[V, policy, iter, cpu_time] = mdp_policy_iteration (P, R, discount)
[V, policy, iter, cpu_time] = mdp_policy_iteration (P, R, discount, policy0)
[V, policy, iter, cpu_time] = mdp_policy_iteration (P, R, discount, policy0, max_iter)
[V, policy, iter, cpu_time] = mdp_policy_iteration (P, R, discount, policy0, max_iter, eval_type)

Description

mdp_policy_iteration applies the policy iteration algorithm to solve a discounted MDP. The algorithm consists of iteratively improving the policy, using the evaluation of the current policy.

Iterating stops when two successive policies are identical or when the specified maximum number of iterations (max_iter) has been reached.

This function supports verbose and silent modes. In verbose mode, the function displays, after each iteration, the number of actions that differ between policies n-1 and n.
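As a rough illustration of the algorithm (not the toolbox source code), the following sketch shows a generic policy iteration loop, assuming P is given as a (SxSxA) array and R as a (SxA) matrix; the function name simple_policy_iteration and all variable names are chosen here for illustration only:

// Generic policy iteration sketch (illustrative only, not the toolbox implementation).
// Assumes P is SxSxA and R is SxA.
function [V, policy] = simple_policy_iteration(P, R, discount)
    S = size(P, 1); A = size(P, 3);
    policy = ones(S, 1);                        // arbitrary starting policy
    stable = %f;
    while ~stable
        // Policy evaluation: V = (I - discount*Ppol)^(-1) * Rpol
        Ppol = zeros(S, S); Rpol = zeros(S, 1);
        for s = 1:S
            a = policy(s);
            Ppol(s, :) = matrix(P(s, :, a), 1, S);
            Rpol(s) = R(s, a);
        end
        V = (eye(S, S) - discount * Ppol) \ Rpol;
        // Policy improvement: greedy policy with respect to V
        Q = zeros(S, A);
        for a = 1:A
            Q(:, a) = R(:, a) + discount * matrix(P(:, :, a), S, S) * V;
        end
        [Qmax, newpolicy] = max(Q, 'c');
        stable = and(newpolicy == policy);      // stop when two successive policies are identical
        policy = newpolicy;
    end
endfunction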

Arguments

P

transition probability array.

P can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS).

R

reward array.

R can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS) or a 2D array (SxA), possibly sparse.

discount

discount factor.

discount is a real number in [0; 1[.

policy0 (optional)

starting policy.

policy0 is a (Sx1) vector.

By default, policy0 is the policy which maximizes the expected immediate reward (see the sketch at the end of the Arguments section).

max_iter (optional)

maximum number of iterations to be done.

max_iter is an integer greater than 0.

By default, max_iter is set to 1000.

eval_type (optional)

defines the function used to evaluate a policy.

If eval_type is 0, mdp_eval_policy_matrix is used; otherwise, mdp_eval_policy_iterative is used (see the sketch at the end of the Arguments section).

By default, eval_type is set to 0.
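As a hedged illustration of the default starting policy and of the two evaluation schemes (not the toolbox source): the variable names Ppol, Rpol and the number of sweeps are chosen here for illustration, assuming R is given in (SxA) form and that Ppol (SxS) and Rpol (Sx1) are the transition matrix and reward vector induced by the current policy.

// Default starting policy: maximize the expected immediate reward (R given as SxA)
[Rmax, policy0] = max(R, 'c');

// eval_type = 0: direct (matrix) evaluation of a fixed policy
V = (eye(S, S) - discount * Ppol) \ Rpol;

// eval_type <> 0: iterative evaluation by successive approximations
V = zeros(S, 1);
for k = 1:100                          // illustrative, fixed number of sweeps
    V = Rpol + discount * Ppol * V;
end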

Evaluation

V

optimal value function.

V is a (Sx1) vector.

policy

optimal policy.

policy is a (Sx1) vector. Each element is an integer corresponding to an action which maximizes the value function.

iter

number of iterations.

cpu_time

CPU time used to run the program.

Examples

-> P = list();
-> P(1) = [ 0.5 0.5;   0.8 0.2 ];
-> P(2) = [ 0 1;   0.1 0.9 ];
-> R = [ 5 10;   -1 2 ];

-> [V, policy] = mdp_policy_iteration(P, R, 0.9)
policy =
   2
   1
V =
   42.44186
   36.046512

-> mdp_verbose()  // set verbose mode

-> [V, policy, iter, cpu_time] = mdp_policy_iteration(P, R, 0.9)
  Iteration Number_of_different_actions
        1            1
        2            0
cpu_time =
   0.02
iter =
   2
policy =
   2
   1
V =
   42.44186
   36.046512

In the above example, P can also be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5;  0.8 0.2 ]);
-> P(2) = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.
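P can also be built as a 3-dimensional array (SxSxA), and the optional arguments can be supplied as in the calling sequence above. A hypothetical equivalent construction of the same problem (the particular values chosen for policy0 and max_iter below are illustrative only):
-> P = zeros(2, 2, 2);
-> P(:, :, 1) = [ 0.5 0.5;  0.8 0.2 ];
-> P(:, :, 2) = [ 0 1;  0.1 0.9 ];
-> R = [ 5 10;   -1 2 ];

-> // starting policy [1; 1], at most 50 iterations, iterative evaluation
-> [V, policy, iter] = mdp_policy_iteration(P, R, 0.9, [1; 1], 50, 1);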

Authors

