mdp_eval_policy_iterative

Evaluates a policy using iterations of the Bellman operator.

Calling Sequence

Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon, max_iter)

Description

mdp_eval_policy_iterative evaluates the value function associated with a policy by iteratively applying the Bellman operator.
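The iteration can be pictured as repeatedly applying V <- Rpolicy + discount * Ppolicy * V, where Ppolicy and Rpolicy are the transition matrix and reward vector restricted to the actions chosen by the policy. The sketch below illustrates this scheme for a dense (SxSxA) P and a dense (SxA) R; it is an illustration only, not the toolbox implementation, and the names eval_policy_sketch, Ppolicy, Rpolicy and thresh are chosen for the sketch.

function V = eval_policy_sketch(P, R, discount, policy, thresh)
    // Restrict P and R to the actions selected by the policy.
    S = size(P, 1);
    Ppolicy = zeros(S, S);
    Rpolicy = zeros(S, 1);
    for s = 1:S
        Ppolicy(s, :) = matrix(P(s, :, policy(s)), 1, S);  // transition row under the chosen action
        Rpolicy(s) = R(s, policy(s));                      // reward of the chosen action
    end
    // Iterate the Bellman operator for the fixed policy until the
    // variation between successive iterates falls below thresh.
    V = zeros(S, 1);
    variation = %inf;
    while variation > thresh
        Vnext = Rpolicy + discount * Ppolicy * V;
        variation = max(abs(Vnext - V));
        V = Vnext;
    end
endfunction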

Arguments

P

transition probability array.

P can be a 3-dimensional array (SxSxA) or a list (1xA), where each list element contains a sparse matrix (SxS).

R

reward array.

R can be a 3-dimensional array (SxSxA), a list (1xA) where each list element contains a sparse matrix (SxS), or a 2D array (SxA), possibly sparse.

discount

discount factor.

discount is a real number in [0; 1[.

policy

a policy.

policy is a (Sx1) vector. Each element is an integer corresponding to an action.

V0 (optional)

starting point.

V0 is a (Sx1) vector representing an initial guess of the value function.

By default, V0 is a vector of zeros.

epsilon (optional)

search for an epsilon-optimal value function.

epsilon is a real greater than 0.

By default, epsilon = 0.01.

max_iter (optional)

maximum number of iterations.

max_iter is an integer greater than 0. If the given value exceeds a bound computed by the function, a warning indicates that the computed bound is used instead.

By default, max_iter = 1000.
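For illustration, here is a call that sets every optional argument explicitly, reusing the P, R and policy defined in the Examples section below; the values of V0, epsilon and max_iter are arbitrary choices for the sketch:

-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy, zeros(2, 1), 0.0001, 500);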

Evaluation

Vpolicy

value function.

Vpolicy is a (Sx1) vector.
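Vpolicy approximates the fixed point of the Bellman operator for the given policy, i.e. the solution of V = Rpolicy + discount * Ppolicy * V. As a sanity check (a sketch reusing the Ppolicy, Rpolicy and S built in the sketch of the Description section), the iterative result can be compared with a direct linear solve:

// Exact evaluation of the policy by solving the linear system
// (I - discount * Ppolicy) * V = Rpolicy; the iterative Vpolicy
// should be close to Vexact, the gap being controlled by epsilon
// and the discount factor.
Vexact = (eye(S, S) - discount * Ppolicy) \ Rpolicy;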

Examples

-> P = list();
-> P(1) = [ 0.5 0.5;   0.8 0.2 ];
-> P(2) = [ 0 1;   0.1 0.9 ];
-> R = [ 5 10;   -1 2 ];
-> policy = [2;   1];

-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy)
Vpolicy =
   23.170385
   16.463068

-> mdp_verbose()  // set verbose mode

-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy)
   Iteration   V_variation
     1      10
     2      6.24
     3      4.992
     4      3.272704
     5      2.6181632
     6      1.7992516
     7      1.4394012
     8      1.0305747
     9      0.8244598
     10      0.6100282
     11      0.4880226
     12      0.3701266
     13      0.2961013
     14      0.2285697
     15      0.1828558
     16      0.1428803
     17      0.1143042
     18      0.0900490
     19      0.0720392
     20      0.0570602
     21      0.0456481
     22      0.0362846
     23      0.0290277
     24      0.0231263
     25      0.0185010
     26      0.0147616
     27      0.0118093
     28      0.0094313
     29      0.0075451
     30      0.0060295
     31      0.0048236
     32      0.0038562
     33      0.0030849
     34      0.0024668
     35      0.0019735
     36      0.0015783
     37      0.0012627
     38      0.0010099
     39      0.0008080
     40      0.0006463
     41      0.0005170
     42      0.0004136
     43      0.0003309
     44      0.0002647
     45      0.0002117
     46      0.0001694
     47      0.0001355
     48      0.0001084
     49      0.0000867
MDP Toolbox: iterations stopped, epsilon-optimal value function
Vpolicy =
   23.170385
   16.463068

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5;  0.8 0.2 ]);
-> P(2) = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.
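P can also be given as a 3-dimensional (SxSxA) array rather than a list; the lines below sketch that form for the same example (using hypermat to preallocate the array), with the call otherwise identical:
-> P3 = hypermat([2 2 2]);               // 2x2x2 array of zeros
-> P3(:, :, 1) = [ 0.5 0.5;  0.8 0.2 ];
-> P3(:, :, 2) = [ 0 1;  0.1 0.9 ];
-> Vpolicy = mdp_eval_policy_iterative(P3, R, 0.8, policy);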

Authors

