Evaluates a policy using iterations of the Bellman operator.
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon)
Vpolicy = mdp_eval_policy_iterative(P, R, discount, policy, V0, epsilon, max_iter)
mdp_eval_policy_iterative evaluates the value function associated with a policy by applying the Bellman operator iteratively.
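As a reference, here is a minimal sketch of the iteration performed, assuming R is given in the (SxA) form and P as a list of (SxS) matrices (both shapes are described below); the helper name, its argument handling and the simplified stopping test are illustrative, not the toolbox implementation:

// Illustrative sketch only: builds the transition matrix and reward vector
// restricted to the policy, then iterates V <- Rpolicy + discount*Ppolicy*V.
function V = eval_policy_sketch(P, R, discount, policy, epsilon, max_iter)
    S = size(R, 1);
    Ppolicy = zeros(S, S);
    Rpolicy = zeros(S, 1);
    for s = 1:S
        a = policy(s);
        Pa = P(a);                      // (SxS) transition matrix of action a
        Ppolicy(s, :) = full(Pa(s, :));
        Rpolicy(s) = R(s, a);
    end
    V = zeros(S, 1);                    // same default as V0
    for k = 1:max_iter
        Vnext = Rpolicy + discount * Ppolicy * V;
        variation = max(abs(Vnext - V));
        V = Vnext;
        if variation < epsilon then     // simplified stopping test
            break
        end
    end
endfunction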
transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA), where each list element contains a sparse (SxS) matrix.
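For example (reusing the values of the example at the end of this page), a 2-state, 2-action transition model can be written in either form; the hypermatrix lines are a sketch of the (SxSxA) layout:

-> // (SxSxA) hypermatrix: P(:,:,a) is the transition matrix of action a
-> P = zeros(2, 2, 2);
-> P(:,:,1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(:,:,2) = [ 0 1; 0.1 0.9 ];
-> // equivalent list form, here with sparse matrices
-> P = list();
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);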
reward array.
R can be a 3-dimensional array (SxSxA), a list (1xA) whose elements are sparse (SxS) matrices, or a 2D array (SxA), possibly sparse.
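For example, the (SxA) form used at the end of this page stores in R(s,a) the reward for taking action a in state s; a sketch of a corresponding (SxSxA) layout, assuming the toolbox reduces it to an expected reward over the arrival state s', is:

-> // (SxA) form: R(s,a) is the reward for action a in state s
-> R = [ 5 10; -1 2 ];
-> // (SxSxA) sketch with rewards independent of the arrival state
-> R3 = zeros(2, 2, 2);
-> R3(:,:,1) = [ 5 5; -1 -1 ];
-> R3(:,:,2) = [ 10 10; 2 2 ];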
discount factor.
discount is a real number in the interval [0; 1[.
a policy.
policy is a (Sx1) vector. Each element is an integer corresponding to an action.
starting point.
V0 is a (Sx1) vector representing an initial guess of the value function.
By default, V0 is a vector of zeros.
search for an epsilon-optimal value function.
epsilon is a real greater than 0.
By default, epsilon = 0.01.
maximum number of iterations.
max_iter is an integer greater than 0. If the given value exceeds a computed bound, a warning informs the user that the computed bound is used instead.
By default, max_iter = 1000.
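As a usage sketch, the optional arguments are passed positionally; the starting point, threshold and iteration cap below are illustrative, with P, R and policy defined as in the example at the end of this page:

-> V0 = [ 20; 15 ];                // illustrative initial guess
-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy, V0, 1e-4, 50);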
value function.
Vpolicy is a (Sx1) vector.
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> policy = [2; 1];
-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy)
 Vpolicy  =
    23.170385
    16.463068

-> mdp_verbose() // set verbose mode
-> Vpolicy = mdp_eval_policy_iterative(P, R, 0.8, policy)
  Iteration   V_variation
      1       10
      2       6.24
      3       4.992
      4       3.272704
      5       2.6181632
      6       1.7992516
      7       1.4394012
      8       1.0305747
      9       0.8244598
     10       0.6100282
     11       0.4880226
     12       0.3701266
     13       0.2961013
     14       0.2285697
     15       0.1828558
     16       0.1428803
     17       0.1143042
     18       0.0900490
     19       0.0720392
     20       0.0570602
     21       0.0456481
     22       0.0362846
     23       0.0290277
     24       0.0231263
     25       0.0185010
     26       0.0147616
     27       0.0118093
     28       0.0094313
     29       0.0075451
     30       0.0060295
     31       0.0048236
     32       0.0038562
     33       0.0030849
     34       0.0024668
     35       0.0019735
     36       0.0015783
     37       0.0012627
     38       0.0010099
     39       0.0008080
     40       0.0006463
     41       0.0005170
     42       0.0004136
     43       0.0003309
     44       0.0002647
     45       0.0002117
     46       0.0001694
     47       0.0001355
     48       0.0001084
     49       0.0000867
 MDP Toolbox: iterations stopped, epsilon-optimal value function
 Vpolicy  =
    23.170385
    16.463068

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);
The function call is unchanged.