Evaluates a policy using the TD(0) algorithm.
Vpolicy = mdp_eval_policy_TD_0 (P, R, discount, policy)
Vpolicy = mdp_eval_policy_TD_0 (P, R, discount, policy, N)
mdp_eval_policy_TD_0 evaluates the value function associated with a policy using the TD(0) algorithm (Reinforcement Learning). A minimal sketch of the underlying update rule is given after the argument descriptions below.
transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS).
reward array.
R can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse matrix (SxS) or a 2D array (SxA), possibly sparse.
discount factor.
discount is a real number in [0; 1[.
a policy.
policy is a (Sx1) vector. Each element is an integer corresponding to an action.
number of iterations to perform.
N is an integer that should be greater than the default value.
By default, N is set to 10000.
value function.
Vpolicy is a (Sx1) vector.
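To make the behaviour of the evaluation concrete, here is a minimal Scilab sketch of the kind of TD(0) update such an evaluation performs; it is not the toolbox source. The function name td0_policy_eval_sketch, the 1/visits learning-rate schedule, the single non-episodic trajectory starting in state 1, the list form of P and the (SxA) reward convention are all assumptions made for illustration.

function V = td0_policy_eval_sketch(P, R, discount, policy, N)
    // Assumptions: P is a list of SxS transition matrices (one per action),
    // R is an (SxA) reward matrix, policy is an (Sx1) vector of actions.
    S = size(R, 1);
    V = zeros(S, 1);                 // current estimate of the value function
    visits = zeros(S, 1);            // per-state visit counters for the step size
    s = 1;                           // arbitrary starting state
    for n = 1:N
        a = policy(s);               // action prescribed by the fixed policy
        Pa = P(a);                   // transition matrix of that action
        // sample the next state from the probability row Pa(s, :)
        u = grand(1, 1, "def");
        csum = cumsum(full(Pa(s, :)));
        snext = find(u <= csum, 1);
        if isempty(snext) then snext = S; end   // guard against rounding
        // TD(0) update towards the one-step bootstrapped target
        visits(s) = visits(s) + 1;
        alpha = 1 / visits(s);
        V(s) = V(s) + alpha * (R(s, a) + discount * V(snext) - V(s));
        s = snext;
    end
endfunction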
-> // to reproduce the following example, it is necessary to initialize the pseudorandom number generator
-> grand('setsd', ones(625, 1))
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> Vpolicy = mdp_eval_policy_TD_0(P, R, 0.9, [1; 2])
Vpolicy =
   43.088729
   20.887261

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);
The function call is unchanged.
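The optional fifth argument sets the number of TD iterations (10000 by default). For instance, reusing the P and R defined above, one could run a longer evaluation; the value 50000 here is only illustrative:
-> Vpolicy = mdp_eval_policy_TD_0(P, R, 0.9, [1; 2], 50000)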