Applies the Bellman operator to a value function Vprev and returns a new value function and a Vprev-improving policy.
[V, policy] = mdp_bellman_operator(P, PR, discount, Vprev)
mdp_bellman_operator applies the Bellman operator: PR + discount*P*Vprev to the value function Vprev.
Returns a new value function and a Vprev-improving policy.
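To make the computation explicit, the operator evaluates, for every state s and action a, Q(s,a) = PR(s,a) + discount * P(:,:,a)*Vprev, and then takes the best action in each state. The following is only an illustrative sketch of that computation, not the toolbox implementation; the names bellman_sketch, Q, S and A are introduced here for illustration, and P is assumed to be given in its (SxSxA) array form with PR a full (SxA) array.

function [V, policy] = bellman_sketch(P, PR, discount, Vprev)
    // number of states S and actions A, taken from the reward array
    [S, A] = size(PR);
    Q = zeros(S, A);
    for a = 1:A
        // expected discounted return of taking action a in every state
        Q(:, a) = PR(:, a) + discount * P(:, :, a) * Vprev;
    end
    // row-wise maximum: best value and index of the maximizing action per state
    [V, policy] = max(Q, 'c');
endfunction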
P : transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse (SxS) matrix; a sketch of the array form is shown after the argument descriptions, and the example further below uses the list form.
PR : reward array.
PR can be a 2D array (SxA), possibly sparse.
discount : discount factor.
discount is a real number belonging to ]0; 1].
Vprev : value function.
Vprev is a (Sx1) vector.
V : new value function.
V is a (Sx1) vector.
policy : a policy.
policy is a (Sx1) vector. Each element is an integer corresponding to an action.
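For illustration, the (SxSxA) array form of P mentioned above could be built as follows; the values match the list-form example below, and the three-argument zeros call assumes a Scilab version with hypermatrix support.

-> P = zeros(2, 2, 2);
-> P(:, :, 1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(:, :, 2) = [ 0 1; 0.1 0.9 ];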
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [V, policy] = mdp_bellman_operator(P, R, 0.9, [0;0])
 policy =
    2
    2
 V =
    10
    2

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);
The function call and its output are unchanged.
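Repeated application of the Bellman operator is the core of value iteration. The following is only an illustrative sketch reusing P and R from the example above; the iteration bound and the tolerance tol are arbitrary illustrative choices, not part of the toolbox interface.

discount = 0.9;
tol = 1e-8;            // illustrative convergence tolerance
V = zeros(2, 1);       // start from the zero value function
for i = 1:1000
    Vprev = V;
    [V, policy] = mdp_bellman_operator(P, R, discount, Vprev);
    // stop once successive value functions are numerically indistinguishable
    if max(abs(V - Vprev)) < tol then
        break;
    end
end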