Solves a discounted MDP with the Q-learning algorithm (reinforcement learning).
[Q, V, policy, mean_discrepancy] = mdp_Q_learning(P, R, discount)
[Q, V, policy, mean_discrepancy] = mdp_Q_learning(P, R, discount, N)
mdp_Q_learning computes the Q matrix and the mean discrepancy, and returns the optimal value function and the optimal policy when allotted enough iterations. It uses an iterative method.
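The iterative method is the standard tabular Q-learning update: after each sampled transition, the visited entry of Q moves toward the observed reward plus the discounted best value of the next state. The following Python/NumPy sketch is illustrative only, not the toolbox source; the uniform random exploration and the decaying learning rate are assumptions, and the problem data mirror the example further down this page:

```python
import numpy as np

# Illustrative Q-learning sketch (hypothetical, not mdp_Q_learning itself).
# P is indexed (action, state, next state); R is (state, action).
rng = np.random.default_rng(0)
P = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[0.0, 1.0], [0.1, 0.9]]])
R = np.array([[5.0, 10.0], [-1.0, 2.0]])
discount = 0.9
S, A = R.shape

Q = np.zeros((S, A))
s = rng.integers(S)
for n in range(10000):
    a = rng.integers(A)                     # uniform exploration (assumed)
    s_next = rng.choice(S, p=P[a, s])       # sample next state from P
    lr = 1.0 / np.sqrt(n + 2)               # decaying learning rate (assumed)
    td = R[s, a] + discount * Q[s_next].max() - Q[s, a]
    Q[s, a] += lr * td                      # move Q toward the TD target
    s = s_next

V = Q.max(axis=1)          # value function: best action value per state
policy = Q.argmax(axis=1)  # greedy policy (0-based here; Scilab is 1-based)
```

With enough iterations the greedy policy extracted from Q stabilizes, which is what "when allotted enough iterations" refers to above.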
No additional display in verbose mode.
P : transition probability array.
P can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse (SxS) matrix.
R : reward array.
R can be a 3-dimensional array (SxSxA) or a list (1xA), each list element containing a sparse (SxS) matrix, or a 2D array (SxA), possibly sparse.
discount : discount factor.
discount is a real number in [0; 1[.
N (optional) : number of iterations to perform.
N is an integer that must be greater than the default value.
By default, N is set to 10000.
Q : action-value function, giving the expected utility of taking a given action in a given state and following an optimal policy thereafter.
Q is an (SxA) matrix.
mean_discrepancy : discrepancy means over blocks of 100 iterations.
mean_discrepancy is a vector of the means of the V discrepancies computed over successive blocks of 100 iterations. For the default value of N, the vector therefore has length 100.
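The blocking described above can be sketched as follows. This is an illustrative NumPy fragment, not the toolbox code; the per-iteration discrepancy trace is a stand-in, and the exact discrepancy measure used internally is an assumption here:

```python
import numpy as np

# Hypothetical stand-in for the per-iteration V discrepancy trace
# produced during N = 10000 Q-learning iterations.
N = 10000
discrepancy = np.abs(np.random.default_rng(1).normal(size=N))

# Average the trace over successive blocks of 100 iterations:
# N / 100 = 100 block means, matching the documented vector length.
mean_discrepancy = discrepancy.reshape(-1, 100).mean(axis=1)
```

For any N that is a multiple of 100, the result has N/100 entries.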
V : value function.
V is an (Sx1) vector.
policy : optimal policy.
policy is an (Sx1) vector. Each element is an integer corresponding to an action that maximizes the value function.
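Both V and policy follow directly from Q by maximizing over the action dimension. A minimal NumPy sketch (illustrative, not the toolbox code), using the Q matrix printed in the example session below; 1 is added to the argmax because Scilab actions are numbered from 1:

```python
import numpy as np

# Q matrix from the example run (rows = states, columns = actions).
Q = np.array([[39.829611,  3.6570905],
              [19.883102, 22.363265]])

V = Q.max(axis=1)              # best action value in each state
policy = Q.argmax(axis=1) + 1  # +1: Scilab actions are 1-based

print(V)       # [39.829611 22.363265]
print(policy)  # [1 2]
```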
-> // to reproduce the following example, it is necessary to init the pseudorandom number generator
-> grand('setsd', ones(625,1))
-> P = list();
-> P(1) = [ 0.5 0.5; 0.8 0.2 ];
-> P(2) = [ 0 1; 0.1 0.9 ];
-> R = [ 5 10; -1 2 ];
-> [Q, V, policy, mean_discrepancy] = mdp_Q_learning(P, R, 0.9);
-> Q
 Q =
    39.829611    3.6570905
    19.883102    22.363265
-> V
 V =
    39.829611
    22.363265
-> policy
 policy =
    1
    2
-> plot2d(mean_discrepancy)
(graph of mean_discrepancy not shown)

In the above example, P can be a list containing sparse matrices:
-> P(1) = sparse([ 0.5 0.5; 0.8 0.2 ]);
-> P(2) = sparse([ 0 1; 0.1 0.9 ]);
The function call is unchanged.