Journal of Computer Science 1 (1): 28-30, 2005

In this article, we present the Q-Learning method, a form of reinforcement learning sometimes called learning by penalty and reward. We illustrate it with an application to the movement of a mobile inside a closed enclosure, from a starting point to an arbitrary arrival point. The objective is to find an optimal path without leaving the enclosure.


INTRODUCTION
The inherent difficulty of constructing a training database represents an operational limit of certain types of learning, such as supervised learning [1,2,3]. Reinforcement learning is a possible alternative to having an operator define the training database. One of its advantages resides in the form of its training examples. They are triplets (input, output, quality), where the last component represents the utility of producing a given "output" for a given "input". The learning examples are generated automatically during a phase known as "exploration", which generally consists of a random exploration of the search space [4].

Reinforcement Learning:
Reinforcement learning, sometimes called learning with a critic, is a weakly supervised form of learning [5,6]. In a first stage, learning is observed not directly but through behavior; in a second stage, the learner no longer merely responds to an environmental stimulus but acts on the environment (Fig. 1). In this model, at each interaction with its environment, the learner, represented by the robot and characterized by its behavior B, perceives the state S in which it finds itself by means of the function I(S). Based on its perception of the state, the learner then chooses an action U among all the actions available in this state according to a probability P. When this action is applied to the environment, the system changes state and the learner receives a reinforcement signal r through the reinforcement function R [7]. The objective of reinforcement learning is thus to find the output of greatest utility. The success of the application depends on the quality of the function specifying the utility of a pair (input, output). This utility function is usually called the reinforcement function [2].
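The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function `run_interaction`, its arguments and the toy counter environment are hypothetical names introduced here. Each pass through the loop produces one (input, output, quality) triplet of the kind mentioned in the introduction.

```python
import random

def run_interaction(perceive, choose_action, transition, reinforcement,
                    state, steps, rng):
    """Generic perceive / act / reinforce loop; returns (input, output, quality) triplets."""
    triplets = []
    for _ in range(steps):
        observation = perceive(state)                  # I(S): perception of the state
        action = choose_action(observation, rng)       # U: action drawn according to P
        next_state = transition(state, action)         # the environment changes state
        r = reinforcement(state, action, next_state)   # R: reinforcement signal
        triplets.append((observation, action, r))
        state = next_state
    return triplets

# Hypothetical toy environment: an integer counter that should be driven towards 0.
triplets = run_interaction(
    perceive=lambda s: s,
    choose_action=lambda obs, rng: rng.choice([-1, +1]),
    transition=lambda s, a: s + a,
    reinforcement=lambda s, a, s2: 1.0 if abs(s2) < abs(s) else -1.0,
    state=3, steps=5, rng=random.Random(0),
)
```

Here the action is chosen uniformly at random, which corresponds to pure exploration; a learned policy would replace `choose_action`.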
Q-Learning:
Q-Learning was proposed for Markov decision problems with discrete state and action spaces. At each time step, an agent observes the state vector $x_t$, then chooses and applies an action $u_t$. The system passes to state $x_{t+1}$ and the agent receives a reinforcement $r(x_t, u_t)$. The goal of learning is to find the control policy that maximizes the sum of future reinforcements. For a given policy $\pi$, we denote by $u_t = \pi(x_t)$ the selected action. The evaluation function of $\pi$, denoted $V^{\pi}$, is given by:

$$V^{\pi}(x) = E\left[\sum_{t=0}^{\infty} \gamma^{t}\, r\big(x_t, \pi(x_t)\big) \,\middle|\, x_0 = x\right]$$

The parameter $\gamma$, $0 \le \gamma < 1$, ensures the convergence of the sum. The optimal evaluation function $V^{*}$, which corresponds to an optimal policy, satisfies Bellman's optimality equation:

$$V^{*}(x) = \max_{u \in U_x} Q^{*}(x,u), \qquad Q^{*}(x,u) = r(x,u) + \gamma \sum_{y} P_{xy}(u)\, V^{*}(y)$$

where:
$U_x$ = set of possible actions in state x
$P_{xy}(u)$ = probability of passing from state x to state y under action u
$Q^{*}(x,u)$ = total reinforcement obtained if action u is selected in state x and an optimal policy is followed thereafter. It is called the quality function for a (state, action) pair.

If the transition probabilities $P_{xy}(u)$ and the reinforcement law $r(x,u)$ are known, an optimal policy can be found with a dynamic programming algorithm. Instead of using the evaluation function, Watkins proposed to estimate $Q^{*}$ by a function $(x,u) \mapsto Q(x,u)$ which is updated at each transition by [8,9]:

$$Q(x_t,u_t) \leftarrow Q(x_t,u_t) + \beta \left[ r(x_t,u_t) + \gamma \max_{u \in U_{x_{t+1}}} Q(x_{t+1},u) - Q(x_t,u_t) \right]$$

$\beta$ is a learning parameter which must tend towards 0 as t tends to infinity.
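As an illustration, one update of the quality table can be written as below. This is a minimal sketch that assumes the usual tabular form of Watkins' rule, Q(x,u) ← Q(x,u) + β[r + γ max_v Q(x',v) − Q(x,u)]; the function name `q_update`, the state labels and the numerical values of β and γ are hypothetical choices made for the example.

```python
from collections import defaultdict

def q_update(Q, x, u, r, x_next, next_actions, beta, gamma):
    """Apply one Q-Learning update for the observed transition (x, u) -> x_next."""
    # Quality of the best action available in the next state
    # (0 if no action is available there, e.g. a terminal state).
    best_next = max((Q[(x_next, v)] for v in next_actions), default=0.0)
    # Move Q(x, u) a fraction beta of the way towards the one-step target.
    Q[(x, u)] += beta * (r + gamma * best_next - Q[(x, u)])
    return Q[(x, u)]

Q = defaultdict(float)   # every quality starts at 0
new_value = q_update(Q, x="s0", u="down", r=1.0, x_next="s1",
                     next_actions=["up", "down"], beta=0.5, gamma=0.9)
```

With all qualities initialised to 0, a reward of 1 and β = 0.5, the first update moves Q("s0", "down") halfway from 0 to the target value 1. A fixed β is used here for simplicity, whereas the text requires β to decrease towards 0 over time.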

Exploration/Exploitation:
The learning base is generated in parallel with the exploration phase, so learning is incremental. Consequently, once a representative learning base has been built, learning is complete. The optimal policy is obtained by choosing, in each state, the action which maximizes the quality function; this is called the Greedy policy:

$$\pi(x) = \arg\max_{u \in U_x} Q(x,u)$$

At the beginning of learning the values Q(x,u) are not significant and the Greedy policy is not applicable. To obtain a useful estimate of Q, it is necessary to sweep and evaluate all the possible actions for all the states; this is the exploration phase. An exploitation phase is then started once learning is finished. The exploration/exploitation policy PEE(x) can be chosen, following Glorennec [7], from the methods below.

Pseudo-stochastic Method:
The action with the best value of Q is selected with probability P; otherwise, an action is selected at random among all the possible actions in the given state.
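The pseudo-stochastic rule can be sketched as follows. The function name `pseudo_stochastic_action`, the dictionary layout of Q and the example values are assumptions made for illustration; untested (state, action) pairs are taken to have quality 0.

```python
import random

def pseudo_stochastic_action(Q, x, actions, p, rng=random):
    """With probability p return the action of best quality in state x,
    otherwise return an action drawn uniformly among all possible actions."""
    if rng.random() < p:
        return max(actions, key=lambda u: Q.get((x, u), 0.0))
    return rng.choice(actions)

Q = {("s0", "up"): -0.9, ("s0", "down"): 1.0}
actions = ["up", "down", "left", "right"]
greedy_choice = pseudo_stochastic_action(Q, "s0", actions, p=1.0)
```

Setting p = 1 reduces the rule to the Greedy policy, while p = 0 gives a purely random exploration; intermediate values trade one against the other.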

Boltzmann Distribution:
The action u is selected with the probability:

$$P(u \mid x) = \frac{\exp\big(Q(x,u)/T\big)}{\sum_{v \in U_x} \exp\big(Q(x,v)/T\big)}$$

T is comparable to a temperature and sets the importance of the random factor. This parameter decreases over time.

Algorithm:
After the choice of an exploration/exploitation policy, the algorithm proceeds as shown in Fig. 2.

The user chooses an arbitrary starting point S and an arrival point A. The mobile passes through two phases:
* An exploration phase, in which the mobile tries to move according to four possible choices; each action generates a reward or a punishment according to whether the mobile approaches the goal without leaving the matrix (Fig. 4).
* An exploitation phase (Fig. 5), in which the mobile moves towards its arrival point by using the actions of maximum quality learned during the first phase.

In this simulation, 39 iterations were needed for the exploration phase and only 6 for the exploitation phase to go from the starting point (1st line, 1st column) to the arrival point (3rd line, 5th column).

The matrix of qualities can be interpreted as follows. When the mobile is in the first line, first column, going down has the best quality (q = 1.0); going up or to the left is punished (q = -0.9), so these actions are to be avoided; the first action (q = 0) has not been tested yet.
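The two phases can be sketched as below, with Boltzmann selection (probabilities proportional to exp(Q/T)) used during exploration. This is a hedged illustration rather than the simulation reported above: the 5 x 5 matrix size, the rewards of +1 and -1, the values β = 0.5 and γ = 0.5, the temperature schedule, the number of episodes and all function names are assumptions, since the text does not specify them. A mobile that tries to leave the matrix is assumed to stay in place.

```python
import math
import random

ROWS, COLS = 5, 5                  # assumed matrix size (not given in the text)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
START, GOAL = (0, 0), (2, 4)       # 1st line, 1st column -> 3rd line, 5th column

def distance(cell):
    """Manhattan distance from a cell to the arrival point."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])

def step(cell, action):
    """Return (next_cell, reward): +1 if the move approaches the goal, -1 otherwise."""
    dr, dc = ACTIONS[action]
    nxt = (cell[0] + dr, cell[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return cell, -1.0          # leaving the matrix is punished; the mobile stays put
    return nxt, (1.0 if distance(nxt) < distance(cell) else -1.0)

def boltzmann_probs(q_values, T):
    """Selection probabilities proportional to exp(Q / T)."""
    m = max(q_values)              # subtracting the max keeps exp() well-behaved
    weights = [math.exp((q - m) / T) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def explore(episodes=500, max_steps=30, beta=0.5, gamma=0.5, seed=0):
    """Exploration phase: the mobile moves from START and updates Q at each step."""
    rng = random.Random(seed)
    names = list(ACTIONS)
    Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in names}
    for ep in range(episodes):
        T = max(0.05, 1.0 - ep / episodes)      # temperature decreases over time
        cell = START
        for _ in range(max_steps):
            if cell == GOAL:
                break
            probs = boltzmann_probs([Q[(cell, a)] for a in names], T)
            action = rng.choices(names, weights=probs)[0]
            nxt, reward = step(cell, action)
            # The arrival point is terminal, so nothing is added beyond it.
            best_next = 0.0 if nxt == GOAL else max(Q[(nxt, a)] for a in names)
            Q[(cell, action)] += beta * (reward + gamma * best_next - Q[(cell, action)])
            cell = nxt
    return Q

def exploit(Q, max_steps=50):
    """Exploitation phase: follow the action of maximum quality from START."""
    path = [START]
    while path[-1] != GOAL and len(path) <= max_steps:
        cell = path[-1]
        best = max(ACTIONS, key=lambda a: Q[(cell, a)])
        path.append(step(cell, best)[0])
    return path

Q = explore()
path = exploit(Q)
```

Under these assumptions the shortest route from the first line, first column to the third line, fifth column is 6 moves long, which matches the 6 exploitation iterations reported above. The number of exploration steps depends on the assumed parameters and is not meant to reproduce the 39 iterations of the simulation.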