Transfer Learning in Attack Avoidance Games

Corresponding Author: Edwin Torres, Department of Electrical and Electronic Engineering, University of Los Andes, Bogota, Colombia. Email: ed.torres20@uniandes.edu.co

Abstract: The ability to transfer knowledge is a human characteristic that has been replicated in machine learning algorithms to improve learning performance. However, little success has been achieved in reinforcement learning tasks when function approximation is needed to estimate the value functions. In this study, we present a new strategy to facilitate knowledge transfer when an agent learns to solve a sequence of tasks of increasing difficulty. We show that the task sequence is an effective way to segment the function approximation hypothesis space, allowing faster learning, especially in the last task of the sequence. Moreover, the sequence allows the design of a similarity function that helps the agent decide autonomously when it is most appropriate to use transfer. We empirically show the importance of the presence of all the tasks in the established ordering to achieve the greatest improvement in the learning time of the last task.


Introduction
Reinforcement learning algorithms require a large amount of data collected through the agent-environment interaction. This need for a large volume of exploration to find adequate solutions restricts the applicability of the methods, a situation that becomes more critical when a function approximator is required to estimate the value functions. In this context, knowledge transfer is used to facilitate learning in new tasks based on previously acquired knowledge. In this way, it is possible to guide the exploration through mechanisms that relate past experiences to those found in new tasks and provide advice to the agent when selecting actions in specific states.

In this study, we propose a strategy for constructing a task sequence that facilitates function approximation and the use of transfer learning. This strategy produces a better learning rate for the agent in the final task of the sequence. We start from a fact that is generally found in human learning processes (Piaget, 1963; Vygotsky and Kozulin, 1962): Tasks must be organized according to their level of difficulty and each new task must expand the knowledge of the previous task. Under this scheme, it is necessary to consider the representation capacity of the function approximators used, since the amount of exploration required to converge to an acceptable solution depends on their structure. In our method, we propose the use of structures whose complexity increases with the difficulty of the task to be solved. Thus, easy tasks use simple structures and difficult tasks use more complex structures. For example, when neural networks are used, the increase in complexity is achieved by adding more neurons to the structure of the network. We built an experimental framework in which an agent learns in a sequence of related tasks ordered by increasing difficulty (Madden and Howley, 2004; Taylor et al., 2007a) and, with the help of a similarity measure, determines autonomously when it is most beneficial to use transfer.

The rest of this document is organized as follows: Section 2 (Background) briefly presents the reinforcement learning framework, Section 3 (Related work) gives a brief summary of previous work in transfer learning applied to reinforcement learning, Section 4 (Proposed method) gives a detailed description of the proposed transfer strategy and Section 5 (Experimental results) presents the results of the experiments on attack avoidance tasks. Finally, Section 6 (Conclusion) summarizes the principal findings of this work.

Background
The tasks considered in this study can be formally described as sequential decision problems. Each problem consists of a series of decisions that lead to a final state in which the agent evaluates whether the decisions taken were appropriate, that is, whether some goal was achieved or not. Mathematically, this process can be modeled as a Markov Decision Process (MDP) (Puterman, 1994). An MDP is specified by the 5-tuple ⟨S, A, P, R, γ⟩. S is the set of possible states. A is the set of actions. P is a (possibly stochastic) transition function P: S × A × S → [0, 1], which gives the probability that taking action a in state s leads to state s'. R is the reward function R: S × A × S → ℝ, which maps each state-action pair and the resulting state s' to a real number, the instantaneous reward. γ ∈ [0, 1) is the discount factor for future rewards. At each time step, an action a_t is taken according to the current policy π: S → A, which maps states to actions. If the MDP is episodic (as is assumed in this study), it begins in a start state and a series of actions is taken until a terminal state, referred to as a goal state, is reached. Given the MDP specification, the problem is to maximize the expected sum of discounted rewards. The reward R_t represents a one-step measure of performance, that is, how good it was to take action a_t when the agent was in state s_t. The return G_t is the sum of discounted rewards and is defined as:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Maximizing the expected value of G_t (a long-term measure of performance) implies finding an optimal policy π*. This policy allows the agent to select the best possible action for each state according to the maximization. Given the stochastic nature of the MDP, we are interested in the expected value of G_t, which can be characterized in two ways:

- The state value function: The expected sum of discounted rewards given the initial state s, following the policy π: V^π(s) = E_π[G_t | s_t = s]
- The state-action value function: The expected sum of discounted rewards given the initial state s and action a, following the policy π thereafter: Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]

The problem of finding an optimal policy π* is solved through the maximization of one of these functions.
An MDP can be solved through Dynamic Programming (DP) (Bellman, 1957), but complete and correct knowledge of the transitions and rewards is required. DP iteratively computes approximations of the true value function, improving them over time. However, the full-knowledge requirement cannot always be met, especially in MDPs with large, high-dimensional state and action spaces where it is unfeasible to determine the dynamics P. Additionally, as the number of states increases, the computational requirements make DP intractable. Reinforcement Learning (RL) (Sutton and Barto, 1998), also known as Approximate Dynamic Programming (ADP), offers a powerful set of tools for sequential decision tasks with large state-action spaces. Most RL methods are based on Temporal Difference (TD) learning, such as Q-learning (Watkins, 1989) and SARSA (Rummery and Niranjan, 1994), in which the solution is learned by backing up experienced rewards through time, resulting in an estimated state-action value function Q. The current best policy is derived from Q by selecting the action that maximizes the value for the current state:

π(s) = argmax_a Q(s, a)

The agent taking the sequential decisions (actions) must balance exploration, where the agent chooses a random action to observe different states and learn more about the environment, and exploitation, where the agent selects actions according to the current policy (the current best action). A basic strategy that balances these two options is ε-greedy action selection: The agent selects a random action with probability ε and the current best action with probability 1-ε (where ε ∈ [0, 1]).
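To make the TD update and the ε-greedy selection described above concrete, the following minimal sketch implements tabular SARSA in Python. The environment interface (reset() returning a state index, step(a) returning the next state, the reward and a termination flag) and the table sizes are illustrative assumptions, not the setting used later in this paper.

import numpy as np

# Minimal tabular SARSA sketch with epsilon-greedy action selection.
def epsilon_greedy(Q, s, epsilon, n_actions, rng):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit the current estimate

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, n_actions, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, n_actions, rng)
            # Back up the experienced reward through time (TD update).
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q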
The agent interacts with the environment and the value function is updated for each new sample (s_t, a_t, r_t, s_{t+1}), one at a time. For this reason, a large amount of experience is needed to obtain a near-optimal value function, solve the task and produce an optimal policy.

Function Approximation in Reinforcement Learning
In problems where the number of states and available actions is large, the computational requirements (computation time and storage) needed to solve an RL problem grow exponentially. In these cases, it is necessary to use approximation techniques to construct a compact representation of the value functions, which can be parametric or nonparametric. A parametric approximator for Q can be written as:

Q̂(s_t, a_t) = w^T Φ(s_t, a_t)

where the Q function approximator is parameterized by an n-dimensional vector w and a set of n basis functions Φ(s_t, a_t) that extract pre-defined characteristics from the state-action pair. The approximation is built using samples collected from the interaction between the agent and the MDP (environment). With these samples, a regression is carried out to optimize the function approximator over the state-action space, using an error measure over the differences of the value function estimates in each iteration. In RL algorithms, a common error measure is the Least Mean Square (LMS) error and the optimization is usually done through gradient descent.
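As an illustration of this parametric scheme, the sketch below performs one stochastic gradient step on the squared TD error for a linear approximator; the feature function phi is an assumed placeholder, not the basis functions used in our experiments.

import numpy as np

# Linear parametric approximator Q(s, a) ≈ w^T Φ(s, a) updated by one gradient step
# on the squared TD error (the LMS objective mentioned above).
def q_hat(w, phi, s, a):
    return float(np.dot(w, phi(s, a)))

def td_gradient_step(w, phi, sample, gamma=0.99, lr=0.01):
    s, a, r, s_next, a_next = sample                                   # one SARSA-style transition
    td_error = r + gamma * q_hat(w, phi, s_next, a_next) - q_hat(w, phi, s, a)
    return w + lr * td_error * phi(s, a)                               # gradient step on the weights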
The function approximator can be linear or nonlinear in the parameters. Typically, neural networks (nonlinear) and radial basis functions (linear in the parameters) are used to implement approximators. The selection of a specific method depends on the problem and on the generalization capabilities of the function approximator. Algorithms for RL problems with continuous or large state spaces include: TD learning with function approximation, Policy Gradient (Sutton et al., 2000), LSTD (Boyan, 1999), LSPI (Lagoudakis and Parr, 2003), batch methods such as FQI (Ernst et al., 2005) and Deep Q-learning (Mnih et al., 2013).

Transfer Difficulties in RL
As mentioned before, solving tasks with large state or action spaces requires the use of function approximation techniques. In each MDP, the state-action value function is unknown and has to be estimated and incrementally built from reward samples and previous estimates of the same function. This situation becomes more critical in harder tasks but can be easily managed in easier tasks, as we will show later. Additionally, this estimation procedure has an important influence on the stability of the function approximation process and on the final performance of the learning, especially when a nonlinear function approximator is used. Some works (Boyan and Moore, 1995; Baird, 1995; Tsitsiklis and Van Roy, 1997) have reported low performance and divergence when function approximation is used. The exploration strategy (e.g., ε-greedy) also has an impact on the stability of the function approximation. The value function estimate used to derive the sampling policy affects the way the task samples are collected and these samples, in turn, affect the next value function estimate. This alternation between sampling and learning can result in longer learning times and, in the worst case, in divergence of the function approximation.

Related Work
The lifelong learning framework proposed by (Thrun, 1996) describes a scenario in which an agent interacts with a sequence of tasks. This scenario includes all possible future tasks that an agent may encounter over its lifetime. In RL, the lifelong learning setting focuses on problems in which an agent moves from one environment to another, or in which the agent is in a changing environment. Additionally, it is assumed that some relation exists between the tasks' MDPs. In our case, we need this relationship to construct an ordering of the tasks. Tanaka and Yamamura (1997) proposed a method to pre-train a neural network using the knowledge from the previous task, using it to bias the weights of the neural network that is used in subsequent tasks. Their experiments show the importance of the relation between the tasks and how this impacts the agent's learning in future tasks. White et al. (2012) investigated the use of policies from past tasks to construct general value functions that are used in an off-policy setting to improve the agent's performance in a new task. The general value functions can capture a wide variety of characteristics of the environment dynamics, which allows the agent to develop and preserve multiple capabilities.

Another learning strategy concerned with the use of previous knowledge is Transfer Learning (TL), which is primarily focused on task-to-task transfer of knowledge. TL has been successfully applied to several problems in machine learning (Pan and Yang, 2009). TL assumes that there exists a clear identification of the boundary where a task ends and a new one begins. Taylor and Stone (2009) presented a survey of research done in transfer learning related to RL and introduced the most used metrics to measure the efficiency of the transfer. These works describe methods such as calculating prior probabilities, transfer of samples, value function and policy transfer and transfer of the value function structure. Konidaris et al. (2012) defined a new value function, which considers only the common characteristics across all the tasks and computes an approximation of the Q values. In this way, the transfer process can be done with this function, using it to determine a value that is relevant for the action selection in every task. For this reason, additional training is necessary to bring this function close to the Q values. This work shows an alternative source of knowledge and validates the idea of searching for other knowledge sources. Taylor et al. (2007b) proposed intertask mappings to allow previous Q values to be used in new tasks. The intertask mapping transforms the state and action spaces of the target task into those of the source task; a Q value is then computed and used to determine which action to take in the target task. However, the authors do not specify the process to design or select the intertask mappings. A completely different approach was proposed by Lazaric (2008), who determined the relevance of samples from the source tasks in order to decide which ones to use in the agent's training when learning the target task. This framework allows a transition model or a reward function that differs between the source tasks and the target task, but the state-action spaces must be the same. The objective of the sample relevance analysis is to avoid negative transfer, which worsens the agent's performance. A related approach, multitask learning (Caruana, 1997), has proved to be effective when transferring from multiple source tasks simultaneously.
Other relevant works (Drummond, 2002; Torrey et al., 2005; Liu and Stone, 2006; Wilson et al., 2012; Fernández and Veloso, 2006) have applied transfer learning strategies to RL using different mechanisms: policy advice, value function transfer and model estimation. Each of these strategies finds knowledge in a different location and, based on that, implements a method to transfer it. A method related to learning in a task sequence in the context of supervised learning, curriculum learning, is presented in (Bengio et al., 2009) and (Kumar et al., 2010). These works show experimentally the effect of the task ordering inside the sequence on the learning rate. Weinshall et al. (2018) present an empirical evaluation of a curriculum in which the tasks are ordered by difficulty; it is experimentally shown that the convergence rate of a deep neural network improves as a result of transfer using the proposed curriculum. An extensive survey of curriculum learning applied to RL can be found in (Narvekar et al., 2020). This work summarizes a series of approaches that focus on the transfer of knowledge when an agent is given a series of tasks to solve; the analysis is organized around task generation, sequencing and transfer learning.

Materials and Proposed Method
Based on the following characteristics of human learning processes, we propose a technique to overcome some of the problems mentioned before:

- The tasks faced by humans appear in a sequence of increasing difficulty (scaffolding (Vygotsky and Kozulin, 1962))
- When learning new tasks, humans make use of previously acquired knowledge by finding similarities with older tasks (zone of proximal development (Vygotsky and Kozulin, 1962))
- Each new task expands the previous task's state and knowledge representation

Our method is related to the curriculum learning framework from an RL perspective and replicates characteristics of the human learning process: An agent learns to solve a sequence of tasks, the curriculum, that are related and ordered by increasing difficulty. Each task constitutes an episodic MDP with a large state space. The agent must learn to solve each MDP starting with the easiest one, the first task in the sequence. At the end of the first task, the agent will have knowledge that can be used (transferred) to solve the next task, which we assume to be related to the previous one, as shown in Fig. 1. The decomposition into a sequence of ordered tasks helps to segment the function approximation hypothesis space, as shown in Fig. 2. Easier tasks can use a small function approximator and thus a smaller hypothesis space, in which the value function optimization and the reinforcement learning problem benefit from: (1) Being less prone to function approximation divergence, (2) Better task exploration (sampling policy), since the agent needs to explore only the new part of the space (i.e., state space) and some part of the space is already known and (3) Less time and fewer samples needed to find a near-optimal policy.

Task Sequence Generation: The Curriculum
The objective of learning in a sequence of RL tasks is to improve some performance measure in the learning of the last task TK. For this reason, we adopt a top-down strategy, which means starting from the last task, the most difficult or complex one to solve. A decomposition step is then applied to TK to obtain an easier task TK−1, and the process continues until some desired initial task is reached. The decomposition step consists in the application of a rule f to a given task Tk:

f: Tk → Tk−1

There is a variety of rules f and each one can generate a different sequence. Moreover, a different rule can be applied in each decomposition step. We call the set of all possible rules applicable to TK the rule set. From its application to TK we can generate the set of all possible sequences whose final task is TK.

Definition 1 (Task sequence). Given a target task TK and a rule f, we can obtain easier related tasks TK−n, with n < K, by applying f repeatedly to each task. With these tasks, we can construct the sequence of tasks ordered by increasing difficulty.
This ruled top-down strategy guarantees a relation between the tasks in the sequence and knowledge preservation, since every task in the sequence contains all the information from the previous, easier tasks. In this context, easier means that an agent can learn a policy that solves Tk−1 in a shorter time (less experience needed) than the time needed to solve Tk, and the opposite applies to harder. Using this time measure we can say that a task is easier or harder than another task.
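The following sketch illustrates this top-down construction under the stated assumptions; simplify (standing in for the decomposition rule f) and is_simple_enough (the stopping criterion) are hypothetical callables named only for this example.

# Top-down construction of the task sequence: starting from the target task TK,
# a simplification rule is applied repeatedly until a desired initial task is reached.
def build_curriculum(target_task, simplify, is_simple_enough):
    sequence = [target_task]
    task = target_task
    while not is_simple_enough(task):
        task = simplify(task)      # f: Tk -> Tk-1, an easier related task
        sequence.append(task)
    sequence.reverse()             # order the tasks by increasing difficulty
    return sequence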

Task Similarity
In contrast to previous works (Carroll and Seppi, 2005; Ferns et al., 2012; Bou Ammar et al., 2014), where it is necessary to determine which tasks are relevant to transfer from, in our approach the task similarity measure is used to decide when it is the best moment to transfer knowledge from the previous task. The agent calculates this measure through a similarity function that evaluates the relatedness of the present states from Tk to the available information from the previous task (e.g., Q-value function, policy, state space information). The idea behind the similarity function is intuitive from the human learning perspective: When the agent is learning a new task Tk and faces a new situation, the obvious first reaction is to relate the current state to previous experience from Tk−1 in order to decide which could be the best action to take. In the same way that we used a rule set to generate the task sequence, a similarity function is necessary for the knowledge transfer from Tk−1 to Tk.
Definition 2 (Sample similarity). Given a state s from Tk and the state representation I(Sk−1) used in Tk−1, the sample similarity ρ(s) of s is obtained by comparing I(s) with I(Sk−1), where I is a function that extracts quantitative characteristics that represent a given space (action or state space).
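A minimal sketch of one possible instantiation of this definition is shown below; the Gaussian kernel and the feature extractor I passed as an argument are assumptions for illustration, since the concrete similarity functions are designed per task in the experiments.

import numpy as np

# Similarity of a state s from Tk against a summary I(S_{k-1}) of the previous task's space.
def sample_similarity(s, prev_space_summary, I, bandwidth=1.0):
    diff = np.asarray(I(s)) - np.asarray(prev_space_summary)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))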

Algorithm
The RL agent's learning algorithm for each task is shown in Algorithm 1; it is based on (Riedmiller, 2005). A Q-value Function Approximator (FA) is initialized and this function is updated in batch after N interactions between the agent and the environment. The samples are collected using the SARSA method (Rummery and Niranjan, 1994). This process is repeated M times (batches) to guarantee an appropriate learning of the task. This approach is called a batch algorithm (Lagoudakis and Parr, 2003; Bradtke and Barto, 1996; Ernst et al., 2005).
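A compact sketch of this batch scheme is given below, assuming hypothetical collect_samples and fit_q helpers that stand in for the sampling and regression phases of Algorithm 1.

# Collect N samples with the current approximator, refit the Q function on the batch,
# and repeat for M batches.
def batch_learning(env, q, select_action, collect_samples, fit_q, N=600, M=300, gamma=1.0):
    for _ in range(M):
        samples = collect_samples(env, q, select_action, N)           # (s, a, r, s', a') tuples
        targets = [(s, a, r + gamma * q(s_next, a_next))
                   for (s, a, r, s_next, a_next) in samples]          # SARSA-style regression targets
        q = fit_q(q, targets)                                         # batch update of the approximator
    return q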
The function simulate (line 5) is used to observe the evolution of the environment to the next state s' when an action a is performed by the agent in state s. The actionSelection function (line 6) is used by the agent to select the next action a' when it is in state s', and this function is where the transfer takes place. Algorithm 2 shows the pseudocode that implements the action selection through a transfer strategy from Tk−1 or through an ε-greedy strategy. The transfer strategy depends on the similarity measure ρ and on the transfer rate parameter, which is used to control the amount of transfer. The parameter ρ0 is used as a threshold to determine when a state s is similar enough to the previous experience. The transfer uses a policy advice method to select the action a through the function approximator Qk−1 estimated from the task Tk−1. The function filter (line 3) adapts the current state s, since it comes from Sk, to be a state in the form required by Qk−1.
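The sketch below gives one possible reading of this selection rule; filter_state, similarity, q_prev and q_curr are assumed callables, and the exact control flow of Algorithm 2 may differ.

import random

# Epsilon-greedy rho-advice selection: when the current state is similar enough to the
# experience from Tk-1 (rho >= rho_0) and a transfer draw succeeds, the previous task's
# greedy policy advises the action; otherwise the usual epsilon-greedy rule is applied.
def greedy_action(q, s, actions):
    return max(actions, key=lambda a: q(s, a))

def select_action(s, actions, q_curr, q_prev, filter_state, similarity,
                  rho_0=0.5, transfer_rate=0.1, epsilon=0.1):
    s_prev = filter_state(s)                           # map s in Sk to the form required by Q_{k-1}
    if similarity(s_prev) >= rho_0 and random.random() < transfer_rate:
        return greedy_action(q_prev, s_prev, actions)  # policy advice transferred from Tk-1
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return greedy_action(q_curr, s, actions)           # exploit the current estimate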

Experimental Results and Discussion
In this section, we describe the set of experiments conducted to test our proposed strategy. We introduce the attack avoidance task, from which the task sequence and the similarity function were generated.

Attack Avoidance
The attack avoidance game is a discrete version of pursuit-evasion (Ho et al., 1965; Parsons, 1978) and differential games (Isaacs, 1999). These game classes are related to the analysis and modeling of dynamical systems in which a set of variables evolves following a system of differential equations.
Our attack avoidance game consists of an agent who must reach a goal zone and one or more attackers who pursue the agent. If an attacker touches the agent before it reaches the goal zone, the agent loses the game; otherwise, the agent wins. Additionally, the agent is not allowed to stay in the forbidden zone, which corresponds to the zones on both sides of the goal zone. The game board and the actions for the agent and the attacker are shown in Fig. 3.
Using the attack avoidance game, we designed the set of four tasks shown in Fig. 4. In task T0 the agent has the same dynamics but there is no attacker. In this case, the agent's objective is to find the path to the goal zone. The rewards are defined as: 100 for reaching the goal zone, −100 for reaching the forbidden zone and 0 otherwise. In the second task T1 there is one attacker, in the third task T2 there are two attackers and in the fourth task T3 there are three attackers. This increase in the number of attackers makes each game more difficult than the previous one. In these last three tasks, the rewards change: 1 for reaching the goal zone, −1 for reaching the forbidden zone or when an attacker touches the agent, and 0 otherwise. The sequence containing the four tasks was generated by applying a simplification rule that consisted in eliminating one attacker from each task, starting with T3.
In this case, the process ended when there were no more attackers on the board.
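As an illustration, a minimal reward function consistent with the ±1 scheme of T1-T3 could look as follows; the zone-membership predicates are assumed helpers, not the paper's implementation.

# +1 for reaching the goal zone, -1 for entering the forbidden zone or being
# touched by an attacker, 0 otherwise.
def reward(agent_pos, attacker_positions, in_goal_zone, in_forbidden_zone):
    if in_goal_zone(agent_pos):
        return 1.0
    if in_forbidden_zone(agent_pos) or agent_pos in attacker_positions:
        return -1.0
    return 0.0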
The state variables are the agent position (x, y), the inverse distance from the agent to the attacker and the inverse distance from the attacker to a fixed point in the goal zone. This state representation is extended in the tasks where there is a new attacker.
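A sketch of how such a state vector could be assembled is given below; goal_point and the small eps constant used to avoid division by zero are our own illustrative assumptions.

import numpy as np

# Agent position plus, for each attacker, the inverse agent-attacker distance
# and the inverse attacker-goal distance.
def state_vector(agent_pos, attacker_positions, goal_point, eps=1e-6):
    features = [float(agent_pos[0]), float(agent_pos[1])]
    for att in attacker_positions:
        d_agent = np.linalg.norm(np.subtract(agent_pos, att))
        d_goal = np.linalg.norm(np.subtract(att, goal_point))
        features += [1.0 / (d_agent + eps), 1.0 / (d_goal + eps)]
    return np.array(features)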
To solve each task, we took N samples from the environment before performing a value function update, as shown in Algorithm 1. In this sampling phase, an ε-greedy strategy collects samples from the interaction with the environment (game simulation), with ε set to 0.1. Next, in the learning phase, an approximation of the value function is computed from the samples using the SARSA algorithm with γ set to 1.0.
At the beginning of each episode, the agent and the attackers are initialized at random positions. 300 batches of 600 samples were used to train the agent in T0 and 6000 batches of 600 samples were used to train the agent in tasks T1, T2 and T3.
We designed 15 experiments to test different sequence configurations using the four tasks from the complete attack avoidance sequence. Table 1 shows the tasks included in each curriculum experiment.

Function Approximation
A neural network was used to approximate the Q-value function. We started with a simple model for T0 and let the model grow as needed along the sequence. A representation of this process is shown in Fig. 5 and the structure of each network is shown in Table 2. The growing process is as follows: After the training in Tk−1, a new neural network is created for Tk using the neural network structure of Tk−1 and adding nodes to the input layer so it can receive the state variables of Tk. Additionally, we add more neurons to the hidden layer in order to increase the network's approximation capacity for the more complex Tk. The weights of the new neural network are initialized randomly.
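The growing step can be sketched as follows for a single-hidden-layer network; the layer sizes and the weight scale are placeholders, not the values reported in Table 2.

import numpy as np

# The new network for Tk adds input units for the new state variables and extra hidden
# neurons; all weights are initialized randomly (the previous weights are not copied).
def grow_network(prev_n_inputs, prev_n_hidden, extra_inputs, extra_hidden, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n_inputs = prev_n_inputs + extra_inputs
    n_hidden = prev_n_hidden + extra_hidden
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))   # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(1, n_hidden))          # hidden -> single Q-value output
    return W1, W2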

Similarity and Policy Advice
Transfer was done using a policy advice strategy inside the actionSelection function (Algorithm 1, line 6). As the knowledge source, we used the Q-value approximator obtained in Tk−1 to derive the policy πk−1. This policy was used to advise the agent during the Tk training through an ε-greedy ρ-advice strategy. Algorithm 2 shows the action selection function, where the agent uses a similarity function ρ to determine in which states it can use the previous policy. For the attack avoidance game, a similarity function was designed for each Tk−1 to Tk transfer. Fig. 6 shows the similarity functions used in the transfer to T3. To measure the influence of the amount of transfer used, we added a transfer control rate variable. This variable was set to ten different values (0.0, 0.1, 0.2, ..., 0.9) and, for each value, it was kept fixed during the training.

Table 1: Tasks included in each curriculum experiment and the corresponding target task

Experiment   T0   T1   T2   T3   Target task
E1           X                   T0
E2                X              T1
E3           X    X              T1
E4                     X         T2
E5           X         X         T2
E6                X    X         T2
E7           X    X    X         T2
E8                          X    T3
E9           X              X    T3
E10               X         X    T3
E11          X    X         X    T3
E12                    X    X    T3
E13          X         X    X    T3
E14               X    X    X    T3
E15          X    X    X    X    T3

Analysis
We focus our analysis on the experiments where T3 is the target task. Each Tk−1 to Tk transfer experiment was run 15 times. The Q-value function transferred corresponds to an agent whose performance was the average of the 15 runs. From this agent, we also obtain the transfer rate. Table 3 shows the detailed performance of the transfer in the curriculums. Figure 7 shows the agent's winning probability at the end of the training for the experiments in which T3 was the target task. All the experiments show an increase in their performance for the transfer rate 0.1. Particularly important at lower transfer rates is the performance of experiment E9, which contains T0 and T3; this reveals that a single task can improve the performance considerably if it contains relevant knowledge that can be shared across the whole curriculum. In this case, T0 is the source of the knowledge related to the goal location. This is more evident in Table 3, where it can be seen that the best performances for T3 are obtained in the experiments that included T0 in the curriculum. Although E10 and E12 also contain only two tasks, the performance in these experiments is not as good as in E9. The knowledge in E10 and E12 is more complex due to the presence of the attackers and is not shared over all the tasks.
The agent with the best performance corresponds to E15, whose curriculum contains all the tasks. Figure 8 shows the mean reinforcement in the final part of the curriculum for the experiments in which T3 was the target task. As expected, these results show a behavior similar to the winning probabilities.
However, for the transfer rate 0.9 the agent is not learning, since 90% of the time it is using the previous policy and it is exploring in the remaining 10%. In this case, the performance is highly dependent on the previously transferred knowledge and the experiment performances are perfectly ordered according to Table 1, which indicates that the more complete the curriculum, the better the performance.
In Figs. 7 and 8 we see a decrease in the agent's performance as the transfer rate increases. This decrease is a signal that negative transfer is occurring and that the transfer rate must be selected carefully to avoid it. Also, in these figures the agent's performance in E8 is plotted to visually compare the effectiveness of our strategy in an equal-time setting. E15 takes 300 batches for T0 and 6000 batches for each of T1, T2 and T3, for a total of 18300 batches. The curriculum in E15 was able to reach a better final performance than the one obtained when learning T3 from scratch, even for the transfer rate 0.9 and for a similar number of batches.

Conclusion
In this study, a new framework for transfer learning was presented. First, we showed the importance of using a similarity function when learning in a sequence of tasks. The similarity function acts as a memory unit that allows the agent to compare old experiences with new ones and exploit the acquired knowledge in similar states. Our evaluations confirmed that the use of the similarity function improves the agent's learning rate in new tasks compared to learning from scratch.
Moreover, we also showed the importance of the presence of all the tasks in the sequence. Our experiments using different sequences, called curriculums, show that important knowledge is contained in every task. For this reason, an adequate construction of the sequence is needed to guarantee an effective transfer of knowledge that results in a better performance in the target task.
Additionally, through the experiments modifying the FA size for each task, we observed that it is possible to devise a strategy to find an adequate FA structure for the target task (the most complex or hardest one) and a proper way to train it, obtaining a better approximator in less time than training it from scratch.
Finally, by using the proposed transfer strategy, which includes the sequence design and the similarity function, the performance of the agent was improved in terms of the time and samples needed in the target task, confirming the importance of using the similarity function to determine when and where to apply transfer during the learning.