Combining Q-Learning and Multi-Layer Perceptron Models on Wireless Channel Quality Prediction

Corresponding Author: Andrea L. Piroddi, Department of Computer Science, University of the People, Pasadena, California, USA. Email: andrea.piroddi@uopeople.edu

Abstract: One of the most complex challenges that wireless communication systems will face in the coming years is the management of the radio resource. According to forecasts (CISCO, 2020), the growth of mobile devices will lead to the coexistence of about 8.8 billion mobile devices, with an increasing trend in the following years. This scenario makes the reuse of the radio resource particularly critical, since the resource itself will not undergo significant changes in terms of bandwidth availability. One of the biggest problems to be faced will be identifying solutions that optimize its use. This work shows how a combined approach of a Reinforcement Learning model and a Supervised Learning model (Multi-Layer Perceptron) can provide good performance in the prediction of the channel behavior and in the overall performance of the transmission chain, even for Cognitive Radios with limited computational power, such as NB-IoT, LoRaWAN and Sigfox devices.


Introduction
Current communication networks are rather complex dynamic systems; on the other hand, the simulation tools we have to estimate the behavior of these architectures are based on simplified models that are often unable to reproduce the interaction of the multiple components involved, such as the presence of interferers and phenomena such as fading, moving obstacles, atmospheric events and, last but not least, the characteristics of the surrounding environment, all of which can have a negative impact on the parameters of our system, such as frequency, amplitude, delay, etc. It is also true that networks today are able to produce a huge amount of measurement data and metadata which, if properly exploited, could improve the management of and the interaction between the different elements in the network (Samek et al., 2017). Machine learning algorithms, Reinforcement Learning specifically, are particularly well suited for this purpose. The idea is to change the paradigm used so far, in which the goal is to adapt the transmission to changes in the characteristics of the channel, to a new methodology that aims to predict what the characteristics of the channel will be in the instant preceding the transmission event.

In the following sections we introduce the concept of Cognitive Radio; then we show a Supervised Learning model applied to an indoor context, in which the system is able to predict the behavior of the channel inside the premises and to adapt some transmission parameters to guarantee a constant BER value. Finally, referring to the valuable work done by (Gawłowicz and Zubow, 2019), in which the combination of the two simulation tools Network Simulator 3 (ns-3) and OpenAI Gym is proposed, we present an optimized Q-Learning algorithm that allows the agent to predict the behavior of the environment when sudden interference occurs in the system and, consequently, to implement the correct policy in a Reinforcement Learning setting. A better link quality means a higher ratio of successful receptions and therefore a more reliable communication.

The original contribution of this paper is the following: by appropriately combining two Machine Learning methodologies, it is possible to predict the behavior of the radio channel at a low computational cost, making this approach suitable for application in environments where terminals have limited computing capacity, as in IoT, LoRaWAN and Sigfox systems. This translates into longer battery life and the possibility of increasing the number of terminals in the area served by a single node.

Background "A Cognitive Radio is the application of intelligent processing and adaptation to a wireless communications system" (Rondeau and Bostian, 2009). The basic idea is to make our network element an entity capable of learning, through the observation of environmental parameters, the behavior of the transmission channel and predicting its variations by acting in such a way as to optimize its performance in terms of throughput, power, coding scheme, energy consumption and at the same time minimizing interference to other devices.
In Fig. 1 we show an example of the policy that the agent implements, foreseeing a variation of the characteristics of the transmission channel. For a long time in wireless communications we have described the channel used for transmission using different parameters, such as the operating frequency, the type of transmission medium (e.g., air, water), the type of environment (e.g., indoor, outdoor, urban) and the relative position of the communicating parties (e.g., line of sight, non-line of sight). The physical layer technology implemented in the transmitter and receiver includes blocks such as the antenna, the frequency shifter, the sampler, the synchronizer, etc. The link layer is responsible for the correct delivery of the data frame and therefore includes header assembly and disassembly techniques and payload encoding and decoding, as well as mechanisms for error detection, correction and retransmission. The quality of a link, however, is ultimately influenced by a relatively limited number of observations, the so-called set of metrics. (Cerar et al., 2018) collects the metrics that can be used to measure radio link quality. Each metric can also be used as an input for another metric. So-called hardware-based metrics, such as the Received Signal Strength Indicator (RSSI), Link Quality Indicator (LQI), Signal to Noise Ratio (SNR) and Bit Error Rate (BER), are produced directly by the devices and depend on underlying metrics, such as the Noise Figure, which are specific to each supplier. It is clear, looking at the table, that the number of independent variables is bounded. Recently, however, topological (surrounding space) features have been taken into consideration as an additional input; this presupposes the exchange of information on several levels, where the Link Quality Estimator (LQE) is informed about the distance from the base station (or access point), etc. In this study we considered the development of a model based on topological data and classical metrics.
The classic approaches to channel resource management are based on measurement data reports sent by the terminal to the central entity and on decision actions provided by the central unit to the terminal. These algorithms are managed centrally by the control unit, which sends the actions to be performed by the terminal, such as a handover to a different node, an increase in transmission power, or a change in modulation. This methodology presents a critical weakness: the device must always be connected to the central unit, otherwise the connection will be disrupted. Any unexpected variation of the radio parameters, such as sudden interference, can cause packet loss and the need to retransmit both the payload and the channel control packets several times. Essentially, the terminal is never autonomous in deciding which action to take in order to maintain the connection. In the event of sudden changes in the surrounding conditions, our terminal must be able to autonomously interpret the data collected and implement a decision that allows it to prevent the loss of the connection with the central unit. This is the reason why we propose the combined use of supervised and unsupervised learning methods in the management of radio resources.

Dataset and Layout
To verify our idea, we used an excellent dataset made available to the scientific community by Gonzalez-Ruiz of the University of New Mexico (Gonzalez-Ruiz et al., 2011). The wireless channel measurements were collected indoors over a floor of the ECE building at UNM, along several routes. Figure 2 shows the floor plan as well as the regions where the measurements were taken. The triangular symbol shows the position of the transmitter; the position of the origin is also marked. The measurements were made in different regions, marked by R, and were collected with a router acting as the Transmitter (Tx) and a Pioneer robot carrying a WiFi card acting as the Receiver (Rx). Both transmitter and receiver are omni-directional. The WiFi card is an Atheros ar5006x card operating at 2.4 GHz. The coordinates (x, y, z) of the origin are set to (0, 0, 0); the unit used in this document and in the data files is the meter. The transmitter's location is (0.115, 0.11, 1.5). There is a total of 16 regions of measurements (R1-R16) and each region contains several routes of measurements, for a total of 67 routes. The total number of measurements is 12,463, which makes this a solid dataset to work with. The data are in the format shown in Table 2.
Column 4 is the measured RSSI (Received Signal Strength Indicator) of the signal in dBm.
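As a minimal sketch of how such data can be imported with Pandas (the file name is hypothetical and the column layout is our assumption based on Table 2):

```python
import pandas as pd

# Hypothetical file name; columns assumed from Table 2:
# receiver coordinates (x, y, z) in meters and RSSI in dBm.
columns = ["x", "y", "z", "rssi_dbm"]
df = pd.read_csv("route_measurements.txt", sep=r"\s+", names=columns)

print(df.describe())  # quick sanity check of the samples
print(df["rssi_dbm"].min(), df["rssi_dbm"].max())
```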
Having a clear picture of our environment, we supposed the occurrence of an interference source at a random point of our layout. The interference source is a signal that is stationary over time, with a transmission power of 0 dBm, the same center frequency and bandwidth as those used by the router, and a fixed position. We assumed the interfering signal to be subject to fading and path loss according to the Multi-Wall indoor Model (Publications Office of the EU, 1999), that is:

L = L_{FS} + L_c + \sum_{i=1}^{N_{type}} N_{wi} L_{wi} + N_f L_f

Where:
L_{FS} = free space loss between transmitter and receiver
L_c = constant loss
N_{wi} = number of penetrated walls of type i
L_{wi} = loss of walls of type i
N_f = number of penetrated floors
N_{type} = number of wall types
L_f = floor loss

Since the loss due to floor penetration experimentally appears to be non-linear with the number of crossed floors, an alternative version of the MWiM model has been proposed:

L = L_{FS} + L_c + \sum_{i=1}^{N_{type}} N_{wi} L_{wi} + N_f^{\left(\frac{N_f+2}{N_f+1} - b\right)} L_f

Typical parameter values are: L_c = 0 dB, L_{wi} = 3-5 dB, L_f = 15-20 dB, b = 0.46.
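A minimal sketch of how the interference attenuation could be computed under this model (function names and the specific parameter values are our assumptions, chosen within the typical ranges quoted above):

```python
import math

def free_space_loss_db(distance_m: float, freq_hz: float = 2.4e9) -> float:
    """Free-space path loss (dB) at distance d and carrier frequency f."""
    c = 3e8  # speed of light, m/s
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / c)

def multi_wall_loss_db(distance_m, walls, n_floors=0,
                       l_c=0.0, l_f=18.0, b=0.46):
    """Multi-wall indoor model with the non-linear floor term.

    walls: list of (n_walls_of_type_i, loss_per_wall_db_of_type_i).
    """
    loss = free_space_loss_db(distance_m) + l_c
    loss += sum(n_w * l_w for n_w, l_w in walls)
    if n_floors > 0:
        exponent = (n_floors + 2) / (n_floors + 1) - b
        loss += (n_floors ** exponent) * l_f
    return loss

# Example: a 15 m link crossing two light walls (4 dB each), same floor.
print(multi_wall_loss_db(15.0, walls=[(2, 4.0)]))
```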

SINR Modeling
SINR is usually defined for a specific receiver (or user). For a receiver positioned at some point x in space, its corresponding SINR value is given by:

\mathrm{SINR}(x) = \frac{P}{I + N}

where P is the received power of the signal of interest, I is the power of the other (interfering) signals in the network and N is the noise term, which may be random or a constant. In the following we are going to consider Additive White Gaussian Noise (AWGN). The propagation model leads to a model for the SINR (Andrews et al., 2010). Consider a collection of n transmitters located at points x_1 to x_n in the plane or in 3D space. Then, for a user located, for example, at x = 0, the SINR for a signal coming from the i-th base station (x_i) is given by:

\mathrm{SINR}_i = \frac{P_i h_{io} r^{-\alpha}}{N + \sum_{j \in \varphi} P_j h_{jo} X_j^{-\alpha}}

where h_{io} is the power fading coefficient of the channel from node "i" to the receiver of interest "o", P_i is the power of transmitter "i" and φ is the set of interfering nodes (φ is a subset of all possible transmitters). The desired transmitter is at distance r from the desired receiver, while the j-th interferer is at distance X_j. In our case we can consider the numerator as the measured RSSI. The component \sum_{j \in \varphi} P_j h_{jo} X_j^{-\alpha} can be seen as the interference signal that reaches our receiver from the interference source (α is the path loss exponent, α > 2) and the term N is the noise power at the receiver, given by:

N = N_0 B

where B is the signal bandwidth and N_0 is the noise power spectral density, given by N_0 = k_B T (k_B is Boltzmann's constant, 1.38 × 10⁻²³ J/K, and T is the system temperature in K). This means that we can calculate the SINR at each point of the floor. Figure 4 shows the SINR measured in Region 2.
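As a minimal sketch (function and variable names are ours), the per-point SINR can be computed from the measured RSSI and the modeled interference as follows:

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K

def noise_power_dbm(bandwidth_hz: float, temp_k: float = 290.0) -> float:
    """Thermal noise power N = k_B * T * B, converted to dBm."""
    n_watts = K_B * temp_k * bandwidth_hz
    return 10 * math.log10(n_watts * 1e3)

def sinr_db(rssi_dbm: float, interference_dbm: float, bandwidth_hz: float) -> float:
    """SINR = P / (I + N), computed in linear units (mW) and returned in dB."""
    p = 10 ** (rssi_dbm / 10)
    i = 10 ** (interference_dbm / 10)
    n = 10 ** (noise_power_dbm(bandwidth_hz) / 10)
    return 10 * math.log10(p / (i + n))

# Example: RSSI = -60 dBm, interference arriving at -85 dBm, 20 MHz channel.
print(sinr_db(-60.0, -85.0, 20e6))  # roughly 25 dB
```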
Our goal is to predict the behavior of the transmission channel in order to choose the policy that optimizes the performance of our system. For simplicity, we will consider the optimization of the throughput, so the basic idea is to use the most appropriate Modulation and Coding Scheme according to the prediction. To do this, a prediction of the BER is needed. One technique used to determine the quality of a digital transmission system is measuring its Bit Error Ratio (BER). The BER estimate is obtained by comparing the transmitted sequence of bits to the one received and counting the number of errors. The ratio between the bits received in error and the total number of bits received is the BER:

\mathrm{BER} = \frac{N_{errors}}{N_{bits}}

This is a statistical process, so the measured BER only approaches the actual BER as the number of bits tested approaches infinity. In most cases we only need to test whether the BER is less than a pre-defined threshold; the number of bits needed will then depend only on the BER threshold and on the required confidence level. Figure 5 (Nordin, 2012) shows how the BER varies as a function of the SINR (in a Dynamic Subcarrier Allocation scheme), depending on the Modulation and Coding Scheme (MCS) being used.
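For instance, in the common case where no errors are observed during the test, the number of bits n needed to assert BER < γ with confidence level CL follows from the binomial model as n = -ln(1 - CL)/γ. A minimal sketch (names are ours):

```python
import math

def bits_needed(ber_threshold: float, confidence: float = 0.95) -> int:
    """Bits to test, error-free, to claim BER < ber_threshold
    at the given confidence level (zero-observed-errors case)."""
    return math.ceil(-math.log(1.0 - confidence) / ber_threshold)

# Example: to claim BER < 1e-2 with 95% confidence,
# about 300 error-free bits must be observed.
print(bits_needed(1e-2, 0.95))
```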

Multi-Layer Perceptron Model
The Multilayer Perceptron (MLP) is an artificial neural network model (Fig. 6) that maps a set of input data onto a set of appropriate output data. It is made up of multiple layers of nodes in a directed graph, with each layer fully connected to the next. Except for the input nodes, each node is a neuron with a non-linear activation function. The Multilayer Perceptron uses a supervised learning technique called backpropagation for network training. The MLP is a modified version of the classical linear Perceptron and can discriminate data that are not linearly separable. The fact that it is a supervised neural network clearly suggests that this part of our optimization involves the interaction with a central entity that will update the policy. We will use the sigmoid, also known as the logistic function, as the activation function:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \sum_i w_i x_i + b

The output obtained after the forward pass is known as the predicted value (ŷ).

Learning Algorithm
The learning algorithm is composed of two parts: backpropagation and optimization.
In the backpropagation process, a loss function is used to obtain an estimate of how far we are from the desired solution. Generally, the Mean Square Error (MSE) is chosen as the loss function for regression problems and the cross-entropy for classification problems. For a regression problem the loss function is the mean square error, which squares the difference between the actual (y_i) and the predicted (ŷ_i) value:

L_i = (y_i - \hat{y}_i)^2

The loss function is computed over the entire training dataset and its average is called the cost function C:

C = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

To find the best weights for our Perceptron, we need to understand how the cost function changes in relation to the weights and biases. This is done with the help of gradients: we need to compute the gradient of the cost function with respect to the weights and the bias.
We compute the gradient of the cost function C with respect to the weight w_i using partial derivatives. Since the cost function does not depend directly on the weight w_i, we use the chain rule:

\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i}

For a single training sample, the gradient of the cost function C with respect to the predicted value ŷ is:

\frac{\partial C}{\partial \hat{y}} = 2(\hat{y} - y)

With the sigmoid activation, the gradient of ŷ with respect to z is σ(z)(1 - σ(z)), and the gradient of z with respect to the weight w_i is:

\frac{\partial z}{\partial w_i} = x_i

So we get:

\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial \hat{y}} \, \sigma(z)\,(1 - \sigma(z)) \, x_i

The bias is treated as a weight whose input is the constant value 1, so \partial z / \partial b = 1.
Let us now turn to the optimization. Optimization is the selection of the best weights and bias for the Perceptron. Choosing, for example, gradient descent as the optimization algorithm, the weights and bias are changed proportionally to the negative of the gradient of the cost function with respect to the corresponding weight or bias. The learning rate (α) is a hyperparameter used to control how much the weights and bias are changed at each step.

Fig. 6: Multi-layer perceptron model
Weights and bias are updated as follows, and backpropagation and gradient descent are repeated until convergence:

w_i \leftarrow w_i - \alpha \frac{\partial C}{\partial w_i}, \qquad b \leftarrow b - \alpha \frac{\partial C}{\partial b}
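A minimal numpy sketch of this training loop for a single sigmoid neuron (the data and hyperparameters are illustrative, not taken from our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0   # toy binary targets

w = np.zeros(3)
b = 0.0
alpha = 0.5  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    z = X @ w + b
    y_hat = sigmoid(z)
    # Gradients of the MSE cost via the chain rule, as derived above:
    dC_dyhat = 2.0 * (y_hat - y) / len(y)
    delta = dC_dyhat * y_hat * (1.0 - y_hat)  # dyhat/dz = sigma(z)(1 - sigma(z))
    grad_w = X.T @ delta                      # dz/dw_i = x_i
    grad_b = delta.sum()                      # dz/db = 1
    w -= alpha * grad_w                       # gradient-descent update
    b -= alpha * grad_b

print(((sigmoid(X @ w + b) > 0.5) == (y > 0.5)).mean())  # training accuracy
```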

Application of the MLP to the Prediction of the MCS
Starting from our indoor environment dataset, we can train the MLP to identify the correct policy for choosing the MCS. As input values we have the position of the receiver, its distance from the access point, the received RSSI levels and the measured SINR levels; as output value we want our system to indicate which MCS to use (Fig. 5), or whether the carrier should be changed. We then build our MLP using Python code and the Scikit-learn library. Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems (Pedregosa et al., 2011). To import the dataset and make it available as input to the Scikit-learn MLP we used Pandas. Pandas is an open-source library which provides high-performance data analysis tools for Python (Pandas Devel. Team, 2020). The MLPClassifier trains iteratively: at each time step, the partial derivatives of the loss function with respect to the model parameters are calculated to update those parameters. To prevent overfitting, a regularization term can be added to the loss function. The Python code is used to load the data, which are represented as dense numpy arrays of floating-point values, and to run the MLP classifier. We ran the simulation with different values of α, of the number of hidden layers, of the number of nodes in the hidden layers and of the number of iterations. Figure 7 shows the MLPClassifier configuration.
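A minimal sketch of this setup with Scikit-learn (the feature layout, placeholder data and hyperparameter values are illustrative, not the configuration of Fig. 7):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumed feature layout: x, y, distance from AP, RSSI (dBm), SINR (dB);
# labels are MCS classes (e.g., 0 = BPSK 1/2 ... 5 = 64QAM 3/4).
X = np.random.rand(1000, 5)              # placeholder for the measured features
y = np.random.randint(0, 6, size=1000)   # placeholder for the MCS labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(hidden_layer_sizes=(50, 50),
                    activation="logistic",  # the sigmoid discussed above
                    solver="adam",
                    alpha=1e-4,             # L2 regularization term
                    max_iter=500,
                    random_state=0)
clf.fit(scaler.transform(X_train), y_train)

print(clf.score(scaler.transform(X_test), y_test))
```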
The chosen classification is the one shown in Fig. 8, considering a target BER of 10⁻². Furthermore, we tested different configurations in terms of both solver and learning-rate type. Figure 9 shows some training loss curves obtained with different learning strategies, such as Stochastic Gradient Descent (SGD), Momentum, Nesterov Accelerated Gradient and Adam.
SGD performs frequent updates with a high variance, producing heavy fluctuations of the objective function.
Momentum is a method that aims to accelerate SGD in the relevant direction by adding a fraction γ of the update vector of the past time step to the current update vector:

v_t = \gamma v_{t-1} + \alpha \nabla_\theta C(\theta), \qquad \theta \leftarrow \theta - v_t

The momentum term increases the updates for dimensions whose gradients point in the same direction and decreases them for dimensions whose gradients change direction. The result is faster convergence and reduced oscillation.
Nesterov Accelerated Gradient (NAG) is a way to give the momentum term an approximation of the next position of the parameters, a rough idea of where our parameters are going to be:

v_t = \gamma v_{t-1} + \alpha \nabla_\theta C(\theta - \gamma v_{t-1}), \qquad \theta \leftarrow \theta - v_t

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. Besides storing an exponentially decaying average of past squared gradients v_t, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum. For the sake of brevity, g_t is used to denote the gradient at time step t. The decaying averages of past gradients m_t and past squared gradients v_t are computed as follows:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

m_t is an estimate of the first moment (the mean) and v_t of the second moment (the uncentered variance) of the gradients, hence the name of the method. Since m_t and v_t are initialized to zero, they are biased towards zero during the first steps; to counteract these biases, the strategy computes bias-corrected first and second moment estimates and uses them in the parameter update:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Adam works well in practice and compares favorably with other adaptive learning-rate algorithms. Figure 9 shows the training loss curves for the choice of the MCS. We have considered seven different learning strategies:

- Constant learning-rate
- Constant with momentum
- Constant with Nesterov's momentum
- Inv-scaling learning-rate
- Inv-scaling with momentum
- Inv-scaling with Nesterov's momentum
- Adam
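As a minimal numpy sketch of the Adam update (β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the commonly used defaults, not values from our experiments):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.05)
print(theta)  # approaches [0, 0]
```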

Results of MLP Prediction
The convergence is reached after about fifty iterations with the Adam strategy, which appears to be the most appropriate for this scenario. Table 3 shows the classification report using the Adam learning strategy.
This shows that the level of accuracy is high, although we must consider that this environment is far from being realistic. We should take into account other interfering elements and moving obstacles inside the set in order to make the scenario more accurate. On the other hand, it is true that the more interferers there are, the greater the contribution of measurements that will be made available to the central entity to recalculate the policy, because each interferer also works as a data source. In any case, the result obtained provides some interesting food for thought. The accuracy of such a system differs across learning strategies: for example, using a constant learning-rate policy the obtained score is 0.984113, while using an inv-scaling learning rate the score is 0.743400 and with inv-scaling with Nesterov's momentum the score is 0.770200. Furthermore, from Table 3 it can be noted that the most critical cases, i.e., those in which it is necessary to be reasonably sure of the prediction, are the two cases with the least uncertainty. Figure 10 shows the MLP's policy distribution for the 64QAM Modulation and Coding Scheme.

However, MLPs are trained in batch mode and remain static after training; therefore the estimator is not adaptable to persistent changes in the link. Batch or offline training of ML algorithms (Banerjee and Basu, 2007) means that the model is trained, optimized and evaluated once on the available training and test sets and must be completely retrained later to accommodate the dissemination of updated data. In practice, this corresponds to sporadic updates, for example once every few hours or once a day, depending on how the whole system was designed. In the case of embedded devices, the device must be fully or partially reprogrammed (Ruckebusch et al., 2016). This consideration therefore prompted us to evaluate whether it was possible to add an unsupervised approach to the MLP, so that the agent can self-learn the most suitable policy as the surrounding conditions change.

Reinforcement Learning Approach
Suppose our agent, to which a central entity has sent a policy, is experiencing sudden interference. We consider, for example, the problem of radio channel selection. It will take some time before the new policy is recalculated and sent back to our agent, so the objective of the agent is to choose, for the next time slot, a channel with no interference. Suppose the external interference has a periodic pattern, sweeping over all channels, one to four, in the same order. The agent must now autonomously learn a strategy that allows it to avoid the interfered time slots. In this case a Reinforcement Learning approach, in particular a Q-Learning model, can be the solution. In this sense, our simulation environment transfers control to the agent, which autonomously identifies the appropriate policy for the new situation.

Q-Learning Model
In this case we have to take into account the protocol stack of our system, as learning now takes place in real time. To do this we can use ns-3. Ns-3 is a discrete-event network simulator for Internet systems (ns-3 project, 2020). In order to make ns-3 communicate with a Reinforcement Learning algorithm in OpenAI Gym we used ns3-gym. OpenAI Gym is a toolkit for Reinforcement Learning (RL) widely used in research. Ns3-gym is a framework that integrates both OpenAI Gym and ns-3 (Gawłowicz and Zubow, 2019).
Q-Learning is a model-free machine learning algorithm: the AI "agent" does not need a model of the environment it will operate in. Indeed, the same algorithm can be used across different environments. Once the environment is defined, everything is split into "states" and "actions": the states are observations of the environment and the actions are the choices the agent makes based on those observations. Table 4 shows the RL mapping used by Gawłowicz.
The agent does not really need to know anything about the environment. For each environment, the agent can query how many actions are possible; in this case, there are four actions. When the agent steps the environment, it acts with a 0, 1, 2 or 3 as its "action" for each step. Each time it does this, the environment returns the new state, a reward, whether the environment is done and any extra info that some environments might provide. A "0" means use channel 1 for the next time slot, "1" means use channel 2, and so on. All the agent needs to know is what the options for actions are and, given a state, what the reward of performing a chain of those actions would be. The agent knows it can take 4 actions at any given time: that is the "action space". We then need the "observation space": in this gym environment, the observations are returned from resets and steps, and the "observation" is the information of which of the four channels is interfered at that time.
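A minimal sketch of this interaction loop against an ns3-gym environment (the environment id follows the ns3-gym examples; the classic Gym step signature returning a 4-tuple is assumed, and the scenario details depend on the ns-3 side):

```python
import gym
import ns3gym  # registers the ns3-gym environments; requires a running ns-3 script

env = gym.make("ns3-v0")
print(env.action_space)       # e.g., Discrete(4): one action per channel
print(env.observation_space)  # which of the four channels is interfered

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # random exploration for now
    obs, reward, done, info = env.step(action)  # new state, reward, done flag
env.close()
```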
The way Q-Learning works is based on a "Q" value per possible action per state. This produces a table. To figure out all of the possible states, the agent can either query the environment or simply engage with the environment for a while to figure them out. It will check this table to determine its moves. When the agent is being "greedy" and trying to "exploit" its environment, it will choose the action that has the highest Q value for the current state. However, sometimes, especially at the beginning, it may decide to "explore" and choose a random action. These random actions are the way our model learns better moves over time. Q values are updated as follows:

Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)

In our scenario the RL mapping is:

- Action: set the channel to use for the next time slot
- Reward: +1 in case of no collision with the interferer; -1 otherwise
- Gameover: if more than 3 collisions occur during the last ten time slots

The discount γ is a measure of how much the agent cares about future reward rather than immediate reward. Typically, this value is between 0 and 1; the higher the better, because the purpose of Q-Learning is, indeed, to learn a chain of events that ends with a positive outcome, so it is natural that the agent puts greater importance on long-term gains rather than short-term ones. The max_future_q term, max_a Q(s_{t+1}, a), is determined after the agent has already performed its action; the agent then updates its previous values based partially on the next step's best Q value. Over time, once the agent has reached the objective, this "reward" value gets slowly backpropagated, one step at a time, per episode.

Figure 11 shows the learning performance using a modified version of the Q-Learning algorithm used by Gawłowicz. The modified version of the algorithm can be found in a GitHub repository¹. The main difference we introduced, compared to the original version, is related to the libraries used: we have eliminated the dependence on libraries such as TensorFlow and Keras. These libraries, in fact, while ensuring high performance, use AVX instructions which may not run on older CPUs. In the original version, after 80 episodes the agent is able to perfectly predict the next channel state from the current observation, thus avoiding any collision with the interference. Our modified version needs some more episodes, about 600. On the other hand, the advantage is that the modified version can be used even on Cognitive Radios with limited computational power, such as NB-IoT, Sigfox and LoRaWAN devices, because it does not require GPU support or a high-performing CPU, since no high-performance numerical computation tools (Tensorflow, 2019; Keras, n.d.) were employed in the prediction.

¹ https://github.com/apirodd/apirodd/projects?query=is%3Aopen
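A minimal tabular Q-Learning sketch for this channel-selection problem (the environment here is a self-contained simulation of the periodic sweeping interferer, not the actual ns3-gym scenario; all names and hyperparameters are ours):

```python
import numpy as np

N_CHANNELS = 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

# State: channel currently occupied by the sweeping interferer (0..3).
# Action: channel to use in the next time slot (0..3).
Q = np.zeros((N_CHANNELS, N_CHANNELS))

for episode in range(700):
    interfered = int(rng.integers(N_CHANNELS))  # random phase of the sweep
    for t in range(100):
        state = interfered
        # Epsilon-greedy: mostly exploit, sometimes explore.
        if rng.random() < EPSILON:
            action = int(rng.integers(N_CHANNELS))
        else:
            action = int(np.argmax(Q[state]))
        interfered = (interfered + 1) % N_CHANNELS      # interferer sweeps on
        reward = 1.0 if action != interfered else -1.0  # collision check
        # Q-Learning update, as in the equation above.
        Q[state, action] = (1 - ALPHA) * Q[state, action] + \
            ALPHA * (reward + GAMMA * np.max(Q[interfered]))

# After training, the greedy policy avoids the next interfered channel:
# for state s, the chosen action should never be (s + 1) % 4.
print(np.argmax(Q, axis=1))
```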

Discussion
It has been shown in (Xu and Gu, 2020) that neural Q-learning with multiple layers finds the optimal policy with an O(1/√T) convergence rate if the neural function approximator is sufficiently overparameterized, where T is the number of iterations. Table 5, from (Jin et al., 2018), shows that the time complexity for the model-free scenario is O(T), where T is the total number of steps.
In real-time applications, an appropriate task representation or suitable initial Q-values are very important. In fact, prior results indicated that reinforcement learning algorithms are exponential in n (the number of states), thus limiting their practical use when the state space is high-dimensional. It has been shown in (Koenig and Simmons, 1993) that such algorithms are tractable if appropriate initial Q-values are used.
Further studies are moving towards the analysis of multi-agent interaction (Multi-Agent Reinforcement Learning, MARL). This would allow different devices to cooperate by identifying a multi-agent policy, addressing the sequential decision-making problem that arises when they operate in a common environment. Each agent aims to optimize its own long-term reward by interacting with the environment and with the other agents (Busoniu et al., 2008); in particular, both the evolution of the system state and the return received by each agent are influenced by the joint actions of all agents (Zhang et al., 2019).

Conclusion
Over the next few years, the number of mobile devices will grow steadily while the radio resource will remain substantially unchanged. It is therefore necessary to provide strategies for an optimized use of the radio channel. In this study we have shown a possible approach to the problem, highlighting how the combined use of supervised learning and reinforcement learning models, applied to predicting the behavior of the transmission channel, can provide interesting results on the performance of the entire system.

Author's Contributions
Andrea L. Piroddi: Designed the research plan, organized the study and participated in all experiments (in particular, he wrote and ran the Machine Learning code), coordinated the data analysis and contributed to the writing of the manuscript.
Maurizio Torregiani: Participated in all experiments, mainly contributing to the radio propagation aspects of the paper. Verified the consistency of the experimental results and contributed to the writing of the manuscript.