Data Preparation in Machine Learning for Condition-based Maintenance

Using Machine Learning (ML) prediction to achieve a successful, cost-effective Condition-Based Maintenance (CBM) strategy has become very attractive in the context of Industry 4.0. In other fields, it is well known that, in order to benefit from the prediction capability of ML algorithms, the data preparation phase must be well conducted. Thus, the objective of this paper is to investigate the effect of data preparation on the ML prediction accuracy of the performance decay of Gas Turbines (GTs). First, a data cleaning technique for robust Linear Regression imputation is proposed, based on Mixed Integer Linear Programming. Then, experiments are conducted to compare the effect of commonly used data cleaning, normalization and reduction techniques on the ML prediction accuracy. Results revealed that the best prediction accuracy of GT decay, found with the k-Nearest Neighbors ML algorithm, deteriorates considerably when the data preparation steps and/or techniques are changed. This study has shown that, for effective CBM application in industry, there is a need to develop a systematic methodology for the design and selection of adequate data preparation steps and techniques for the proposed ML algorithms.


Introduction
Under the Industry 4.0 paradigm, the reliability of industrial assets and production machines is very important. In the move towards the smart factory, machine health monitoring and management aim to operate with near-zero breakdown. In this context, Aivaliotis et al. (2019) have proposed a methodology based on the Digital Twin concept to enable predictive maintenance for manufacturing systems using Prognostics and Health Management techniques. Also, as reported in (Carvalho et al., 2019; Diez-Olivan et al., 2019), several recent Prognostics and Health Management projects were able to reach this level of profitable maintenance using Machine Learning (ML). Specifically, based on collected sensor data, intelligent ML predictive algorithms are implemented to achieve a successful Condition-Based Maintenance (CBM) strategy. This success is mainly related to the capability of such ML algorithms to handle high-dimensional and multivariate data from various sensors and to predict the degradation and future failure states (Accorsi et al., 2017). For example, Márquez et al. (2020) have shown the success of the Artificial Neural Network (ANN) in identifying the deterioration of bearings. Also, Coraddu et al. (2016) have shown the potential of ML algorithms in predicting the propulsive performance degradation of a naval vessel powered by Gas Turbines (GTs).
However, as pointed out in (Bennane and Yacout, 2010; Loukopoulos et al., 2017; Diez-Olivan et al., 2019), the majority of these works have centered on comparing the performance of different ML algorithms in degradation prediction and did not give enough insight into the data preparation phase. Only a few works, such as (Bukhsh et al., 2020), have described the data preparation technique used before applying predictive ML models, in that case for intelligent bridge maintenance. The data preparation phase generally includes three steps: data cleaning, i.e., handling missing data and outliers; data reduction, i.e., reducing the data size by aggregation, elimination of redundant features, etc.; and data normalization (Han et al., 2011). In other fields, such as biological and medical research, it is well known that this data preparation phase can greatly improve or deteriorate the ML prediction accuracy (Perez-Rey et al., 2006). For example, Nawi et al. (2017) have shown that the prediction accuracy of the Artificial Neural Network (ANN) ML algorithm deteriorates considerably when the data normalization step is conducted using the Min-Max technique rather than the Z-score one.
Thus, the present work aims to investigate the effect of the data preparation steps on the ML prediction of GT performance decay. Experiments are conducted on data generated from a gas turbine simulator, which were formerly used in (Coraddu et al., 2016; Cipollini et al., 2018) to show the benefit of ML in predicting the performance decay of GTs installed on a naval vessel. This work intends to go further by investigating the effect of the technique used during the data preparation steps on the ML prediction performance. To address this issue, a new Mixed Integer Linear Programming (MILP) model is first proposed to implement robust linear regression imputation as a data cleaning technique. This model is based on former works in the biomedical field (Omelchenko, 2014; Poos et al., 2016), which showed the benefit of MILP modelling in avoiding over-emphasizing outliers when building regression models. This MILP model is implemented and used as a cleaning technique in the data preparation step. The benefit of its use is shown through comparison with other, more common techniques such as mean imputation. Then, in order to analyze the effect of the different data preparation steps and techniques, three ML algorithms are used to predict the GT degradation coefficients: Linear Regression (LR), k-Nearest Neighbors (k-NN) and Neural Network (NN). The effect is measured by comparing the corresponding Mean Absolute Percentage Error (MAPE). The results show that this MAPE is sensitive not only to the steps but also to the technique used to prepare the data before applying an ML algorithm for degradation state prediction.
This study is organized as follows. Section 2 reviews previous relevant studies related to the data preparation steps and techniques used in CBM context. Section 3 presents the MILP formulation of a data cleaning technique developed to handle missing data. Section 4 introduces the methodology followed in this study. It also includes a description of the considered dataset, followed by a visualization of the effect of data imputation techniques. Section 5 includes the computational experiments and comparison of ML algorithms after performing the data preparation techniques. Finally, conclusions are discussed.

Literature Review
The success of CBM over preventive strategies is mainly due to its capability to avoid unnecessary maintenance tasks by taking action only when abnormal behavior of a physical asset is detected. Diez-Olivan et al. (2019) gave an extensive review on machinery diagnostics and prognostics implementing CBM. Since the success of this CBM strategy is highly dependent on the prediction accuracy, several researchers have focused on using ML algorithms to enhance the prediction of the failure state (Prajapati and Ganesan, 2013). According to (Coraddu et al., 2016), ML approaches have the capability to identify complex patterns in the received sensor data and provide a better estimation of the degradation state. A review of recent works using ML algorithms to predict the future degradation state and the remaining useful life of assets is given in (Carvalho et al., 2019). Although extensive work has been carried out on ML for CBM, little attention has been paid to the data preparation phase. According to (Bennane and Yacout, 2010; Loukopoulos et al., 2017; Diez-Olivan et al., 2019), the relevance of the data preparation phase has been widely recognized in the literature, but few research efforts have been carried out to address this issue in the CBM context.
In this context, (Bennane and Yacout, 2010; Ragab et al., 2016) mainly focused on data cleaning by identifying outliers using the Inter-Quartile Range (IQR) method and handling the missing data using different techniques, namely complete case analysis, mean imputation and k-Nearest Neighbors (k-NN) imputation. Data were cleaned using the Logical Analysis of Data (LAD) model; then a supervised learning algorithm was used to predict the health state of an oil transformer system. Loukopoulos et al. (2017) have also presented different imputation techniques to handle the missing data for the CBM application on centrifugal compressors. Among these techniques, the autoregressive model, k-NN imputation, Self Organizing Map (SOM) and Bayesian Principal Components Analysis (BPCA) were used to fill in the missing data. Tsang et al. (2006) proposed three data cleaning procedures to handle missing data in the practice of CBM optimization. The first one is based on completely recorded observations, also known as complete case analysis. The second comprises imputation-based procedures such as mean imputation or regression imputation. The last proposed procedure is model-based: the models' parameters are estimated using techniques such as the Expectation Maximization (EM) algorithm. Hu et al. (2012) implemented the z-score normalization technique to prepare data before the application of the Recurrent Neural Network (RNN) model used to predict the Remaining Useful Life (RUL). Ghasemi et al. (2013) reduced the dimensionality of the condition monitoring dataset using Principal Component Analysis (PCA) before the application of the LAD model. Ragab et al. (2016) applied feature selection and extraction methods to reduce the data dimensionality. Feature extraction was performed using the Discrete Wavelet Transform (DWT) method, while feature selection was performed using the Distance Evaluation Technique (DET).
These data preparation techniques improved the accuracy of LAD algorithm used to estimate the RUL of a turbofan engine. Bukhsh et al. (2020) have proposed a predictive model based on the Neural Network algorithm for efficient bridge maintenance planning. They have used the z-score data normalization.
As seen from the reviewed works, when the data preparation phase is conducted in the CBM context, authors have generally focused on a single aspect: data cleaning, data reduction or normalization. However, in other fields, such as biomedicine, it is well known that efficient data preparation may have a major impact on the ML algorithm performance (Nawi et al., 2017). For example, Wu et al. (2019) have shown the benefit of conducting data cleaning and then data reduction to better predict fatty liver disease. Singh and Singh (2019) prepared a clinical biomedical dataset by performing three different steps. The data cleaning step consisted of imputing missing values either by their means or by their modes. The dataset was then normalized using the z-score normalization technique. The final step, data reduction, was conducted using three feature selection techniques: the χ² statistic, symmetric uncertainty and gain ratio. Daoud and Mayo (2019) presented different techniques of data normalization and data reduction implemented before building ML models used for cancer prediction. Data normalization was carried out using different methods. The PCA method was performed to reduce the data dimensionality. Kotsiantis et al. (2006) also conducted data cleaning and data normalization in order to achieve better performance of supervised algorithms.
According to these works in the biomedical field, preparing data with different steps may considerably affect the ML model performance. More recently, some researchers have investigated the issue of using different techniques when conducting a data preparation step. Nawi et al. (2017) compared the effect of three different normalization techniques on the performance of Artificial Neural Networks (ANN), namely Min-Max normalization, Z-score normalization and Decimal Scaling normalization. Results on all datasets revealed that the ANN produces the best performance when the z-score technique is applied. Also, Lokman et al. (2019) have shown the effect of different normalization techniques on the accuracy of anomaly detection in cyberattacks.
This literature review reveals that researchers in CBM have generally focused on proposing new ML models for better degradation and failure prediction. To the best of our knowledge, and according to the recent review (Carvalho et al., 2019), comparing the effect of different data preparation steps and techniques on this prediction accuracy has not yet been addressed in the CBM context. Thus, in this study, the effect of the widely used steps and techniques of the data preparation phase is analyzed. A new data cleaning technique, named robust Linear Regression imputation, is also introduced. The proposed model to implement this imputation technique is given in the next section.

Mathematical Model of the Robust Linear Regression
In this section, the MILP model proposed to implement a robust linear regression for imputation purpose is presented.

Problem Description
One of the classical imputation methods to handle missing data uses the simple LR model. However, a commonly known drawback of this simple LR model is its lack of robustness to unusual observations, usually called outliers. To overcome this drawback, Omelchenko (2014) has proposed a more robust linear regression using MILP. This model allows the detection and exclusion of potential outliers from the regression model. This approach was originally developed for predicting the chemical properties of peptides. This regression model was implemented and tested on real biological dataset examples, proving that it has better predictive performance than the regular LR technique. Poos et al. (2016) have also shown that MILP is a powerful approach which avoids over-emphasizing outliers when building LR models. Based on these studies, a new MILP is proposed to adapt this robust Linear Regression approach for imputation purposes.
Hence, suppose that there are $n$ observations $(x_i, y_i)$, $i = 1, \dots, n$, in a dataset. The simple LR model is written as:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n \qquad (1.1)$$

where $\beta_0$ and $\beta_1$ are the regression coefficients and $\varepsilon_i$ is the error term. The simple LR model is not robust to outliers. In fact, when the independent variables $x_i$ or the dependent ones $y_i$ include outliers, the estimates of the regression coefficients $\beta_0$ and $\beta_1$ in Eq. (1.1) will be biased. Thus, the simple LR imputation method might not be reliable for estimating missing data in the presence of outliers. Therefore, the least absolute deviations method is used to estimate the regression coefficients $\beta_0$ and $\beta_1$. This approach consists of minimizing the sum of the absolute errors $|\varepsilon_i|$, i.e., the distances between the actual dependent variable $y_i$ and the predicted one $\hat{y}_i = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, n$, as follows:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (1.2)$$

The linearization of this objective function results in a MILP model. Constraints are added to this mathematical program so that only the "good" observations are taken into account when estimating the regression coefficients of the simple LR model. In other words, if an observation is considered an outlier, the MILP model disregards it when estimating the coefficients. The MILP model can detect and remove outliers up to a specified maximum number. For this imputation technique, the following MILP is proposed.
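As a side illustration of the robustness argument above, the following minimal Python sketch compares a least-squares fit with a least-absolute-deviations (LAD) fit on a small synthetic sample containing one outlier. The data, the function names `ols_fit` and `lad_fit`, and the brute-force search are assumptions made for this example; they are not the MILP model of this study. The search relies on the fact that, for simple regression, an optimal LAD line passes through at least two data points, which is only practical for small samples.

```python
from itertools import combinations


def ols_fit(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def lad_fit(x, y):
    """Least-absolute-deviations fit by trying every line through two points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid regression line
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        cost = sum(abs(yk - b0 - b1 * xk) for xk, yk in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, b0, b1)
    return best[1], best[2]


# Synthetic sample: the first five points follow y = 1 + 2x, the last is an outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 40.0]

ols_b0, ols_b1 = ols_fit(x, y)
lad_b0, lad_b1 = lad_fit(x, y)
print("least squares:", ols_b0, ols_b1)
print("least absolute deviations:", lad_b0, lad_b1)
```

On this sample the least-squares slope is pulled well above 2 by the single outlier, whereas the LAD line stays on the five regular points.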

Mathematical Formulation
The sets, parameters and decision variables of the MILP model are defined as follows. In order to perform the robust simple linear regression, the MILP model is formulated as follows:

Sets
where $M$ is a parameter that has a large value. In Model (2.1)-(2.8), the objective function (2.1) minimizes the sum of the absolute distances between the actual dependent variable $y_i$ and the predicted one $\hat{y}_i$, in order to find the coefficient values $\beta_0$ and $\beta_1$ which provide the best-fitting line through the data points. As mentioned in section 3.1, the regression model is constructed by solving the least absolute deviations problem. To linearize the non-linear objective function (1.2) of this problem and to remove the absolute value operator, the positive auxiliary variable $\alpha_i$ is introduced. Constraints (2.2) and (2.3) exclude the outliers from the regression model.
Specifically, if observation $i$ is a good observation, its binary outlier indicator is equal to 0. If the observation is an outlier, the indicator is equal to 1 and $\alpha_i$ is restricted to be equal to 0, while $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$ always lies in the range $[-M, M]$. Constraint (2.4) limits the number of outliers that can be detected by the model. Finally, Constraints (2.5)-(2.8) define the nature of the variables. This MILP model is implemented using CPLEX (version 12.7). The following section shows how the imputation is conducted using this robust Linear Regression approach.
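One formulation consistent with the description above can be sketched as follows. This is a hedged reconstruction, not necessarily the authors' exact model: the symbol $z_i$ for the binary outlier indicator, the symbol $o$ for the maximum number of outliers (as used in section 4.2), the exact form of the big-M constraints and the assignment of the equation numbers are all assumptions made for illustration.

```latex
\begin{align}
\min \quad & \sum_{i=1}^{n} \alpha_i \tag{2.1} \\
\text{s.t.} \quad
 & y_i - \beta_0 - \beta_1 x_i \le \alpha_i + M z_i, && i = 1, \dots, n \tag{2.2} \\
 & -\left(y_i - \beta_0 - \beta_1 x_i\right) \le \alpha_i + M z_i, && i = 1, \dots, n \tag{2.3} \\
 & \sum_{i=1}^{n} z_i \le o \tag{2.4} \\
 & \beta_0 \in \mathbb{R} \tag{2.5} \\
 & \beta_1 \in \mathbb{R} \tag{2.6} \\
 & \alpha_i \ge 0, && i = 1, \dots, n \tag{2.7} \\
 & z_i \in \{0, 1\}, && i = 1, \dots, n \tag{2.8}
\end{align}
```

In this sketch, when $z_i = 0$ the two constraints force $\alpha_i \ge |y_i - \beta_0 - \beta_1 x_i|$, so the observation contributes its absolute error to the objective. When $z_i = 1$ both constraints are relaxed by $M$, and the minimization drives $\alpha_i$ to 0, so the observation no longer influences the fitted coefficients.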

Methodology
This section presents the methodology used in this study. Section 4.1 describes the ML process and the considered Gas Turbine (GT) dataset. Section 4.2 lists the selected techniques for each data preparation step. In section 4.3, the data is visualized after the cleaning step using the robust LR imputation technique.

ML Process and Data Description
The main goal of using ML algorithms is to characterize the behavioral pattern of the assets based on the sensor data, herein provided in the GT dataset. To conduct this ML process, the phases presented by (Diez-Olivan et al., 2019), summarized in the schematic diagram of Fig. 1, are used. The considered data is split into two sets: the training set, representing 70% of the dataset, and the test set, which contains the remaining 30%. The data preparation phase is conducted using different steps with the corresponding techniques. These techniques are presented later in section 4.2. The ML model is built using the training data. Then the model uses the test data to make predictions of the output variable, which indicates the asset's health. Finally, the model performance is evaluated by calculating the error between the actual performance decay and the one predicted by the ML model. This methodology is applied to predict the performance decay of GTs used for a naval propulsion plant. The corresponding dataset was generated by (Coraddu et al., 2016) using a simulator of a naval vessel with a GT propulsion plant. This dataset is available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Researchers in (Coraddu et al., 2016; Cipollini et al., 2018) worked on designing a CBM approach and applying it to GT data for naval propulsion plants, enabling the diagnosis and prognosis of the naval assets. The dataset was used to derive an ML predictive model in order to monitor the performance decay of the propulsion system over time and to identify potential failures in advance.
In the present work, the system's behavior is described by two parameters, the turbine and the compressor degradation coefficients. Both parameters represent the output variables in the dataset. The range of each of these two outputs has been sampled with a uniform grid of precision $10^{-3}$, in order to get a fine-grained representation. Given that the dataset is labeled with degradation coefficients which represent a continuous range, regression models are used to investigate these two parameters in order to perform the CBM approach. The dataset also includes the following 16 features: Lever position, Ship speed, Gas turbine shaft torque, Gas generator rate of revolutions, Gas turbine rate of revolutions, Starboard propeller torque, Port propeller torque, HP turbine exit temperature, GT compressor inlet air temperature, GT compressor outlet air temperature, HP turbine exit pressure, GT compressor inlet air pressure, GT compressor outlet air pressure, GT exhaust gas pressure, Turbine injection control and Fuel flow.
In order to predict the performance decay of GTs, the following three ML algorithms are used: Linear Regression (LR), k-Nearest Neighbors (k-NN) and Neural Network (NN). The k-NN is a supervised learning algorithm. It first selects the k target values whose associated features (inputs) are the closest to the new input according to a distance, and then determines the output value to be predicted based on these k selected target values. A Neural Network (NN) algorithm, specifically a multilayer perceptron, which is an ANN, is implemented. It is a supervised learning model that can learn a non-linear function by training on a set of inputs and an output in a dataset. For more insight into these ML algorithms, the interested reader can consult (Basheer and Hajmeer, 2000; Kramer, 2013). These algorithms were selected based on their popularity among practitioners and researchers (Shukla and Kumar, 2019; Ray, 2019). Each of these algorithms has its own characteristics. LR models are known for being easily interpretable, as the regression coefficients indicate the most important features. The k-NN algorithm, unlike the LR algorithm, does not require a linear relationship between the inputs and the target variable, providing a more flexible approach. The NN algorithm is known for the flexibility provided by its hyperparameters, making it able to learn and model non-linear and complex relationships. Nevertheless, NN has a greater number of hyperparameters than k-NN. Also, the k-NN and NN models are more complex to interpret than LR models. The next section presents the steps and techniques implemented in the data preparation phase.
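The 70/30 split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the study does not say whether the split was random or which seed was used, so the shuffling, the `seed` argument and the function name `train_test_split` are choices made for this example only, and the rows are synthetic placeholders rather than the GT dataset.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows of the form (feature, output); not the GT dataset.
rows = [(float(i), 2.0 * i + 1.0) for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is then fitted on `train` only and evaluated on `test`, so that the reported error reflects observations the model has not seen.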
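The k-NN prediction rule described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the Euclidean distance and the plain average of the k selected targets are assumptions made for the example, and the function name `knn_predict` and the data are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict the output for `query` as the mean target of its k nearest inputs."""
    # Pair every training target with the distance of its input to the query.
    by_distance = sorted(
        (math.dist(xi, query), yi) for xi, yi in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(yi for _, yi in nearest) / len(nearest)


train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0, 12.0]

pred = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
print(pred)
```

Because the prediction depends directly on distances between feature vectors, features with large numeric ranges dominate the neighbor search, which is one reason the normalization step can matter for this algorithm.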

Data Preparation Techniques
In this study, the effect of the three commonly used steps in data preparation, namely data cleaning, data normalization and data reduction, is investigated.
For data cleaning, the following techniques were investigated:
• Mean imputation consists of replacing missing values with the average of the non-missing values of a variable (Malarvizhi and Thanamani, 2012).
• Linear Regression (LR) imputation models a linear relationship between a dependent variable and, when a simple LR model is considered, one independent variable (Yan and Su, 2009). The idea is to set the feature that has missing values as the dependent variable and another feature as the independent variable, assuming the existence of a linear relationship between these two features. In order to construct the LR model, the observations which contain missing values in either of the two features are removed, since the regression coefficients of the model can only be estimated from a complete set of observations. Once the simple LR model is built, it is possible to estimate the missing values of the feature involved. In this study, the coefficients are estimated using the least-squares method.
• The EM algorithm is an iterative algorithm that finds the maximum likelihood estimates of a model's parameters from an incomplete dataset. The number of iterations of this algorithm was set to 50.
• Robust Linear Regression (LR) imputation uses the same methodology as the regular LR imputation technique; however, it relies on a different approach to estimate the regression coefficients. This approach is based on the MILP presented in section 3. For the following experiments, the maximum number of outliers o that the MILP can detect was set to 10% of the total observations in the dataset. This choice is based on preliminary experiments conducted with a range of o values corresponding to five different percentages of the total observations in the dataset (10 to 50%), selecting the o value providing the lowest MAPE when implementing the regression algorithms. Bartoli and Olsen (2006) and Filzmoser (2005) used a similar method to select the maximum percentage of outliers.
• k-NN imputation determines the average value of the k closest neighbors that are the most similar to the feature with the missing value. The calculated average is then imputed in place of the missing value. The closest or most similar k neighbors are selected using a similarity metric, also called a distance metric (Troyanskaya et al., 2001). In this study, the distance metric selected is the Euclidean distance and the number of neighbors k was set to 5. The selection of the neighborhood size k plays an important role in the performance of k-NN. However, as pointed out by (Loukopoulos et al., 2017), there is no global rule for determining the optimal k. In the present paper, preliminary experiments similar to those of (Thanh Noi and Kappas, 2018) are conducted with different values of k between 1 and 20. The k value which gave the lowest MAPE is then selected.
Data normalization consists of scaling the features so that they fall within a smaller range, improving the efficiency and the accuracy of ML algorithms (Han et al., 2011). The two data normalization techniques used in this study are as follows:
• Min-Max normalization is performed using the following formula, from (Kotsiantis et al., 2006):

$$v' = \frac{v - \min_A}{\max_A - \min_A}\left(\mathrm{new\_max}_A - \mathrm{new\_min}_A\right) + \mathrm{new\_min}_A$$

where $v$ is the old value of feature $A$, $v'$ is its normalized value, and $\min_A$ and $\max_A$ are the minimum and maximum values of the feature. In this study, the range chosen is [0, 1]; in other words, $\mathrm{new\_min}_A$ is equal to 0 and $\mathrm{new\_max}_A$ is equal to 1. This means that, for every feature, its minimum value is converted to 0, its maximum value is converted to 1 and all other values are transformed into a decimal between 0 and 1.
• Z-score normalization uses the mean of all the values of the feature and its standard deviation (Kotsiantis et al., 2006), as stated below:

$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$

The data reduction step creates a reduced representation of a dataset based on the original one, making its volume much smaller while maintaining its integrity. Among the data reduction approaches, data dimensionality reduction is chosen. Dimensionality reduction consists of reducing the number of features in the original dataset, aiming to train ML algorithms with fewer features (Han et al., 2011). The following techniques were selected to reduce the data dimensionality:
• Principal Component Analysis (PCA) reduces the number of features by converting the correlated ones into new, linearly uncorrelated features called principal components (PCs). PCA thus creates new orthogonal linear combinations of the initial features with the largest variance. The selection of the number of principal components is related to the percentage of the total variance which needs to be explained (Abdi and Williams, 2010). In this study, the number of components is selected such that the percentage of the total variance explained by the PCs is greater than or equal to 95%.
• Factor Analysis (FA) is also a linear dimensionality reduction method. FA assumes that the observed variables depend on a smaller number of unknown latent variables. The purpose of FA is to uncover such linear relations and to explain the covariance among the observed variables. FA can therefore reduce the dataset by using the new independent latent variables, called factors (Fodor, 2002).
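The two normalization techniques listed above can be sketched in plain Python. Min-Max maps a feature linearly onto a target range (here [0, 1]) and Z-score subtracts the feature mean and divides by its standard deviation. This is a minimal illustration: the function names are hypothetical, the use of the population standard deviation (rather than the sample one) is an assumption, and a constant feature would cause a division by zero in both functions.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center values on their mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, deviation
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 6.0, 8.0, 10.0]
scaled = min_max(feature)
standardized = z_score(feature)
print(scaled)
print(standardized)
```

Min-Max is driven entirely by the two extreme values of the feature, so a single outlier compresses all other values into a narrow band, whereas Z-score is affected by outliers only through the mean and the standard deviation.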

Visualization of Data Cleaning
The purpose of this section is to visualize the data before and after the data cleaning step, in order to observe the results of some of the techniques proposed in the previous sections, particularly the robust LR imputation technique presented in section 3. As mentioned by (Kohavi, 2001), data visualization tools provide an easier way to see trends and to understand patterns and the relationships between the features in the dataset. They make it easier to identify areas which need attention and which can affect the performance of the ML algorithms. A numerical comparison is conducted in the next section; here, the objective is to visualize the effect of the imputation technique in the data cleaning step. Since there are no missing values in the dataset, 10% of the values of each feature are randomly deleted in order to study the effect of handling the missing data with the proposed techniques. The trend of the feature "HP turbine exit temperature" in the original dataset is shown in Fig. 2. The x axis represents the first 200 observations of the dataset and the y axis represents the feature value corresponding to each observation.
The line plot of the same feature after artificially creating the 10% of missing values is shown in Fig. 3. The missing data, which are randomly and discontinuously dispersed, are marked in Fig. 3 with gray circles.
Then the MILP model proposed in section 3 is applied to perform the data cleaning step using the robust Linear Regression imputation technique. The corresponding imputed data are plotted in Fig. 4. The comparison of Fig. 2 with Fig. 4 shows that the difference between the feature trend in the original dataset and the one obtained after handling the missing data using the robust LR imputation technique is minimal.
In order to visually compare the effect of different imputation techniques in filling in the missing data, the mean imputation technique is applied and the corresponding trend is plotted in Fig. 6. This plot shows that the feature trend in the dataset imputed using the robust LR imputation technique (Fig. 5) is more similar to the actual feature trend in the original dataset than the one obtained using the mean imputation technique (Fig. 6). In this section, the approach of imputing missing data in the feature "HP turbine exit temperature" with different cleaning techniques has been shown. In the following section, the effect of these different data preparation steps and techniques on the ML prediction accuracy of GT performance decay is explored in more depth.
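The experiment of this section, deleting 10% of a feature at random and then filling the gaps, can be sketched as follows. This is a minimal illustration under several assumptions: the data are synthetic and exactly linear (not the GT dataset), the predictor feature is assumed to be complete, the regression imputation uses ordinary least squares rather than the robust MILP fit, and all function names are hypothetical.

```python
import random


def delete_at_random(values, fraction=0.1, seed=0):
    """Return a copy of `values` with a fraction of entries replaced by None."""
    rng = random.Random(seed)
    count = round(fraction * len(values))
    missing = set(rng.sample(range(len(values)), count))
    return [None if i in missing else v for i, v in enumerate(values)]


def mean_impute(values):
    """Fill every gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def lr_impute(target, predictor):
    """Fill gaps in `target` from a least-squares line fitted on complete pairs."""
    pairs = [(p, t) for p, t in zip(predictor, target) if t is not None]
    n = len(pairs)
    mp = sum(p for p, _ in pairs) / n
    mt = sum(t for _, t in pairs) / n
    sxy = sum((p - mp) * (t - mt) for p, t in pairs)
    sxx = sum((p - mp) ** 2 for p, _ in pairs)
    b1 = sxy / sxx
    b0 = mt - b1 * mp
    return [b0 + b1 * p if t is None else t for p, t in zip(predictor, target)]


# Synthetic, exactly linear data standing in for two related features.
predictor = [float(i) for i in range(200)]
original = [500.0 + 1.5 * p for p in predictor]
with_gaps = delete_at_random(original)


def mean_abs_error(filled):
    """Average absolute difference between a filled series and the original."""
    return sum(abs(f - o) for f, o in zip(filled, original)) / len(original)


mae_mean = mean_abs_error(mean_impute(with_gaps))
mae_lr = mean_abs_error(lr_impute(with_gaps, predictor))
print(mae_mean, mae_lr)
```

On a feature with a clear trend, mean imputation replaces every gap with the same value and therefore flattens the series, while a regression-based fill follows the trend, which is the qualitative difference visible when comparing the imputed plots.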

Computational Experiments and Results
In order to measure the effect of the data preparation steps and techniques, the metric used to evaluate the performance of the ML algorithms has to be defined first. Then, the experimental results of testing the data preparation steps with the different techniques listed in section 4.2 are presented. Experiments were conducted using the Python 3.6 programming language.

Metric
To evaluate the effect of different steps and techniques on the prediction accuracy of the ML algorithms, the Mean Absolute Percentage Error (MAPE) was used. It is defined as follows:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ is the value of the i-th observation of the output in the original dataset, $\hat{y}_i$ is the value of the i-th observation of the predicted output and $n$ is the number of observations in the test dataset. The MAPE, as a relative measure, is a frequently used ML evaluation metric and is known for its advantages of scale-independence and interpretability. In (Coraddu et al., 2016; Cipollini et al., 2018), the authors chose the same metric on the same GT dataset. This metric has also been successfully adopted in similar studies in recent years (Wisyaldin et al., 2020; Velasco-Gallego and Lazakis, 2020). Given that there are two outputs, the "GT compressor decay coefficient" and the "GT decay coefficient", this multi-target problem is tackled by decomposing it into two single-target sub-problems. The same calculation approach was used in (Coraddu et al., 2016; Cipollini et al., 2018). This means that the ML algorithms predict each output separately, considering one decay coefficient at a time, and then the average of the two MAPEs is computed. The lower the MAPE, the better the prediction accuracy.
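The metric and the averaging over the two outputs can be sketched as follows. This is a minimal illustration: the function name `mape` and the numeric values are made up for the example and are not results from the GT dataset. The function assumes that no actual value is zero, since the MAPE is undefined in that case.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Illustrative values only; each output is evaluated as its own sub-problem.
compressor_true = [0.95, 0.97, 1.00]
compressor_pred = [0.96, 0.97, 0.99]
turbine_true = [0.975, 0.990, 1.000]
turbine_pred = [0.975, 0.985, 1.000]

average_mape = (
    mape(compressor_true, compressor_pred) + mape(turbine_true, turbine_pred)
) / 2
print(average_mape)
```

Averaging the two single-target MAPEs gives one figure per data preparation pipeline, which makes pipelines directly comparable.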

Effect of Data Preparation Steps on ML Performances
In order to study the effect of data preparation steps on the performance of ML algorithms, different combinations of these steps on the GT dataset were applied. The following four combinations are considered: 1. The application of only one step which is the data cleaning step 2. The application of both steps: Data cleaning and data normalization 3. The application of both steps: Data cleaning and data reduction 4. The application of all 3 steps: Data cleaning, data reduction and data normalization Following the ML process presented in section 4.1, these four combinations are applied considering the different techniques presented in section 4.2. Results for the three ML algorithms: The LR, k-NN and the NN are computed. For each of these algorithms, if only the data cleaning step is performed, five different MAPEs are obtained, each one corresponding to a data cleaning technique. If both steps of data cleaning and data normalization are performed, the results show ten different MAPE, considering the five data cleaning techniques and the two data normalization techniques used. For each combination and for each algorithm, the lowest MAPE is selected. The following graph, Fig. 7 summarizes the outcome of this experimentation.
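The enumeration of technique combinations described above can be sketched with `itertools.product`. This is a minimal illustration: the technique labels are taken from section 4.2, while the dictionary `pipelines`, the helper `best_pipeline` and the scores are hypothetical. The counts of 5 and 10 for the first two combinations are stated in the text; the counts for the last two are obtained here by the same counting and are not stated in the source.

```python
from itertools import product

cleaning = ["mean", "LR", "EM", "robust LR", "k-NN"]
normalization = ["min-max", "z-score"]
reduction = ["PCA", "FA"]

# Every technique choice for each of the four step combinations.
pipelines = {
    "cleaning": [(c,) for c in cleaning],
    "cleaning + normalization": list(product(cleaning, normalization)),
    "cleaning + reduction": list(product(cleaning, reduction)),
    "cleaning + reduction + normalization": list(
        product(cleaning, reduction, normalization)
    ),
}


def best_pipeline(scores):
    """Return the (pipeline, MAPE) pair with the lowest MAPE."""
    return min(scores.items(), key=lambda item: item[1])


for name, options in pipelines.items():
    print(name, len(options))
```

In use, each tuple would be turned into a prepared dataset, an ML model would be trained and scored with the MAPE, and `best_pipeline` would pick the lowest value for each combination and algorithm.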
Fig. 7 shows that the MAPE of the NN model, after only cleaning the data, is very high, indicating that this algorithm is considerably sensitive to the data normalization and/or data reduction steps. Besides, the results point out that the k-NN and NN algorithms perform better than the LR algorithm, regardless of the data preparation steps. The figure also shows that the combination of steps chosen to prepare the data has an impact on the performance of these algorithms. In fact, the best performance obtained by the LR (1.35%) results from implementing only the data cleaning step, whereas the k-NN and NN models perform better when all three data preparation steps are applied. As shown in Fig. 7, the MAPE of the k-NN model trained on the dataset prepared with all three steps (0.01%) is remarkably lower than the one obtained with the data cleaning step alone (0.93%). Likewise, the MAPE of the NN model built after performing all the data preparation steps is its lowest (0.88%). This illustrates that the MAPE of the ML model varies according to the data preparation steps used. In other words, the prediction accuracy of GTs performance decay is related to the choice of data preparation steps. The next section further investigates the effect of the technique chosen to conduct a data preparation step. To this end, the effect of using different data cleaning techniques on the performance of ML algorithms is analyzed.
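The experimental design described in this section (enumerating the technique choices for each combination of steps, then keeping the lowest MAPE per combination) can be sketched as follows. This is only an outline under stated assumptions: the technique labels are placeholders (the fifth cleaning technique and the second normalization technique are not named in this section, and only PCA is named for reduction), and `best_per_combination` is a hypothetical helper, not the code used in the study.

```python
from itertools import product

# Placeholder labels: five cleaning and two normalization techniques, as stated
# in the text. "other_cleaning" and "min_max" are assumptions for illustration.
CLEANING = ["robust_lr", "mean", "em", "knn", "other_cleaning"]
NORMALIZATION = ["z_score", "min_max"]
REDUCTION = ["pca"]  # only PCA is named here; the study may use more

# The four combinations of data preparation steps.
COMBINATIONS = {
    "cleaning": (CLEANING,),
    "cleaning+normalization": (CLEANING, NORMALIZATION),
    "cleaning+reduction": (CLEANING, REDUCTION),
    "cleaning+reduction+normalization": (CLEANING, REDUCTION, NORMALIZATION),
}


def enumerate_configs(combinations):
    """List every choice of techniques for each combination of steps."""
    return {name: list(product(*steps)) for name, steps in combinations.items()}


def best_per_combination(results):
    """Keep the configuration with the lowest MAPE for each combination.

    `results` maps a combination name to {configuration: MAPE}.
    """
    return {name: min(scores.items(), key=lambda kv: kv[1])
            for name, scores in results.items()}


configs = enumerate_configs(COMBINATIONS)
print(len(configs["cleaning"]))                # 5 configurations
print(len(configs["cleaning+normalization"]))  # 10 configurations
```

In such a design, each configuration would be evaluated once per ML algorithm, and only the minimum MAPE per combination would be reported.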

Effect of Data Cleaning Techniques on ML Performances
This section focuses solely on the effect of changing the data cleaning technique. The aim is to study the effect of these techniques on the performance of ML algorithms, specifically the k-NN and the NN algorithms, since they gave the best accuracies in section 5.2. For that purpose, the effect of using the EM algorithm as a data cleaning technique is compared with that of the robust Linear Regression technique. The case where only data cleaning is conducted before NN implementation is not considered here because, as shown in Fig. 7, the corresponding MAPE is extremely high. Results are provided in Fig. 8.
As seen in Fig. 8, the performance of the k-NN algorithm considerably deteriorates when the EM cleaning technique is used rather than the robust Linear Regression technique. With EM cleaning, the NN algorithm, rather than the k-NN, produces the best performance. The choice of data cleaning technique therefore also has a significant impact on the performance of ML algorithms. In other words, both the chosen data preparation steps and the techniques used to implement them affect the ML prediction accuracy of the GTs performance decay. This result is in agreement with the one provided in (Nawi et al., 2017), where the authors found that ANN prediction accuracy considerably deteriorates when the data normalization step is conducted using the Min-Max technique rather than the Z-score.
Experiments to investigate the effect of data normalization and reduction techniques on the prediction accuracy of the GTs performance degradation are also conducted. Results for the k-NN algorithm are plotted in the following graph, Fig. 9.
As shown in Fig. 9, the performance of the k-NN algorithm varies according to the data normalization and reduction techniques used. Specifically, this figure reveals that the best accuracy of the k-NN algorithm is reached when data cleaning is conducted with the robust Linear Regression, data normalization with the z-score technique and data reduction with PCA. Given that these techniques gave the best prediction accuracy, the next section investigates whether this performance is affected by the combination of data preparation steps.
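One plausible reason why the k-NN algorithm is sensitive to the normalization step is that it relies on distances between observations, so a feature with a large numerical range can dominate the neighbor search. The following minimal sketch illustrates this with z-score normalization and a plain k-NN regressor. It is not the implementation used in this study: the helper names and the toy data are hypothetical, the target is constructed to follow the second feature only, and the PCA step is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def fit_zscore(rows):
    """Estimate per-feature mean and standard deviation on the training rows."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant features
    return mus, sigmas


def apply_zscore(rows, params):
    """Apply z-score normalization with previously fitted parameters."""
    mus, sigmas = params
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def knn_predict(train_X, train_y, query, k=1):
    """Predict the mean target of the k nearest training rows (Euclidean distance)."""
    neighbors = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    return mean([y for _, y in neighbors[:k]])


# Toy data: feature 1 has a large range, the target follows feature 2 only.
train_X = [[1000.0, 0.1], [1010.0, 0.9], [2000.0, 0.1], [2010.0, 0.9]]
train_y = [0.1, 0.9, 0.1, 0.9]
query = [1004.0, 0.9]

raw_pred = knn_predict(train_X, train_y, query)

params = fit_zscore(train_X)
scaled_pred = knn_predict(apply_zscore(train_X, params), train_y,
                          apply_zscore([query], params)[0])
print(raw_pred, scaled_pred)
```

On the raw features, the nearest neighbor is selected almost entirely by the large-range feature. After z-score normalization, both features contribute comparably to the distance. The normalization parameters are fitted on the training rows only and then reused for the query, which avoids leaking test information into the preparation step.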

Effect of Data Cleaning Techniques in Data Preparation Steps
This study is conducted for the k-NN algorithm. For each of the four combinations of data preparation steps and for each data cleaning technique implemented, the results are plotted for z-score normalization and PCA reduction, which gave the best prediction accuracy of the k-NN algorithm. Fig. 10 presents the resulting MAPE of the k-NN algorithm after preparing the data.
Fig. 10 indicates that the best performance of the k-NN algorithm is obtained when the robust Linear Regression imputation is selected as the data cleaning technique. However, the algorithm reaches this performance only when it is applied to normalized and reduced data. In fact, the MAPE of the k-NN algorithm is equal to 0.01% when all three data preparation steps are applied with the robust Linear Regression imputation technique. However, when only data cleaning and data reduction are conducted, the MAPE deteriorates and reaches 0.94%. In this case, the k-NN imputation technique provides the best accuracy. This means that the choice of the steps is sensitive to the technique used to implement them. Thus, to benefit from the good prediction accuracy of the GTs performance decay using the k-NN, it is necessary to impute data using the robust Linear Regression technique, then normalize the data using the z-score technique and finally reduce the data using the PCA technique. The importance of normalizing the data using the z-score technique before applying PCA was also reported in (Gao et al., 2019; Obaid et al., 2019). The figure also shows that, when missing data are handled with Mean imputation, the EM algorithm or k-NN imputation, the MAPE is the lowest when only the data cleaning and data reduction steps are performed. On the other hand, when the robust Linear Regression imputation technique is used to handle the missing data, the k-NN algorithm performs best when all three data preparation steps are applied. Hence, the data preparation steps should be chosen according to the data cleaning technique in order to obtain the best performance of the ML algorithm.
These experiments lead to the conclusion that changing the steps and/or the techniques during data preparation may have a considerable effect on the prediction accuracy of the ML algorithms. In fact, although cleaning the data with the proposed MILP for robust Linear Regression imputation has led to the best prediction of the GTs performance decay, reaching this prediction accuracy requires applying z-score normalization and PCA data reduction.

Conclusion
This paper investigates the effect of different data preparation steps and techniques on the ML prediction accuracy of GTs performance decay. An accurate degradation prediction model of GTs performance is highly desirable to reach an effective CBM strategy that predicts the degradation of the propulsion plant over time and schedules maintenance in advance.
First, a literature review was conducted to identify the techniques and steps used for data preparation in the CBM context. Then, based on former works in the biomedical field, a new MILP model was proposed to implement a robust Linear Regression imputation technique. The effect of this imputation technique in the data cleaning step was shown with trend visualization.
Computational experiments were conducted using three different Machine Learning algorithms to predict the performance decay of GTs. Results have revealed that the k-NN prediction model may provide the best prediction accuracy when data are cleaned using the robust Linear Regression technique, normalized using the z-score and reduced with the PCA technique. Otherwise, if data are prepared using different steps or different techniques, the prediction accuracy deteriorates. In fact, the results show that when only the data cleaning step is conducted, the prediction accuracy of the NN surpasses that of the k-NN. This means that the ML prediction accuracy of the degradation and failure state is affected by changes in the data preparation steps and techniques used.
The main finding is that, in order to benefit from the high prediction capability of the proposed Machine Learning algorithm in CBM, researchers should clarify how data have been prepared, specifically with which steps and techniques they have conducted the data preparation phase before applying the ML algorithm to predict the degradation and failure state.
Future research is intended to explore in more depth this complex interaction between the proposed ML algorithms for CBM and the data preparation steps and techniques. The experiments show that this interaction may considerably deteriorate the prediction accuracy of degradation and failure, which in turn can have a major impact on the applicability of these ML algorithms in practice. In fact, for effective CBM application, there is a need to develop a systematic methodology for the design and selection of adequate data preparation steps and techniques with the proposed ML algorithms.