Toward an Online DoS/DDoS Classification: An Empirical Study for Network Intrusion Detection Systems

Corresponding Author: Tran Hoang Hai, School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam. Email: hai.tranhoang@hust.edu.vn

Abstract: In recent years, Distributed Denial of Service (DDoS) attacks have caused significant losses to industry and government as the number of devices connected to the Internet has grown. These devices increasingly rely on Internet services delivered seamlessly through 5G, Cloud and Edge Computing. According to the Cisco Annual Internet Report, the frequency of DoS/DDoS attacks has increased more than 2.5 times over the last 3 years, and the average attack size is increasing steadily and approaching 1 Gbps. Cyber threats therefore continue to grow despite the development of new protection technologies. Our work is motivated by the goal of studying and evaluating four machine learning models toward the development of an online Network Intrusion Detection System (N-IDS). This article studies the application of three feature selection algorithms combined with four machine learning models to N-IDS. We evaluate the performance of the proposed model on three up-to-date DoS/DDoS datasets. We show that the combination of Feature Importance and the k-Nearest Neighbors (KNN) algorithm gives better results than previous work on all benchmark datasets, and we present the empirical results of all four machine learning models and three feature selection algorithms in detail.


Introduction
In recent years, Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks have been among the most common cyber-attacks (Lima Filho et al., 2019).
The Internet provides an open environment in which any host can communicate with any other. A growing number of devices are connected to the Internet and use Internet services, driven by the rapid development of Ubiquitous Computing and the Internet of Things (IoT), with services delivered seamlessly through 5G and Edge Computing. The main objective of an attacker is to take down websites or enterprise intranets, or to prevent users from using a specific Internet service. According to the Cisco Annual Internet Report (2018-2023), the frequency of DoS/DDoS attacks has increased more than 2.5 times over the last 3 years, and the average attack size is increasing steadily and approaching 1 Gbps. These attacks are strong enough to take most organizations completely offline. DDoS attacks can be performed using a botnet controlled by the attacker. The owners of compromised computer systems are unaware that botnet software is installed on their computers or that they are part of a botnet. With the increasing number of Internet users, DDoS attacks have become the second most significant threat to Internet users after virus infection (Gupta et al., 2009). Moreover, an individual attacker can easily launch an attack using open source tools that are readily available on the Internet. Gupta et al. (2009) classify DDoS attacks into two broad categories: flooding attacks and logical attacks. Flooding attacks generate a huge volume of traffic at the victim side, which makes the target computer incapable of handling requests from legitimate users. There are several types of flooding attacks, such as SYN flooding, ICMP flooding and UDP flooding. In logical attacks, the attacker exploits known software bugs, as in the Ping of Death, Teardrop and Land attacks.
Network Intrusion Detection Systems (N-IDS) play an extremely important role in security management: they help network administrators detect DoS/DDoS attacks by flagging unusual behavior and deciding whether a traffic flow is an attack or normal traffic. Currently, network administrators apply solutions such as firewalls to block some unwanted traffic, but detection itself is left to the network manager's own techniques. In a traditional rule-based N-IDS such as Snort or Suricata, the rules are usually predefined by security experts and need to be updated regularly (Karthikeyan and Indra, 2010; Turner et al., 2016). The advantage of a rule-based N-IDS is that it can detect specific attacks in detail, which gives better accuracy and fewer false alarms. However, as traffic volume increases, it becomes difficult for network experts to keep up with the system (Srivastava et al., 2011). Therefore, we propose a smart N-IDS that can capture network traffic and analyze and detect network anomalies automatically. With the rapid development of machine learning models, several methods have been proposed to build a knowledge system on top of the IDS (Gupta et al., 2009; Karthikeyan and Indra, 2010), where abnormal traffic can be detected and prevented automatically. Another type of N-IDS is based on statistical analysis: it analyzes the statistical behavior of users to find abnormal behavior, on the assumption that malicious traffic differs from typical user traffic (Chellam et al., 2018). In recent years, both industry and academia have made a huge effort to address this problem. Approaches range from filtering-based approaches (Savage et al., 2000; Song and Perrig, 2001; Argyraki and Cheriton, 2005; Mahajan et al., 2002; Ioannidis and Bellovin, 2002; Liu et al., 2008) to capability-based approaches (Yaar et al., 2004; Yang et al., 2008; Liu et al., 2010; 2016).
We believe that a knowledge system which inherits the latest developments in machine learning to combat these risks is extremely important (Panja et al., 2014; Zwass, 2018; Fuchsberger, 2005). Statistical methods (Larranaga et al., 2013) and the Bayes algorithm (Seraphim et al., 2018) are typical representatives in this field. An expert system, which uses artificial intelligence to solve problems in a field that requires human expertise, is currently the most feasible solution. The application of machine learning algorithms is a breakthrough that provides an efficient tool for deploying N-IDS in practice; details can be found in (Khraisat et al., 2019; Dali et al., 2015). Moreover, deep learning is a subset of machine learning that can outperform traditional machine learning by learning to represent the data as a nested hierarchy of concepts. A general survey of anomaly detection using deep learning can be found in (Chalapathy and Chawla, 2019). The rest of the paper is organized as follows. Section 2 gives an overview of related work on the application of machine learning models to N-IDS and of the related benchmark datasets. Section 3 presents our proposed model for evaluating machine learning models with several feature selection algorithms. Section 4 presents the performance comparison and, finally, conclusions are given in Section 5.

Machine Learning Models to Network Anomaly Detection
In this study, we choose four popular machine learning algorithms for our model: the k-Nearest Neighbors algorithm (KNN) (Altman, 1992), Adaptive Boosting (AdaBoost) (Gandhi, 2018), Random Decision Forests (Random Forest) (Ho, 1995) and the Support Vector Machine (SVM) (Cortes and Vapnik, 1995). The motivation is that these algorithms appear in recent work on applying machine learning to N-IDS, where they provide better results and lower processing time than others (Khraisat et al., 2019; Dali et al., 2015). In pattern recognition, the k-Nearest Neighbors algorithm (k-NN) is a non-parametric method proposed by Thomas Cover and used for classification and regression (Altman, 1992). Among data mining techniques, KNN has been used prominently due to its good accuracy and detection rates. Wang et al. (2009) use attribute normalization to improve the performance of intrusion detection with three methods: KNN, Principal Component Analysis (PCA) (Jolliffe, 1986) and SVM. The KDD Cup 1999 dataset is used to evaluate the normalization schemes and the detection methods (UCI, 1999). Panda et al. (2012) proposed a two-class classification strategy on an early version of the NSL-KDD dataset with 10-fold cross-validation. Their model produced a high detection rate and a low false alarm rate when separating normal from anomalous traffic. Jamshidi and Nezamabadi (2013) introduced a nonlinear valuation function based on a lattice-based nearest neighbor classifier to tune intrusion detection performance, evaluated on the old KDD Cup'99 dataset. Tharwat et al. (2013) designed and developed three different classifiers based on the KNN concept for facial age estimation to achieve high efficiency.
Rao and Swathi (2017) adapted two fast KNN classification algorithms, Indexed Partial Distance Search k-Nearest Neighbor (IKPDS) and Partial Distance Search k-Nearest Neighbor (KPDS), and compared them with traditional KNN classification for network intrusion detection on the NSL-KDD dataset (NSL-KDD, 2009). Benaddi et al. (2018) propose a PCA-Fuzzy Clustering-KNN method, which combines Principal Component Analysis and fuzzy clustering with K-Nearest Neighbor feature selection techniques to detect anomalies.
Adaptive Boosting (AdaBoost) is a machine learning meta-algorithm proposed by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work (Zhang et al., 2005). AdaBoost belongs to the boosting class of methods (sometimes referred to as an ensemble learning approach): it trains a sequence of weak classifiers, each correcting the errors of the previous ones, and combines them into a strong classifier. Shahraki et al. (2020) investigate the feasibility of N-IDS using the best-known versions of the boosting algorithm, including Real AdaBoost (Schapire and Singer, 1999), Gentle AdaBoost (Friedman et al., 2000) and Modest AdaBoost (Vezhnevets and Vezhnevets, 2005), on five public IDS datasets.
Random Forests (RF) is an ensemble learning method for classification and regression that constructs a multitude of decision trees at training time and outputs the class chosen by most trees (classification) or the average prediction (regression) of the individual trees (Ho, 1998; Biau, 2012). Following (Resende and Drummond, 2018), RF uses multiple decision trees for classification. The algorithm assumes that if a sample is classified by multiple decision trees, it is assigned to the class chosen by most trees. Several authors show that the RF model applied to N-IDS is efficient, with a low false alarm rate and a high detection rate (Efron, 1979). To improve accuracy, RF uses a process called bootstrapping, a statistical resampling technique that involves random sampling of a dataset with replacement (Peng et al., 2002). In addition, to make sure the decision trees differ, RF randomly skips a few candidate questions (attributes) when building each decision tree; if the best question is not available, the next best one is used to build the tree. This process is called attribute sampling.
Support Vector Machine (SVM) is an efficient classification technique for a wide variety of problems (Thai et al., 2012) and often provides considerable improvement over competing methods. Winter et al. (2011) proposed a lightweight IDS that uses a one-class SVM to analyze incoming net-flows. Goeschel (2016) proposed combining a linear SVM, decision trees and Naïve Bayes to reduce the number of false alarms of the IDS, and evaluated the model on the old KDD Cup'99 dataset (UCI, 1999). Lee et al. (2005) proposed an IDS model that consists of a one-class SVM for anomaly detection during an initial analysis, a multi-class SVM for traffic classification into four classes (i.e., Denial-of-Service, Remote to Local, User to Root and Probing attacks) and a final clustering process. Khan et al. (2007) proposed a new approach for enhancing the training process of SVM when dealing with large training datasets. This work combines SVM and clustering analysis to reduce the number of instances used during the computation of the support vector margin, which in turn reduces the training time without affecting the results. Sahu et al. (2019) used an ensemble of supervised (SVM) and unsupervised (K-Means) learning to detect anomaly patterns, which provides more than 99% on three benchmark datasets.
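The three Random Forest ingredients described in this section (bootstrapping, attribute sampling and majority voting) can be illustrated with a minimal sketch. It is not the implementation used in this study: the toy rows, the helper names and the choice to sample attributes once per tree are all assumptions made for illustration, and no trees are actually trained.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def sample_attributes(n_features, k, rng):
    """Pick k distinct feature indices for one tree (attribute sampling)."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Return the class predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy flows: two feature values and a label.
rng = random.Random(0)
rows = [
    (0.1, 0.9, "normal"),
    (0.8, 0.2, "attack"),
    (0.7, 0.1, "attack"),
    (0.2, 0.8, "normal"),
]

sample = bootstrap_sample(rows, rng)                      # training set for one tree
features = sample_attributes(n_features=2, k=1, rng=rng)  # attributes that tree may use
vote = majority_vote(["attack", "normal", "attack"])      # votes of three trees
```

Because each tree sees a different resampled dataset and a different attribute subset, the trees disagree on some records, and the vote averages out their individual errors.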

Feature Selection
Our work is strongly motivated by (Lima Filho et al., 2019), where the model does not process all features of a network traffic flow but performs feature selection before the training phase. Feature selection is an important step in the pattern recognition process and consists of defining the smallest possible set of variables capable of efficiently describing a set of classes (Ganapathy et al., 2013). According to (Miao and Niu, 2016), the noisy, redundant and irrelevant dimensions of large-scale data can not only make learning algorithms very slow but also degrade the performance of learning tasks. The benefits of feature selection are preventing overfitting (less redundant data reduces the chance of making decisions based on noise) and reducing training time (less data reduces algorithm complexity). In this study, we propose to use Principal Component Analysis (PCA) (Wold et al., 1987; Shlens, 2014), Feature Importance (Abualigah et al., 2017) and Univariate Selection (Rahman and Xu, 2004) instead of the Recursive Feature Elimination with Cross-Validation technique used in (Lima Filho et al., 2019). PCA is a statistical method that projects data from a higher dimensional space into a lower dimensional space while maximizing the variance along each new dimension. PCA builds new coordinate axes that represent the data equally well and preserve the variability of the data along each new dimension. PCA can also uncover relations between features in the new space that are difficult to detect in the original space, where they are not visible. Feature Importance refers to a feature selection technique that assigns a score to each feature of the network traffic data: the higher the score, the more important or relevant the feature is to the output variable. The importance scores are very useful for understanding the data and the model, and we can choose to remove features with lower importance and keep those with higher importance.
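The PCA projection described above can be written compactly. The notation below is an assumption introduced for illustration and does not appear elsewhere in this paper: X is the n × d matrix of mean-centered flow features, and k is the target number of dimensions.

```latex
% Assumed notation: X is n x d and mean-centered; k <= d is the target dimension.
\Sigma = \frac{1}{n-1} X^{\top} X, \qquad
\Sigma w_i = \lambda_i w_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d, \qquad
Z = X W_k, \quad W_k = [\, w_1 \; \cdots \; w_k \,]
```

Each column of Z is a linear combination of all original features, which is why the reduced dimensions cannot be read as a subset of the original flow attributes.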
Univariate feature selection examines each feature individually to determine the strength of the relationship between the feature and the response variable. Both Feature Importance (FI) and Univariate Selection (US) help us identify redundant features to remove, so that fewer traffic features need to be sampled and system storage space is saved. These methods are simple and particularly good for obtaining a better understanding of the data (Rahman and Xu, 2004). With FI, we propose using Extra Trees to estimate the importance of each feature, because Extra Trees uses random splits and can therefore save computation time compared with Random Forest, which searches for the best split. We then set a threshold in SelectFromModel to control the number of important features kept: features whose importance is greater than or equal to this threshold are selected. US also calculates a score for each feature, so in the proposed system we use SelectKBest from scikit-learn to select the features we would like to keep by setting the number of top-scoring features to retain.
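A minimal sketch of the two selectors described above, using scikit-learn, is given here. The synthetic data, the "mean" threshold, the number of trees and k = 5 are illustrative assumptions, not the settings used in our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for flow records: 200 flows, 10 features, binary label.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

# Feature Importance: Extra Trees scores each feature; SelectFromModel keeps
# the features whose importance is at least the threshold (here, the mean).
fi_selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0), threshold="mean"
).fit(X, y)
X_fi = fi_selector.transform(X)
importances = fi_selector.estimator_.feature_importances_

# Univariate Selection: keep the k features with the highest chi-squared score.
us_selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
X_us = us_selector.transform(X)
```

Both selectors keep original columns, so `fi_selector.get_support()` and `us_selector.get_support()` report which flow features would still need to be collected.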

Benchmark DoS/DDoS Datasets
In this study, we use three benchmark DoS/DDoS datasets: NSL-KDD (NSL-KDD, 2009), CICIDS 2017 (Sharafaldin et al., 2018) and the simulated traffic in (Lima Filho et al., 2019). NSL-KDD 2019 is the up-to-date dataset we choose for testing the model, since it improves considerably on KDD CUP 99. The KDD Cup is an international competition for knowledge discovery and data mining tools. The task of the competition was to design a network intrusion detection system, that is, a predictive model that can distinguish "normal" from "abnormal" connections. The competition produced a collection of network traffic records known as the KDD'99 dataset; the NSL-KDD dataset was later created at the University of New Brunswick as an optimized version of KDD'99 (NSL-KDD, 2009). The complete NSL-KDD 2019 dataset is an up-to-date dataset that contains new types of attacks and removes duplicates from the KDD'99 dataset. The resulting dataset contains about 150,000 data points and is divided into predefined training and test subsets: KDDTest+, KDDTest-21, KDDTrain+ and KDDTrain+_20%, where KDDTest-21 and KDDTrain+_20% are subsets of KDDTest+ and KDDTrain+, respectively. KDDTrain+ is the training data and KDDTest+ is the testing data. KDDTest-21 is a subset of the testing data that excludes the records with a difficulty level of 21. KDDTrain+_20% is a subset of the training data whose number of records equals 20% of the total number of records in the training data. In other words, the records in KDDTest-21 and KDDTrain+_20% are already included in the testing and training data, respectively, and no record appears in both the training and testing data, which makes the evaluation of anomaly detection more accurate. CICIDS 2017 was created in an emulated environment over a period of 5 days and contains network traffic in packet-based and bidirectional flow-based formats.
For each flow, the authors extracted more than 80 attributes/features and provide additional metadata about IP addresses and attacks. This dataset contains a wide range of attack types such as SSH brute force, Heartbleed, botnet, DoS, DDoS, web and infiltration attacks. In the original article by the CIC group that published this dataset (Sharafaldin et al., 2018), Iman Sharafaldin and colleagues used the RandomForestRegressor algorithm to select the best features for each specific type of attack in the CICIDS dataset. To select these features, they calculated the weight of each feature for each attack type. The selected features were then tested for performance and accuracy with seven machine learning algorithms. The results show that the ID3 model provides the highest F1 score, reaching 98%. Aksu et al. (2018) proposed a model using the Fisher score algorithm to select 30 optimal features from the 80 of the CICIDS2017 dataset. They then applied several machine learning algorithms such as KNN, SVM and Decision Tree to test and evaluate the results, obtaining an F1 measure of 0.99 on the DDoS dataset. The simulated network data in (Lima Filho et al., 2019) was produced by simulating several VLANs that can connect to the Internet. The authors generate one attack every 30 min, giving 48 attack events in 24 h, starting at 00:00 and ending at 23:59. The attack tools are parameterized to create sneaky low-volume, medium-volume or light mode and massive high-volume attacks, and the resulting simulation dataset contains 73 features per record. Each record is clearly labeled as "normal" or "attack".

Proposed Model
In this study, we focus on applying several feature selection techniques to network flows and on using different machine learning algorithms to build different training systems. We evaluate the execution time and the accuracy of DDoS attack detection across the models. The empirical study is illustrated in Fig. 1. The input of our proposed model consists of three benchmark DoS/DDoS datasets: NSL-KDD 2019, CICIDS 2017 and the simulated traffic in (Lima Filho et al., 2019). After receiving the input data, the feature selection block uses three different feature selection techniques: PCA, Feature Importance (using Extra Trees and SelectFromModel of scikit-learn) and Univariate Selection (using SelectKBest with the chi-squared algorithm), where:
• PCA: We recalculate the relationship between features and reduce the number of data dimensions to the desired quantity. In order to find the appropriate number of dimensions, we have to repeat this step several times.
• Feature Importance and Univariate Selection: Although they evaluate individual features in different ways, both techniques calculate "points" for each feature and then retain the features with "points" higher than the preset threshold.
With the PCA algorithm, the value of the n_components parameter corresponds to the dimension of the data after projecting the old data into the new data space. With the Extra Trees and chi-squared algorithms, the number of dimensions depends on the feature importance values calculated by these two algorithms. In order to compare classification execution time against classification performance, the data reduced by the three proposed algorithms must have the same number of dimensions. A matching value is chosen based on the trade-off between accuracy, algorithm execution time and the number of dimensions of the data.
According to Table 1, after applying the dimension reduction algorithms, the number of dimensions of the NSL-KDD dataset decreased from 42 to 21, that of the CICIDS 2017 dataset decreased from 68 to 23 and that of the simulated dataset decreased from 73 to 20. Thus, these algorithms significantly reduce the dimension of the data, saving computer resources and computation time.
After dimension reduction with the feature selection techniques, we have a new dataset whose dimension is much smaller than that of the original data. We then train each machine learning algorithm (KNN, AdaBoost, Random Forest and SVM) on this dataset to classify DoS/DDoS attacks.
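The requirement that all three reducers output the same number of dimensions before classification can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not our experimental code: the synthetic data, K = 8, the 70/30 split and the use of KNN as the only classifier are all hypothetical choices.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

K = 8  # hypothetical common number of dimensions for all three reducers

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

reducers = {
    "PCA": PCA(n_components=K),
    # threshold=-inf disables the score cut-off so that exactly K features remain.
    "Feature Importance": SelectFromModel(
        ExtraTreesClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf,
        max_features=K,
    ),
    "Univariate Selection": SelectKBest(chi2, k=K),
}

results = {}
for name, reducer in reducers.items():
    model = make_pipeline(MinMaxScaler(), reducer, KNeighborsClassifier(n_neighbors=5))
    start = time.perf_counter()
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    elapsed = time.perf_counter() - start  # training plus testing time, in seconds
    n_dims = model[:-1].transform(X_test).shape[1]
    results[name] = (n_dims, accuracy, elapsed)
```

Fixing K for every reducer means that differences in `accuracy` and `elapsed` come from how the dimensions were chosen rather than from how many were kept.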

Experimental Classification Results and Analysis
The experimental environment is implemented in the system with the following configuration:  Table 2 and the overall result is illustrated in Table 3. We can see that the computation time after data dimension reduction, including the processing time of the PCA technique itself, has been significantly reduced. PCA recalculates the relationship between features in order to move from a high-dimensional space to a lower-dimensional one. Therefore, every time a traffic flow passes through, the system needs to project that flow into the new space and then analyze whether it is normal or an attack. We also found that the accuracy of the AdaBoost algorithm increased significantly after reducing the data dimension by PCA, but the execution time of both the AdaBoost and Random Forest models increased dramatically. This shows that recalculating the relationship between features helps accuracy but significantly increases the processing time of these two algorithms. This result shows that NSL-KDD 2019 is not suitable for a model that combines PCA-based reduction with AdaBoost or Random Forest to detect attacks. The execution time of KNN is reduced because the fewer dimensions that must be computed, the faster KNN runs. Therefore, the model using KNN and PCA achieved good results on the NSL-KDD 2019 dataset. Moreover, the results of applying Feature Importance to the NSL-KDD 2019 dataset are shown in Table 5 and the overall result is illustrated in Table 4. After performing dimension reduction with the Feature Importance technique, using Extra Trees to calculate the importance of each feature and SelectFromModel to select the features that meet user-defined conditions, we can remove 20 redundant features so that only 21 features are used.
The remaining features are 'is_host_login', 'num_outbound_cmds', 'num_shells', 'urgent', 'num_failed_logins', 'num_root', 'num_file_creations', 'su_attempted', 'num_access_files', 'root_shell', 'is_guest_login', 'land', 'dst_host_srv_diff_host_rate', 'dst_bytes', 'duration', 'dst_host_diff_srv_rate', 'srv_diff_host_rate', 'hot', 'num_compromised', 'service', 'dst_host_same_src_port_rate'. With the obtained results, we found that this training dataset is not suitable for the AdaBoost algorithm. In addition, the system that selects the high-importance features of this training dataset achieves very good results. We found that processing speed could be significantly improved if the system uses the KNN algorithm to detect network attacks: the accuracy decreases slightly, but the execution time is much faster than without eliminating unnecessary features.
We continue by applying Univariate Selection to the NSL-KDD 2019 dataset; the results are shown in Table 6 and Table 7. We perform data dimension reduction with the Univariate Selection technique, which uses the chi-squared algorithm to calculate a chi-squared value for each feature in the dataset and then sorts the features in descending order. We then pass the number of features we want to keep to SelectKBest, and the features are taken in order from high to low chi-squared value until that number is reached. Finally, the remaining 21 features are 'is_host_login', 'urgent', 'num_compromised', 'num_root', 'num_file_creations', 'src_bytes', 'num_shells', 'num_failed_logins', 'dst_bytes', 'num_access_files', 'su_attempted', 'root_shell', 'hot', 'is_guest_login', 'dst_host_diff_srv_rate', 'diff_srv_rate', 'dst_host_srv_diff_host_rate', 'service', 'protocol_type', 'duration', 'dst_host_count'. With the obtained results, we can see that AdaBoost does not work well with this dataset, as was also the case when we performed feature selection with Feature Importance.
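The chi-squared ranking step described above (score each feature, sort in descending order, keep the top k) can be sketched in plain Python. The scoring follows the usual observed-versus-expected form for non-negative features; the toy rows and the feature names `syn_rate` and `constant` are hypothetical and are not taken from NSL-KDD.

```python
def chi2_scores(rows, labels):
    """Chi-squared score of each non-negative feature against the class label."""
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / len(labels) for c in classes}
    scores = []
    for j in range(len(rows[0])):
        total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: feature mass inside class c. Expected: share by class size.
            observed = sum(row[j] for row, lab in zip(rows, labels) if lab == c)
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(scores, names, k):
    """Return the names of the k features with the highest scores."""
    ranked = sorted(zip(scores, names), reverse=True)
    return [name for _, name in ranked[:k]]


# Hypothetical toy data: the first feature separates the classes, the second does not.
rows = [[10, 1], [12, 1], [0, 1], [1, 1]]
labels = ["attack", "attack", "normal", "normal"]
names = ["syn_rate", "constant"]

scores = chi2_scores(rows, labels)
kept = select_k_best(scores, names, k=1)
```

A feature whose total is spread across classes in proportion to class size, like `constant` here, scores zero and is the first to be dropped.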
In summary, the results of the proposed system on the NSL-KDD 2019 dataset show that data reduction by PCA combined with intrusion detection using the KNN algorithm gives the best results. We also implemented the proposed model using SVM, but it obtained the worst results. Moreover, its processing time is very slow, so it is not suitable for an online attack detection system. A possible reason is that computing the separating hyperplane in SVM to classify the attacks is very time-consuming. In Fig. 2, we can see that KNN provides the best results with all feature selection algorithms. The main reason is that the fewer dimensions there are to compute, the faster the KNN algorithm runs. With AdaBoost and Random Forest combined with PCA, however, the results do not meet our expectations. This is probably related to the correlation between features, and AdaBoost and Random Forest are regarded as black box models. It is possible that the reduced correlation between features makes choosing split features to build the trees of both algorithms more difficult and more computationally expensive.
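The argument that KNN speeds up as dimensions are removed can be made concrete with a brute-force sketch. It assumes a plain Euclidean k-NN without any index structure, which may differ from the implementation used in our experiments; the toy flows and the figure of 1000 training records are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Brute-force k-NN: one d-dimensional distance per training record."""
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    nearest = sorted(distances)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def distance_operations(n_train, n_dims):
    """Coordinate differences evaluated to classify a single flow."""
    return n_train * n_dims


train_rows = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_labels = ["normal", "normal", "attack", "attack"]
prediction = knn_predict(train_rows, train_labels, query=(0.85, 0.85), k=3)

# Going from 42 to 21 dimensions halves the work per classified flow.
before = distance_operations(n_train=1000, n_dims=42)
after = distance_operations(n_train=1000, n_dims=21)
```

Since the per-flow cost is proportional to the number of dimensions, every removed feature translates directly into less work at prediction time, which is not the case for tree-based models.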

Evaluation of CICIDS 2017 Dataset
The CICIDS 2017 dataset contains normal traffic records and common attacks with real data packets in PCAP format. It also includes network traffic analysis results produced by CICFlowMeter, with flows described by timestamp, source and destination IP, source and destination port, protocol and attack (CSV files). Data collection began at 9 am on Monday, July 3, 2017 and ended at 5 pm on Friday, July 7, 2017, for a total of 5 days of monitoring. Monday is a normal day and only includes benign traffic. Attacks (including Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS) are carried out in the morning and afternoon on Tuesday, Wednesday, Thursday and Friday. In this study, we use the attack data from Friday afternoon (Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv). Initially, this dataset has 78 features; during pre-processing, we remove 10 features with zero variance, leaving 68 features, and the records are labeled Benign or DDoS. The dataset contains records in which some feature values are NaN or Infinity; these cannot be used in the calculation, so they are deleted. After processing, the dataset includes 97686 Benign records and 128025 DDoS records. We then use Min-Max scaling (Han et al., 2011) to normalize the feature values to the range from -1 to 1 for the performance evaluation. The results of applying PCA to the CICIDS 2017 dataset are shown in Table 8, where we reduced the original 68 features of the dataset to 23 dimensions. We can see that, except for the Random Forest algorithm, whose execution time increased significantly, the execution time decreased significantly. This is because PCA transforms the dataset into a new dataset, which makes the structure of the newly constructed trees different from that of the original trees.
In general, the accuracy decreases after reducing the data dimension, but this reduction is acceptable given the gain in execution time: the accuracy decreases by about 0.01-0.03% while training and testing times are reduced 2-3 times. Overall, KNN still provides good results: its execution time is much faster and it still gives the system a relatively high accuracy. The CICIDS 2017 dataset is made up of many different traffic files covering many different types of DoS/DDoS attacks. In this study, we only use a single file (Friday-  Table 9. After performing dimension reduction with Feature Importance, we get new data with 23 features, which are 'ECE Flag Count', 'RST Flag Count', 'Active Std', 'Active Max', 'Active Min', 'Bwd IAT Min', 'Active Mean', 'Flow Bytes/s', 'Flow IAT Min', 'Fwd IAT Min', 'Bwd Packets/s', 'FIN Flag Count', 'Init_Win_bytes_backward', 'Total Length of Bwd Packets', 'Bwd IAT Mean', 'Idle Std', 'Fwd Packets/s', 'Bwd Header Length', 'Subflow Bwd Bytes', 'Flow Packets/s', 'Idle Min', 'Subflow Fwd Bytes', 'Flow IAT Mean'. The results are very positive: the accuracy of the model decreased slightly but the execution time was much reduced. In addition, when using Feature Importance, only 23 features need to be processed. Therefore, the network administrator only needs to set rules so that only these 23 features are collected from a network flow, which reduces data sampling and increases the processing speed of the overall system.
We continue by applying Univariate Selection to the CICIDS 2017 dataset and the result is shown in Table 10. As before, we obtain 23 features after using Univariate Selection, which are 'ECE Flag Count', 'RST Flag Count', 'Active Std', 'Active Max', 'Active Min', 'Bwd IAT Min', 'Active Mean', 'Flow Bytes/s', 'Flow IAT Min', 'Fwd IAT Min', 'Bwd Packets/s', 'FIN Flag Count', 'Init_Win_bytes_backward', 'Total Length of Bwd Packets', 'Bwd IAT Mean', 'Idle Std', 'Fwd Packets/s', 'Bwd Header Length', 'Subflow Bwd Bytes', 'Flow Packets/s', 'Idle Min', 'Subflow Fwd Bytes', 'Flow IAT Mean'. In Fig. 3, we see that combining feature selection methods gives a good result for each individual model. However, for RF combined with PCA and SVM combined with US, the result is not good in terms of processing time.
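The pre-processing applied to the CICIDS 2017 file at the start of this section (dropping records with NaN or Infinity, removing zero-variance features and Min-Max scaling to the range from -1 to 1) can be sketched in plain Python. The toy records and the order of the first two steps are assumptions made for illustration; our actual pipeline may differ.

```python
import math


def drop_non_finite_rows(rows):
    """Discard records that contain NaN or +/-Infinity in any feature."""
    return [row for row in rows if all(math.isfinite(v) for v in row)]


def drop_constant_columns(rows):
    """Remove zero-variance features; also return the indices that were kept."""
    keep = [j for j in range(len(rows[0])) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep


def min_max_scale(rows, low=-1.0, high=1.0):
    """Rescale every feature linearly to [low, high]; columns must not be constant."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    return [
        [low + (v - mn) * (high - low) / (mx - mn) for v, mn, mx in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical toy records with one constant column and two unusable rows.
raw = [
    [1.0, 0.0, 5.0],
    [3.0, 0.0, float("inf")],
    [2.0, 0.0, 7.0],
    [float("nan"), 0.0, 6.0],
    [5.0, 0.0, 9.0],
]
clean = drop_non_finite_rows(raw)
reduced, kept_columns = drop_constant_columns(clean)
scaled = min_max_scale(reduced)
```

Constant columns are removed before scaling because their range is zero, which would otherwise cause a division by zero in the scaling step.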

Evaluation of the Simulated Traffic
This dataset is quite similar to a real network operating environment (Lima Filho et al., 2019). Therefore, we chose this dataset for the performance evaluation of the proposed model. However, we find that this dataset is not large: it includes 45500 records (22412 attacks and 23088 normal records). The results of applying PCA to this dataset are shown in Table 11, where we reduced the original 73 features of this dataset to 20 dimensions. The computation time after dimension reduction includes the processing time of PCA, because this technique recalculates the relationship between the features to move from a multi-dimensional space to a lower-dimensional one. Therefore, every time a traffic flow passes through, the system needs to project that flow into the new space and then analyze whether it is normal or an attack. We found that, except for the Random Forest and AdaBoost algorithms, whose execution time increases significantly, the overall execution time decreases significantly. In general, the accuracy decreases after reducing the data dimension, but this is acceptable given the gain in execution time. In addition, we find that the KNN algorithm is very suitable for this training dataset, since the execution time is much faster and it still provides a relatively high accuracy.
We continue by applying Univariate Selection to the simulated dataset and the result is shown in Table 13. As before, we obtain 20 features after using Univariate Selection, which are 'ip_ttl_cv', 'ip_len_cv', 'ip_len_cvq', 'ip_ttl_cvq', 'tcp_ack_rte', 'tcp_seq_cvq', 'tcp_seq_rte', 'tcp_dataofs_median', 'tcp_dataofs_mean', 'tcp_window_median', 'dport_cv', 'tcp_window_mean', 'tcp_flags_mean', 'tcp_flags_median', 'tcp_ack_cvq', 'tcp_seq_mean', 'tcp_seq_median', 'tcp_seq_cv', 'ip_ttl_std', 'ip_len_std'. We found that on the two datasets that contain only normal and attack labels (CICIDS 2017 and the simulated traffic (Lima Filho et al., 2019)), all the proposed models performed well except for SVM. The approach could therefore provide a solution for online network intrusion detection while still giving relatively high overall accuracy. With the NSL-KDD 2019 dataset, the accuracy of classifying individual attacks with the AdaBoost algorithm is not good. Most attacks in the NSL-KDD dataset are classified poorly by the AdaBoost algorithm. Even the traffic types that AdaBoost classifies with high probability, such as Neptune, normal and pod, all score lower than with the other algorithms. Specifically, the accuracy of Neptune traffic classification by the AdaBoost algorithm is 96.83%, which is 2.81% lower than that of the KNN algorithm and 3% lower than that of the Random Forest algorithm. Thus, the classification ability of the AdaBoost algorithm on the NSL-KDD dataset is not good. The proposed model can provide high accuracy for anomaly detection, but when classifying each specific attack type, the accuracy is relatively low, although there are few false alarms. We find that the proposed system is especially good for datasets whose records are labeled as either normal or attack. The two models using KNN and Random Forest combined with feature selection techniques achieve good results in both accuracy and execution time.
Finally, we find that the combination that achieves the best results on all three datasets is the KNN algorithm with Feature Importance. The performance of the KNN algorithm is much improved since only important features are retained. Moreover, the lower the number of data dimensions, the faster KNN computes. Therefore, although the accuracy is slightly reduced, the calculation time is greatly reduced, which is an acceptable trade-off.
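The gap between high overall accuracy and weak per-attack results discussed in this section can be illustrated with a short sketch. Here per-class accuracy is taken to mean the fraction of records of each true class that are labeled correctly, which is an assumption about the metric; the labels and predictions are hypothetical and are not results from our experiments.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Fraction of records of each true class that were labeled correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical predictions: the rare "pod" class is missed entirely.
y_true = ["neptune"] * 4 + ["normal"] * 3 + ["pod"]
y_pred = ["neptune", "neptune", "neptune", "normal"] + ["normal"] * 3 + ["normal"]

by_class = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the overall accuracy is 0.75 even though one attack type is never detected, which is why per-class figures are reported alongside the overall accuracy.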

Conclusion
In this study, we have presented an empirical study of machine learning-based network intrusion detection with three feature selection algorithms: PCA, Feature Importance and Univariate Selection. Our contribution is a detailed study of how machine learning algorithms work with feature selection techniques, evaluating the accuracy of each combination of machine learning model and feature selection technique. Processing network traffic flows in an online manner is a difficult task, especially when they contain many redundant features. Moreover, we found that not all machine learning models provide results as good as those reported in previous work; therefore, we evaluated the proposed models on three benchmark datasets: NSL-KDD 2019, CICIDS 2017 and the simulated traffic (Lima Filho et al., 2019). Lastly, we conclude that the combination of KNN and Feature Importance can provide a feasible solution toward an online network intrusion detection system.