From Feature Selection to Building of Bayesian Classifiers: A Network Intrusion Detection Perspective

: Problem statement: Implementing a single or multiple classifiers that involve a Bayesian Network (BN) is a rising research interest in network intrusion detection domain. Approach: However, little attention has been given to evaluate the performance of BN classifiers before they could be implemented in a real system. In this research, we proposed a novel approach to select important features by utilizing two selected feature selection algorithms utilizing filter approach. Results: The selected features were further validated by domain experts where extra features were added into the final proposed feature set. We then constructed three types of BN namely, Naive Bayes Classifiers (NBC), Learned BN and Expert-elicited BN by utilizing a standard network intrusion dataset. The performance of each classifier was recorded. We found that there was no difference in overall performance of the BNs and therefore, concluded that the BNs performed equivalently well in detecting network attacks. Conclusion/Recommendations: The results of the study indicated that the BN built using the proposed feature set has less features but the performance was comparable to BNs built using other feature sets generated by the two algorithms.


INTRODUCTION
Intrusive attempts on computer networks have become more prominent considering the increasingly important role played by Internet systems in our daily life [1] . Individuals and organizations nowadays hardly accomplish their daily tasks without relying on the conveniences provided by computer networks and Internet technologies. Therefore, intruders could have exploited the weaknesses of such technologies to take advantage of the information gained from the individuals as well as organizations.
A series of protective measures have been taken to protect Internet systems, which includes the setting up of firewall, anti-virus software, intrusion detection systems (IDSs) and implementation of a proper security policy. IDSs are one of the mentioned measures that have received extensive attention by the public to protect their Internet systems. IDSs are used to identify, classify and possibly, to respond to benign activities [2] .
There are two basic types of IDSs, namely, Hostbased IDS (HIDS) and Network-based IDS (NIDS). The HIDS monitors activities in a computer system without considering the activities in the computer network where the computer system is located. The NIDS, as opposed to HIDS, are not concerned with activities in individual computer systems but monitor activities in the computer network(s) where the computer systems are located. Sensors of NIDS are deployed at network entry points in order to monitor traffics that traverse the networks. Both HIDS and NIDS can be implemented as passive or inline technology. The IDSs that utilize inline technology is able to prevent damages once an intrusion is found. On the other hand, IDSs that work passively typically log intrusive activities without preventing the losses caused by intruders.
There are two basic approaches for HIDS and NIDS in detecting intrusions: (1) misuse detection and (2) anomaly detection. IDSs that employ misuse detection approach detect attacks by comparing the existing signatures against the network traffics captured by the IDSs. When a match is found, the IDSs will take action as the traffics are considered harmful to computer systems or computer networks. Actions taken by the IDSs will normally include sending alerts to network administrator and logging the intrusive events. IDSs that implement misuse detection approach are, however, incapable of detecting novel attacks. The network administrator will need to update the stored signatures frequently to make sure that the IDSs perform well in detecting intrusions. IDSs that employ anomaly detection are capable of identifying novel attacks, that contain activities deviated from the norm. Such IDSs utilize the built profiles that are learned based on normal activities in computer networks. Nevertheless, false positive alarms are likely generated by the IDSs as activities in computer networks do not always follow the norm. For instance, a server in a computer network might receive an incredibly large number of connections from the public in a short period due to its interesting content.
In this study, we attempted a NIDS that employs Bayesian approach to detect intrusive activities in computer networks. Empirical evaluation was conducted to obtain optimal features to built different types of BNs by leveraging on a standard network intrusion detection dataset. In addition, stratified sampling of the standard dataset was performed to obtain four different sizes of datasets. Using the datasets, BNs built using the selected features were tested to investigate their performance in detecting intrusions in computer networks.
A study by [3] classified intrusions using both BNs and Classification and Regression Trees (CART). The features of the intrusion data were selected based on Markov Blanket of the target variables. An ensemble classifier was constructed by combining both approaches to increase robustness, accuracy and better overall generalization.
An interesting research by [4] proposed an IDS with a cooperative agent architecture. The system allows the agents to share belief on an event occurrence and perform soft-evidence update to enable a continuous scale for intrusion detection. There are three types of agents in the proposed system: system monitoring agent, intrusion monitoring agent and registry agent. The system monitoring agent is responsible for processing log data upon request and communicates with the operating system. Such agents publish their facts and beliefs derived from observations of each other. Intrusion monitoring agent, on the other hand, performs belief update based on BNs using observed values (hard evidence) and derived values (beliefs or soft evidence) from other agents as well. Using both hard and soft findings, the system is able to identify various known attacks. In the research, each intrusion monitoring agent encapsulates an Expert-elicited BN and is responsible for monitoring a particular type of intrusion. Therefore, the modification of an intrusion pattern will not affect others.
A hybrid intelligent IDS developed by [5] incorporated BN and Self-Organizing Map (SOM). In this research, SOM theory was slightly modified for the standard network intrusion dataset, which contains labels. The experimental results showed that the performance of the hybrid intelligent IDS was better compared to the non-hybrid Bayesian learning approach.
Research for comparing performance of different classifiers were conducted as well. The research by [6] has shown that Naïve Bayes Network depicts competitive results when compared to Decision Trees, despite the fact that Naïve Bayes Network works based on the assumption that all variables involved are conditional independent from each other.
A framework for an adaptive intrusion detection system was proposed by [7] using BN. In this research, any new network data that was considered intrusive by the system will be added to the dataset. The IDS was therefore, updated from time to time.
The technical report of [8] proposed a new model for intrusion detection that is able to classify new unlabeled data and allow for constant updating whenever new data is captured. The author exploited the possibility of developing the model using Partially Observable Markov Decision Process (POMDP).
Session Anomaly Detection (SAD) was proposed by [12] , which utilized Bayesian parameter estimation method to analyze web logs and detecting anomalous sessions generated by the Whisker and Nimda worms. SAD functions by developing a normal usage profile and compared it to the generated web logs against the expected frequency. The study reported that SAD performed better than SNORT, which used misuse detection technique.
A study by [13] proposed a method to effectively analyze data that were collected by the distributed IDS based on Bayesian Multiple Hypothesis Tracking (BMHT), so that the related incidents can become apparent. As discussed in [10] , most of the existing research works concentrate only on a network that the IDS want to protect and therefore only the information of attack activities that occurred in the network will be gathered. To have a complete view of an intruder's action, the author suggested an approach in gathering data from more than one network via IDSs. The BMHT is used to reorganize network data so that a better view of the activities occurring in the networks can be obtained.
The above mentioned research works reported various network intrusion detection methods, which utilized a single type of BN or a BN is used together with other classifier in building a better IDS. However, these research works did not evaluate the performance of different types of BN before deciding to use either one of them. Therefore in this study, we investigated how different types of BN perform in identifying various types of attacks. Two known types of BN and a BN crafted based on the domain knowledge on attacks were built and evaluated.
Bayesian networks: BN is a prevailing method for dealing with uncertainty in real-world decision making and it has been applied to various research domains successfully. There are major advantages of using BN in various research domains. A research domain can be understood well as the BN structure provides explicit inter-relationships among the data set attributes. Besides, methods are provided for handling missing data and to prevent over-fitting of data. Data and domain knowledge can be combined because a BN model has both a causal and probabilistic semantics [14] . Human interventions are allowed to modify the BN to increase the performance of the predictive model. Furthermore, the expert-elicited network can be further enhanced using probability learning and network learning method to achieve higher accuracy of prediction. Adding decision node and utility node to the network will extend the capability of a BN for decision analysis.
BN is a Directed Acyclic Graph (DAG) and its structural representation is represented by nodes that correspond to random variables in a problem domain. Arcs in a BN represent causality or influential relationship between parent nodes and child nodes. Nodes in the BN contain states of random variables. As shown in Fig. 1, the BN is structured in such a way that only the node C has Conditional Probability Table  (CPT) given its parents. Nodes A and B have only prior probability tables since they do not have any parent node. The CPT describes the strength between the parent node A and the child node C as well as the parent node B and the child node C. Assuming that all the nodes in Fig. 1 have two states, thus the CPT for node C has a 2 3 = 8 probability value entries.

Fig. 1: A simple BN
Consider a BN with n nodes, with X representing random variables and x denotes the states of the random variables. The joint distribution is presented by P(X 1 = x 1 , X 2 = x 2 , …, X n = x n ), or in a more compact way, P(x 1 , x 2 , ...,x n ). The graph specifies a factorization of the joint probability distributions based on the chain rule: A BN can be described via qualitative and quantitative components. The qualitative component is presented by the structure while the quantitative component, through its CPT. Posterior probabilities of query variables can be calculated in light of any evidence by having both the qualitative and quantitative representation of BN. By using Bayes's rule and an inference algorithm, BN can be used to perform diagnostic, predictive and inter-causal reasoning, as well as any combination of the above [15] .
There are three basic types of BN classifiers, namely, Naive Bayesian Classifier (NBC), Learned BN and Expert-elicited BN. The NBC is the simplest BN model that consumes low computational power. The NBC has child nodes where they all share the same and single parent node. The NBC assumes conditional independence for the child nodes.
There are two steps involved in building a Learned BN. Firstly, the DAG has to be induced using existing algorithms such as, PC, K2 and NPC. Secondly, the parameters as defined by the DAG have to be estimated. Parameter estimation can be conducted using algorithm such as Expectation-Maximization (EM).
Besides constructing BNs using existing machine learning algorithms, a BN can be constructed manually by eliciting knowledge of a domain expert. The construction process is a repetitive process, which involves model verification and model revision.
There are basically three categories of variables, namely, problem variables, information variables and mediating variables to be identified by domain experts in constructing a BN manually. Problem variables are related to classification, which in this study, classify intrusions in computer networks. Information variables, on the other hand, provide information relevant to classifying network intrusions. The features of the dataset we used in this study will be served as evidence for classifying intrusions. The information variables can be further divided into two sub-categories namely, background information variables and symptom information variables. Background information is the information available before the problem exists whereas symptom variables can be viewed as consequences after the occurrence of the problem. Since the background information came before the problem, thus, background information variables will be the root of a DAG. The mediating variables serve as unobservable variables, which are used to counter the dependency of two or more information variables for solving the problem [16] . The causal relations of the variables are as shown in Fig. 2.
DoS attacks are performed to a host by using up its resources so that it will not be able to provide network service to the legitimate users. DoS attacks are most feared as such attacks do not require intruders to access to a victim machine. Performing DoS attacks can be as simple as running a script or a tool. There are many types of DoS attacks. Smurf attack is one of its many types. By performing Smurf attack, an intruder sends large amount of spoofed Internet Control Message Protocol (ICMP) messages to broadcast addresses of a computer network. Hosts in the computer network will reply the ICMP messages and this will eventually multiply the network traffics in the computer network. A computer network can be saturated if such network traffics are huge in number.
Probing normally precedes an actual access or DoS attack. Probing can be performed by utilizing freely available tools in the Internet such as Nmap, so that vulnerabilities of a particular host or a computer network can be found. Such tool can be used to ping sweeps a computer network to generate a list of potential victim machines. Port scanning can then be performed on any of the machine in the list to find out the ports or services that are currently active. Intruders can soon send queries to gather information such as application type, version of the application or probably operating system to figure out the possible vulnerabilities to be exploited. R2L attacks are conducted by sending packets to a targeted machine in a computer network to gain access as if the intruders own an account in the targeted machine. R2L attacks can be performed in many forms. It takes advantage of weakly configured security features, perform buffer overflow attacks and guess or capture password of hosts in computer networks. Whereas for U2R attacks, a local user may exploit flaws in poorly designed systems so that root level privileges can be obtained [27] .

MATERIALS AND METHODS
Pre-processing the dataset: The standard dataset used for network intrusion detection domain was a result of a DARPA intrusion detection evaluation program [28] . It consists of 494,021 records with 41 features and each of the records is labeled with a class Normal or any of the 22 types of attacks. One of the records was however removed due to errors.
The 22 types of attacks were later being categorized into four attack categories. The reason to categorize the attacks into four attack categories is to ease the classification tasks in the later stage as some of the attacks consist of only a few records. Nevertheless, unevenly distributed number of records could still be seen after categorization as illustrated in Fig. 3. Attack category such as U2R consists of only 52 records while DoS consists of nearly 0.4 millions of records. Consequently, classification accuracies of category such as U2R might be affected. However, better classification accuracies will be obtained in handling four attack categories rather than handling 22 types of attacks.  Fig. 4. As shown in Fig. 5, the records of the standard dataset were randomized and values of each of the features of the records were discretized at the preprocessing stage (line 8 and 9). Special characters for instance, "\" and "_" were seen after discretization. The special characters will increase the size of the dataset and consequently increase the computational cost in processing the dataset. Removal of these special characters is thus necessary (line 10).

Feature selection approach:
The number of features required is another major concern in processing the dataset as well. As the number of features increase, the relationships among the features as well as the relationships between features and classes will become very complex. High computational cost will inevitably be needed in processing such complex relationships. It is thus necessary to undergo a feature selection stage to obtain an optimal feature set with less number of features but able to provide high detection accuracies.
We proposed a novel feature selection approach in which the decision of feature selection algorithms and opinion of experts were incorporated. In our approach, two filter-based feature selection methods were used to confirm important features of the dataset. Additional features which are considered important by the domain expert were added to identify network intrusions.
As shown in Fig. 5, two filter-based feature selection methods were utilized at the feature selection stage to produce two feature sets (FS 1 and FS 2 ) (line 12 and 13). Correlation-based Feature Selection Subset Evaluator (CFSE) and Consistency Subset Evaluator (CSE) were utilized by these two feature selection methods. CFSE uses an algorithm that works together with an evaluation formula, in which the ideas are based on test theory. Good features are then selected with an appropriate correlation measure and a heuristic search strategy. The algorithm has the advantages in identifying irrelevant, redundant and noisy features fast. Relevant features can be identified as long as their relevance does not strongly depend on other features [29] . On the other hand, inconsistency of a feature set class given different class labels is measured by CSE. The algorithm involved is monotonic and has the advantage of removing redundant or irrelevant features fast. It is also multivariate and able to handle noises in dataset [30] .
Confirmation of important features was done by extracting the shared features of these two feature sets to form a shared feature set (FS 3 ). These two feature sets were then combined without repeating the same features to generate a combined feature set (FS 4 ) (line 16 and 17).
The neglected features (F n ) related to Probe, R2L and U2R attacks were selected by domain experts and added one by one into the shared feature sets to form the proposed feature set (FS 5 ) (line 20-27). As the numbers of records of these attacks were relatively small compared to DoS and Normal, thus classification accuracies were expected to be low. Intervention of domain expert might help in this case. Considering the characteristics of probe attacks, features such as dst_host_count and dst_host_rerror_rate needed to be added. dst_host_count was selected among the neglected features as the Probe attacks involved a large number of connections to a same destination host.   Fig. 6: The same structure of NBC was used for the four datasets dst_host_rerror_rate was considered as well because certain probing attacks have larger time interval in scanning hosts or ports. These features are formed using a connection window that consists of 100 connections. root_shell was included to detect U2R and R2L attacks, which involve unauthorized access to a machine. The finalized features for proposed feature set and other feature sets are as shown in Table 1.
Explanation of the features is given in Table 2. Five independent datasets were formed based on the features of these five feature sets. BNs were built using K2 algorithm and 10-fold cross validation was conducted to evaluate the BNs' classification accuracies. The feature set with optimal performance will be selected for the next experiment.

Constructing BNs as classifiers to intrusion detection:
In the next experiment, performances of different types of BNs were evaluated. The dataset was re-sampled to provide another three sample datasets in different sizes (75, 50 and 25 of the standard dataset). The re-sampling was done to produce sample datasets that have the same class distribution as the original dataset.
The intrusion detection stage involved three BN classifiers, namely, NBC, Learned BN and Expertelicited BN. The optimal feature set decided in the previous experiment was used to construct the BNs. The NBC is made simplified by assuming the variables are conditional independence of each other (Fig. 6). The Learned BN can be constructed using a few existing search algorithms. Experiment was conducted based on the datasets in order to choose an algorithm that has the optimal performance.
On the other hand, the Expert-elicited BN allowed researchers to incorporate expert views into it. To construct an Expert-elicited BN, various types of variables need to be identified. Intrinsic features of the resulted dataset such as service, dst_bytes and service existed in raw dataset. Thus, they would be treated as the background variables in constructing the BN. Classes of various types of intrusions (DoS, Probe, R2L and U2R) will be represented using a problem variable. There are features in the dataset, which are derived from raw dataset. Features such as logged_in, count, dst_host_count, dst_host_rerror_rate and root_shell were formed based on knowledge of the domain and they were treated as symptom variables. The symptoms variables served as evidences for classifying the intrusions. Mediating variables were not in our consideration in constructing the BN as the variables in this study are observable. The Expert-elicited BN is as shown in Fig. 7. The root of the BN was the background variables as they have direct influence on the problem variables. The domain experts incorporated their views regarding the attacks by refining and verifying the parameters of the nodes of the Expertelicited BN.

RESULTS
The first experiment was conducted to select an optimal feature set while the second experiment involved a selection of search algorithm for building Learned BN. Various types of BNs were then constructed based on the optimal feature set and the performances of these BNs were compared in the last experiment.
The results of the first experiment showed that the performances of the feature sets were comparable to each other except the shared feature set ( Table 3). The performance of the BN built using the shared feature set was poor especially in Probe, R2L and U2R attacks (63.3, 33.8 and 23.1%).
To compare the performances of the BNs built using the proposed feature set and other feature sets, an independent-samples t-test was conducted. The result in Table 4 shows that there was no significant difference   between the proposed feature set and other feature sets except shared feature set. The poor performance of the BN built in Probe, R2L and U2R was the cause of the significant difference between the proposed feature set and shared feature set.
A selection was then conducted to select a search algorithm with optimal performance for building Learned BN. Despite many more search algorithms, the selected search algorithms K2, Hill-climber and Tabu [31][32][33] were identified because they are computationally feasible in processing huge dataset compared to others.
As shown in Table 5, the Tabu search algorithm performed relatively weak in classifying R2L and U2R attacks compared to others irrespective to the sizes of dataset. We thus compared only the performance of K2 and Hill-climber search algorithms. An independentsamples t-test was conducted as well. The results show that there was no significant difference between the BNs built using K2 and hill-climber search algorithms (Table 6). K2 was selected to build Learned BN and later compared with the performances of other types of BNs (Fig. 8). Table 7 shows the classification accuracies of different BNs based on various sizes of dataset. We conducted an independent-samples t-test and no difference in accuracy between the BNs regardless of the sizes of the dataset (Table 8).         The results of three BNs were used to compare with the results of the researchers that prepared the dataset [34] . The BNs performed better in DoS, R2L and U2R categories and give comparable results in Probe category ( Table 9).
As shown in Table 10, NBC was built and tested with the least requirement of time in almost all the datasets regardless of its size. However with full dataset, it required a slightly more time in network building compared to Expert-elicited BN.   The BNs are able to perform well in identifying DoS and Normal attacks ( Fig. 9a and b). The BNs gave good performance as well in probe and R2L attacks although the number of records involved was medium as compared to DoS and normal.
Although the differences in terms of classification accuracy were not significant, classification accuracies of Learned BN for Probe category, especially involving small datasets, were slightly lower as compared to other BNs (Fig. 9c). The classification accuracies of NBC for R2L were slightly higher in all sizes of datasets as compared to others (Fig. 9d). The Expert-elicited BN was able to give a better performance as compared to others in this attack category. A difference of 29.4% in classification accuracy was observed in quarter dataset (Fig. 9e).

DISCUSSION
The proposed feature set generated in the first experiment has fewer features compared to other feature sets. This has suggested that the required computational cost in processing network data can be reduced while preserving the classification accuracy.
In the second experiment, K2 was selected as the search algorithm to build Learned BN considering that it exhibited no difference in terms of performance as compared to Hill-climber search algorithm.
The subsequence experiment also indicated that any type of BN is suitable for identifying attacks utilizing the standard dataset regardless of its sizes. However, NBC can be utilized in identifying attacks if time for building network and testing is a factor.
The huge number of DoS and Normal records in the dataset was the reason for high and consistent classification accuracies of the BNs. The number of records for U2R attacks was very small in the dataset. Classification accuracies were affected and therefore the performances of BNs in the attack category were not that promising. Nevertheless, the results also indicated NBC and Expert-elicited BN demonstrated advantages in identifying R2L and U2R attacks, respectively.

CONCLUSION
In this study, we propose a novel approach for selecting features and comparing the performance of various BN classifiers utilizing the standard network intrusion dataset. The feature selection approach required two feature selection algorithms to confirm the important features of the dataset and later intervention of domain experts to form a proposed feature set. The interventions of domain experts in feature selection has proven to be useful as less features were generated compared to other feature sets produced by the two feature selection algorithms. Moreover, the performance of the BN built by the proposed feature set was comparable to others. Empirical experiment conducted in this research also indicated that any of these three BNs can be used in identifying attacks utilizing the standard network intrusion dataset. The performances of the BNs are comparable to each other. However, NBC should be considered if the dataset involved is huge in size and time for building network and testing is the main factor to consider.
We are considering to implementing multiple-BN classifiers in the future by looking at the advantage of the BNs in identifying certain type of attacks. It was observed that even though the number of records is low for an attack category, the classification accuracies can be maintained or improved with the intervention of domain experts. As such, the Expert-elicited BN should be improved so that it will work better if multiple classifiers or a single classifier is implemented in NIDS.