MCMC-Fuzzy: A Fuzzy Metric Applied to Bayesian Network Structure Learning

Bayesian network structure learning is a complex task because the number of possible structures grows exponentially with the number of variables. Two main methods are used for Bayesian network structure learning: conditional independence, in which a structure is built to be consistent with independence tests performed on the data; and heuristic search, which explores the space of structures. Hybrid algorithms combine both methods. In this study, we propose combining common metrics used to evaluate Bayesian structures into a fuzzy system, on the premise that different metrics evaluate different properties of the structure. The proposed fuzzy system is then used as a metric to evaluate Bayesian network structures in a heuristic search algorithm based on Markov Chain Monte Carlo. The algorithm was evaluated on synthetic databases by comparing its learned structures and processing time with those of other algorithms. Results show that, despite an increase in processing time, the proposed method improved the structure learning process.


Introduction
Bayesian networks are probabilistic models that represent knowledge under uncertainty. They are used in several areas, such as behavior prediction, natural language processing and robotics (Friedman and Koller, 2003).
Bayesian networks have two main components: parameters and structure. Parameters define the conditional probabilities between the variables, or nodes. Structure defines the network topology, that is, which nodes are connected and in which direction (Castillo et al., 2012).
Domain experts are often needed to define Bayesian network structures. However, this can be costly, complex and time-consuming due to the number of variables, the incompleteness of data and the difficulty of maintaining the structure, which makes the process impracticable in several cases (Scutari and Denis, 2014). Therefore, significant research effort has been invested in learning Bayesian network structures from data. Both components can be learned from data. Parameter learning is considered a simple task when the structure of the network is well defined. Structure learning, on the other hand, is an NP-hard problem (Chickering, 1996).
The literature shows two main methods for learning Bayesian network structures. The first uses conditional independence tests on the data to find a structure consistent with the observed independencies. The problem with this approach is the exponential number of independence tests required (Margaritis and Thrun, 1999). The second defines a function that evaluates how well a structure represents the data and searches for the simplest structure that increases the value of this function (Ko and Kim, 2014). Algorithms of this kind explore the search space, assessing candidate structures with score metric functions. A problem with this approach is that the search space contains all possible structures (Yan and Cercone, 2010). Some algorithms, called hybrid, combine both approaches (Margaritis and Thrun, 1999). The most common hybrid approach is to use independence tests to restrict the search space and then apply a heuristic search method (Tsamardinos et al., 2006).

The research question posed for this work is:
To what extent does the evaluation of different properties of Bayesian networks improve the structure learning process in a heuristic search algorithm? To tackle this problem, we propose a fuzzy system that combines common evaluation metrics. This approach draws inspiration from Morales et al. (2004). Fuzzy systems are flexible tools capable of combining different functions into one through a set of rules (Brooks et al., 2011). Moreover, fuzzy systems have been widely used in many fields, including the Bayesian network domain, for structure learning (Morales et al., 2004) and inference (Yang, 1997).
The main contribution of this work is the development of a fuzzy system that combines different common metrics to evaluate Bayesian network structures. The metric is then applied to a Markov Chain Monte Carlo search algorithm in the process of learning Bayesian network structures. Even though this work applies the metric to an MCMC algorithm, the metric is generic enough to be applied to other search algorithms. The proposed approach is evaluated by comparing its results to the results of other state-of-the-art algorithms in the context of two synthetic databases.
The article is thus presented as follows: Section II presents the related work. Section III presents the background knowledge on Bayesian network structure learning. Section IV presents the proposed fuzzy metric that is applied to a Markov Chain Monte Carlo algorithm. Section V presents the evaluation and Section VI concludes the article.

Related Work
Bayesian networks are widely used, and thus automatically learning these networks from data is a very active research field (Chickering, 2003; Tsamardinos et al., 2006; Friedman and Koller, 2003; Vafaee, 2014; Guo and Li, 2009). This literature review considered articles published in digital libraries. The databases used were IEEE Xplore, ACM Digital Library, SpringerLink and ScienceDirect. The selected articles are presented below. They were selected based on their importance and on current trends in the field. The review considered papers published between 1995 and 2017.
The research presented in this paper focuses on heuristic search algorithms. Therefore, the articles presented in this section fall under the heuristic search approach. Table 1 presents a summary of the approaches analyzed in this paper. Many of these approaches apply genetic and greedy algorithms to learning Bayesian network structures.
Greedy algorithms find a local optimum, hoping that it also represents the global optimum of the problem. They usually perform well; however, they are highly dependent on their initial state. The K2 algorithm (Heckerman et al., 1995) is one of the best-known greedy algorithms in Bayesian structure learning. Its main drawback is that the order of the nodes is required as a parameter. Ko and Kim (2014) proposed an algorithm to define this parameter. Chickering (2003) proposed the Greedy Equivalence Search (GES) algorithm, which has since been extended several times. Nielsen et al. (2002) added randomness to this algorithm in an attempt to escape from local optima. The extension of GES proposed by Alonso-Barba et al. (2011) aimed at improving its performance. Scanagatta et al. (2015) proposed a greedy algorithm designed for large networks.
Genetic algorithms are inspired by natural evolution: the fittest individuals are selected in each iteration to produce the next population. Examples include a genetic algorithm that also uses a fuzzy system (Morales et al., 2004). Another genetic algorithm applies the K2 scoring function to evaluate the produced structures (Faulkner, 2007). There is also a genetic algorithm focused on learning large structures (Vafaee, 2014).
Hybrid algorithms combine heuristic search with conditional independence tests on the data. For example, Zhang et al. (2013) proposed using conditional independence tests to construct an undirected graph in order to limit the search space. The resulting graph is used as input to a greedy algorithm that determines the direction of the relations. See Table 1 for other algorithms that apply different approaches to limit the search space of Bayesian structures.
Markov blanket algorithms use techniques to identify the Markov blanket of each node. The Markov blanket of a variable X is the subset of variables that, if observed, makes X conditionally independent of all other variables (Margaritis and Thrun, 1999). These algorithms propose different approaches to define the Markov blanket, which is then used to define the Bayesian network structure (Margaritis and Thrun, 1999; Aliferis et al., 2010; Sechidis and Brown, 2015). Existing algorithms include the work of Pellet and Elisseeff (2008), who proposed the use of feature selection algorithms to define Markov blankets. Zhu and Yang (2014) decompose the BN structure using Markov blankets and then define the orientation of the arcs using the K2 scoring function.
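When the structure is already known, the Markov blanket of a node can be read directly from the graph: it consists of the node's parents, its children and the other parents of its children. The following minimal sketch illustrates this for a small hypothetical network; the function name `markov_blanket`, the parent-map representation and the five-node example are assumptions made for illustration and do not come from any of the cited algorithms, which must instead estimate the blanket from data.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Return the Markov blanket of `node` in a DAG given as a parent map."""
    # Children are the nodes that list `node` among their parents.
    children = {v for v, ps in parents.items() if node in ps}
    # Spouses are the other parents of those children.
    spouses = {p for child in children for p in parents[child]}
    blanket = set(parents[node]) | children | spouses
    blanket.discard(node)  # a node is never part of its own blanket
    return blanket


# Hypothetical five-node DAG: A -> C <- B, C -> D <- E
dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C", "E"}, "E": set()}
print(sorted(markov_blanket(dag, "C")))  # ['A', 'B', 'D', 'E']
```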
Markov Chain Monte Carlo (MCMC) algorithms explore the search space by sampling from a probability distribution. MCMC methods apply correlated random sampling to move along the chain (Friedman and Koller, 2003). One example first uses MCMC to find the order of the nodes and then searches for the structure of the network using that order (Friedman and Koller, 2003). Grzegorczyk and Husmeier (2008) introduced the remove-arc modification operation into the MCMC method. Guo and Li (2009) combined the Expectation-Maximization algorithm with the MCMC algorithm. Niinimaki et al. (2012) proposed another algorithm over the order of the nodes, similar to the one proposed by Friedman and Koller (2003). Masegosa and Moral (2013) combined stochastic search and MCMC sampling. Su et al. (2014) proposed the incorporation of external knowledge into the MCMC algorithm. Su and Borsuk (2016) used Markov blanket resampling as a step of the MCMC algorithm.
Table 1 summarizes the main approaches applied in Bayesian network structure learning. The table also shows how the MCMC method is used in recent and relevant studies on Bayesian network structure learning.
The proposed fuzzy metric draws inspiration from the genetic algorithm proposed by Morales et al. (2004), which also uses a fuzzy metric to evaluate the structures. In this work, we propose a similar fuzzy metric applied to an MCMC heuristic search algorithm for learning Bayesian network structures from data.

Bayesian Network Structure Learning
Bayesian networks can be modeled by domain experts or learned from data. As mentioned before, modeling Bayesian networks with experts can be costly, complex and time-consuming due to the number of variables and the incompleteness of the data, among other factors. Structure learning generates a structure based on evidence found in the data. There are two main approaches to Bayesian network structure learning.
The method based on conditional independence relies on the property that each variable is independent of its non-descendants, conditioned on its parents. This means that every Bayesian network represents a unique joint probability density function that can be factorized (Carvalho and Chiann, 2013).
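The factorization implied by this property can be written explicitly. In the expression below, $X_1, \dots, X_n$ denote the variables of the network and $\mathrm{Pa}(X_i)$ the set of parents of $X_i$ in the structure; this notation is assumed here for illustration.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)
```

For instance, a three-node chain $A \rightarrow B \rightarrow C$ factorizes as $P(A)\,P(B \mid A)\,P(C \mid B)$.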
The second most common method found in the literature applies heuristic search to generate the structure that best represents the data according to a score metric (Daly et al., 2011). K2 is an example of an algorithm that applies heuristic search in the process of learning Bayesian structures from data. Some of the main metrics used by heuristic search algorithms are AIC, MDL and BDe.
The Akaike Information Criterion (AIC) is based on two terms: one controlling entropy, based on conditional entropy, and another controlling the complexity of the structure. In information theory, entropy is a non-negative value that measures uncertainty, tending to zero when knowledge is high. The AIC metric is due to Akaike (1974).
The Minimum Description Length (MDL) metric uses the same terms as the AIC metric, with a small difference in the second term. This metric is known for finding simpler Bayesian network structures than AIC. The MDL metric is due to Bouckaert (1993).
The Bayesian Dirichlet equivalence (BDe) metric maximizes the likelihood of the structure given the data; that is, the metric uses the conditional likelihood of each variable in the network. The BDe metric is due to Heckerman et al. (1995).
The AIC metric analyzes network information. The MDL metric analyzes complexity, while the BDe is based on probability.
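For reference, the three metrics are commonly written as shown below. This is a sketch in a notation often used for local score metrics and is not necessarily the exact form adopted by the authors. The symbols are assumptions introduced here: $N$ is the number of records, $r_i$ the number of states of variable $X_i$, $q_i$ the number of configurations of its parents, $N_{ijk}$ the number of records with $X_i$ in state $k$ and its parents in configuration $j$, $N_{ij} = \sum_k N_{ijk}$, and $N'_{ijk}$, $N'_{ij}$ the corresponding Dirichlet prior counts. Sign conventions differ between references, so whether a given score is minimized or maximized depends on the form chosen.

```latex
\begin{align}
H(B, D) &= -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i}
           \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}},
\qquad
K = \sum_{i=1}^{n} (r_i - 1)\, q_i \\
Q_{\mathrm{AIC}}(B, D) &= H(B, D) + K \\
Q_{\mathrm{MDL}}(B, D) &= H(B, D) + \frac{K}{2} \log N \\
Q_{\mathrm{BDe}}(B, D) &= P(B) \prod_{i=1}^{n} \prod_{j=1}^{q_i}
   \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})}
   \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}
\end{align}
```

In this form, AIC and MDL share the entropy term $H(B, D)$ and differ only in how the complexity term $K$ is weighted, which matches the description of the two metrics given above.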

Fuzzy Metric
In this section, we propose a fuzzy metric that combines different common metrics for Bayesian network structure learning. The proposed fuzzy score metric is applied to a heuristic search Markov Chain Monte Carlo method, which we call MCMC-Fuzzy. The membership functions and fuzzy rules were defined based on the properties of the metrics used by the fuzzy system and by experimenting with different configurations. They were also based on the work of Morales et al. (2004).
Markov Chain Monte Carlo (MCMC) is a method that uses approximation by sampling. The goal is to generate a Markov chain whose limiting distribution is the desired distribution. The method draws random samples from probability distributions that are difficult to sample from directly. Starting from any point, the chain gets closer to its equilibrium distribution as the number of samples increases (Brooks et al., 2011).
The workflow of the MCMC-Fuzzy algorithm is presented in Fig. 1.

Fig. 1: MCMC algorithm workflow
Given a data set, the first stage of the algorithm generates a random Bayesian network structure, which is used as the initial state of the Markov chain. In the next step, two variables are randomly sampled from the variables in the data set. To select a structure modification, a random number is drawn from a uniform distribution. The possible modifications are adding, removing or inverting an arc between the two sampled variables. Since a Bayesian network structure cannot contain cycles, the algorithm checks that the modified structure is acyclic before evaluating it.
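The proposal step described above can be sketched as follows. This is an illustrative reading of the workflow, not the authors' implementation: the names `propose` and `has_cycle`, the child-set graph representation and the equal one-third probability for each modification are all assumptions, and a proposal that would create a cycle is simply discarded in favor of the current structure.

```python
import random
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # each node maps to the set of its children


def has_cycle(graph: Graph) -> bool:
    """Return True if the directed graph contains a cycle (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in graph[node]:
            if colour[child] == GREY:
                return True  # back edge: child is already on the current path
            if colour[child] == WHITE and visit(child):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


def propose(graph: Graph, rng: random.Random) -> Graph:
    """Apply one random arc modification; keep the old structure if it creates a cycle."""
    candidate = {node: set(children) for node, children in graph.items()}
    a, b = rng.sample(sorted(candidate), 2)  # two distinct variables
    u = rng.random()                         # uniform draw selects the modification
    if u < 1 / 3:
        candidate[a].add(b)                  # add arc a -> b
    elif u < 2 / 3:
        candidate[a].discard(b)              # remove arc a -> b, if present
    elif b in candidate[a]:
        candidate[a].remove(b)               # invert arc a -> b into b -> a
        candidate[b].add(a)
    if has_cycle(candidate):
        return {node: set(children) for node, children in graph.items()}
    return candidate


rng = random.Random(0)
structure: Graph = {"A": set(), "B": set(), "C": set()}
for _ in range(20):
    structure = propose(structure, rng)
print(structure, has_cycle(structure))
```

Checking acyclicity on a copy keeps the current state of the chain untouched until the candidate is known to be a valid structure.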
Usually, Bayesian network structures are evaluated using a single metric, as is the case in the EM-MCMC algorithm. Score metrics are known to evaluate Bayesian structures differently; for example, some metrics favor more complex structures, while others prefer structures with fewer parameters (Su and Borsuk, 2016). In this work, we present an approach that combines known metrics through a fuzzy system to evaluate Bayesian network structures. This metric can be used to evaluate the structures induced by any algorithm that has an evaluation step.
Fuzzy logic provides a way of combining such distinct metrics into one. The metrics used in the proposed fuzzy system are AIC, MDL and BDe. The fuzzy metric thus has four variables: three input variables and one output variable, named Quality (Table 2).
Each fuzzy set is defined by a membership function. The membership functions were defined using the highest value among all metrics, considering both a completely connected network structure and the same structure when completely disconnected. All sets were defined uniformly over the [0,1] interval. Figure 2 presents the membership functions for the metrics, in which x represents the value of a metric and µ(x) its degree of membership.
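A uniform partition of the [0,1] interval can be illustrated with a short sketch. The triangular shape, the function names `triangular`, `uniform_partition` and `fuzzify`, and the example labels are assumptions made for illustration; the source only states that the sets are defined uniformly over [0,1], and the actual shapes are those shown in the figures.

```python
from typing import Dict, List, Tuple

Triangle = Tuple[float, float, float]  # (left foot, peak, right foot)


def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Degree of membership of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def uniform_partition(labels: List[str]) -> Dict[str, Triangle]:
    """Place one triangular set per label, with peaks evenly spaced on [0, 1]."""
    step = 1.0 / (len(labels) - 1)
    return {
        label: ((i - 1) * step, i * step, (i + 1) * step)
        for i, label in enumerate(labels)
    }


def fuzzify(x: float, sets: Dict[str, Triangle]) -> Dict[str, float]:
    """Membership degree of x in every set of the partition."""
    return {label: triangular(x, *abc) for label, abc in sets.items()}


# Example labels only; a normalized metric value of 0.5 belongs partly to two sets.
SETS = uniform_partition(["Poor", "Average", "Good", "Excellent"])
print(fuzzify(0.5, SETS))
```

With evenly spaced triangles, the membership degrees of any value in [0,1] sum to one, so a value is always shared between at most two neighboring sets.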
The membership function for the Quality variable is presented in Fig. 3.
The next step in building the fuzzy system was the definition of the fuzzy rules. The MDL and BDe metrics are minimized and AIC is maximized. Thus, the BDe metric influences the Poor, Average and Good fuzzy sets of the Quality variable, while the MDL and AIC metrics influence the Average, Good and Excellent fuzzy sets of the same output variable. Figure 4 presents the rules used in the proposed fuzzy metric. The metrics are aggregated into the Quality variable by truncation. The final step is to defuzzify the Quality output variable, for which the proposed approach uses the centroid method.
The algorithm continues until the specified number of iterations is reached. The algorithm pseudocode, which is based on the MCMC Metropolis-Hastings algorithm (Chib and Greenberg, 1995), is presented in Fig. 5. The algorithm starts by generating a random structure. Line 3 calculates the initial network score using the proposed fuzzy metric. Line 4 stores this value as the best score. Then, the iterative process starts.
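The truncation and centroid steps of the fuzzy metric can be sketched as follows. The sketch assumes that the firing strength of each Quality set has already been obtained from the rules, that the output sets are triangles with evenly spaced peaks, and that clipped sets are combined with the maximum operator; the names `QUALITY_SETS` and `defuzzify` and the numeric integration grid are hypothetical, and the rule base itself is not reproduced here.

```python
from typing import Dict, Tuple

Triangle = Tuple[float, float, float]  # (left foot, peak, right foot)

# Assumed output sets for Quality: triangles with peaks evenly spaced on [0, 1].
QUALITY_SETS: Dict[str, Triangle] = {
    "Poor": (-1 / 3, 0.0, 1 / 3),
    "Average": (0.0, 1 / 3, 2 / 3),
    "Good": (1 / 3, 2 / 3, 1.0),
    "Excellent": (2 / 3, 1.0, 4 / 3),
}


def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Degree of membership of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def defuzzify(strengths: Dict[str, float], steps: int = 1000) -> float:
    """Truncate each output set at its rule strength, aggregate, return the centroid."""
    numerator = denominator = 0.0
    for i in range(steps + 1):
        x = i / steps
        # Truncation: each set is clipped at its firing strength (min),
        # and the clipped sets are combined with max.
        mu = max(
            min(strengths.get(label, 0.0), triangular(x, *abc))
            for label, abc in QUALITY_SETS.items()
        )
        numerator += x * mu
        denominator += mu
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical firing strengths produced by the rules for one candidate structure.
print(defuzzify({"Average": 0.2, "Good": 0.7, "Excellent": 0.4}))
```

The centroid returns a single crisp Quality value in [0,1], which is what allows the fuzzy system to be used as a score by a search algorithm.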
Lines 6 and 7 randomly select two nodes. Line 8 selects one of the possible modifications (adding, removing or inverting an arc). The modification is performed in Line 9. Line 10 calculates the score of the new network with the random modification. Line 11 checks whether a random number drawn from a uniform distribution is smaller than the minimum between 1 and the exponential of the difference between the scores. If it is, the structure is accepted as the new best structure sample.
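The test in Line 11 represents the acceptance probability of the chain, given by the equation below (Friedman and Koller, 2003):

$$A(S \rightarrow S') = \min\left\{1, \exp\big(Q(S') - Q(S)\big)\right\}$$

where $S$ is the current structure, $S'$ is the modified structure and $Q(\cdot)$ is the score of a structure.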
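The iterative process and its acceptance test can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the pseudocode of Fig. 5: higher scores are taken to be better, the difference is computed as candidate minus current, the current sample and the best sample are tracked separately, and the names `accept_probability` and `mcmc_search` as well as the toy integer state are hypothetical. In the actual algorithm, `score` would be the fuzzy metric and `propose` the arc modification step.

```python
import math
import random
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")


def accept_probability(current_score: float, candidate_score: float) -> float:
    """min(1, exp(candidate - current)), assuming higher scores are better."""
    if candidate_score >= current_score:
        return 1.0  # also avoids overflow in exp for large differences
    return math.exp(candidate_score - current_score)


def mcmc_search(
    initial: State,
    score: Callable[[State], float],
    propose: Callable[[State, random.Random], State],
    iterations: int,
    rng: random.Random,
) -> Tuple[State, float]:
    """Run the chain for a fixed number of iterations and return the best state seen."""
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for _ in range(iterations):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        # Accept when a uniform draw falls below the acceptance probability.
        if rng.random() < accept_probability(current_score, candidate_score):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy stand-in for structures: integer states with a hypothetical score peaking at 3.
best, best_score = mcmc_search(
    initial=0,
    score=lambda x: -abs(x - 3) / 10,
    propose=lambda x, r: x + r.choice((-1, 1)),
    iterations=500,
    rng=random.Random(1),
)
print(best, best_score)
```

Because worse candidates are still accepted with a non-zero probability, the chain can leave local optima, which is the main difference from a purely greedy search.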

Evaluation
MCMC-Fuzzy was applied to two synthetic databases. These databases are commonly used to evaluate Bayesian network learning algorithms, which makes comparison between methods possible. They are also freely available.
The first synthetic database, called Asia, represents patient diagnosis in a hospital emergency room. The Asia dataset has 5000 rows. The Asia gold standard network has 8 nodes, 8 arcs and 18 parameters. The second database is called Alarm. The Alarm dataset has 20000 rows. The Alarm gold standard network has 37 nodes, 46 arcs and 509 parameters. Both datasets are composed of categorical data, which is common for Bayesian networks. We refer the reader to Scutari and Denis (2014) for a full description of these databases.
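The node, arc and parameter counts quoted for Asia can be checked with a short calculation. The sketch below assumes the commonly distributed Asia structure with eight binary variables; the descriptive node names and the function `count_parameters` are chosen here for illustration. The number of free parameters of a node is its number of states minus one, multiplied by the number of configurations of its parents.

```python
from typing import Dict, List


def count_parameters(parents: Dict[str, List[str]], states: Dict[str, int]) -> int:
    """Number of free parameters: sum over nodes of (r_i - 1) * q_i."""
    total = 0
    for node, node_parents in parents.items():
        parent_configurations = 1
        for parent in node_parents:
            parent_configurations *= states[parent]
        total += (states[node] - 1) * parent_configurations
    return total


# Assumed gold standard structure of Asia, written as a parent map.
ASIA_PARENTS: Dict[str, List[str]] = {
    "asia": [],
    "smoke": [],
    "tub": ["asia"],
    "lung": ["smoke"],
    "bronc": ["smoke"],
    "either": ["tub", "lung"],
    "xray": ["either"],
    "dysp": ["bronc", "either"],
}
ASIA_STATES = {node: 2 for node in ASIA_PARENTS}  # all variables are binary

n_nodes = len(ASIA_PARENTS)
n_arcs = sum(len(node_parents) for node_parents in ASIA_PARENTS.values())
n_params = count_parameters(ASIA_PARENTS, ASIA_STATES)
print(n_nodes, n_arcs, n_params)  # 8 8 18
```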
The evaluation compared the learned structures and the processing time of the proposed method with those of the EM-MCMC, MMHC and K2 algorithms. EM-MCMC is used in this experiment because it is an MCMC method and thus easily comparable to the proposed MCMC-Fuzzy algorithm. MMHC and K2 are used because they are widely known and used algorithms for learning Bayesian network structures.

Network Structure Comparison
In this section, we present a comparison of the Bayesian network structures learned from the proposed MCMC-Fuzzy algorithm, to the EM-MCMC, MMHC and K2 algorithms. The algorithms used for comparison were selected due to popularity, efficiency and broad utilization. Figure 6 presents the structure learned from applying the MCMC-Fuzzy algorithm for the Asia database. The outlined arcs represent arcs correctly identified by the algorithm. Figure 7 presents the Bayesian network learned by the MCMC-Fuzzy algorithm for the Alarm database.
Tables 3 and 4 present the results found by the K2, MMHC, EM-MCMC as well as by the MCMC-Fuzzy algorithm for the Asia and Alarm databases. The EM-MCMC and MCMC-Fuzzy have the number of iterations as their stop condition. Therefore, tests were performed using different values for this parameter. The gold standard presents the correct Bayesian network settings for the database.
The K2 algorithm is widely known for having good results in several databases. This is due to the input parameter defining the order of variables in the Bayesian network. The definition of this parameter is considered a complex task (Friedman and Koller, 2003). Moreover, structure learning is commonly done when there is little or no information about the data. The remaining algorithms learn the network structure without the need for additional parameters, which makes the learning process more difficult.
The MMHC algorithm is used in several applications, particularly because it scales well, with reasonable run times even with a large number of variables. When applied to the Asia and Alarm databases, the algorithm identified 4 correct arcs for Asia and 18 for Alarm. The MMHC results are roughly equivalent to the EM-MCMC results, with nearly the same numbers of correct, extra and missing arcs in both databases.
The MCMC-Fuzzy algorithm yielded good results for both databases. When applied to the Asia database, it identified 6 correct arcs, which is comparable to the results for K2, with the advantage of not requiring additional parameters.
For the Alarm database, MCMC-Fuzzy found the largest number of correct arcs and the smallest number of missing arcs among the algorithms that do not require a node ordering. However, these results are still inferior to those of K2. MCMC-Fuzzy also identified more extra arcs than any other method.
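The counts of correct, extra and missing arcs used in this comparison can be computed from two arc lists. The sketch below is one possible convention, not necessarily the one used to build Tables 3 and 4: arcs are treated as directed, so a reversed arc counts as one extra arc and one missing arc. The function name `compare_arcs` and the four-node example are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Arc = Tuple[str, str]  # (parent, child)


def compare_arcs(learned: Iterable[Arc], gold: Iterable[Arc]) -> Dict[str, int]:
    """Count learned arcs that are correct, extra, or missing relative to the gold standard."""
    learned_set, gold_set = set(learned), set(gold)
    return {
        "correct": len(learned_set & gold_set),  # present in both, same direction
        "extra": len(learned_set - gold_set),    # learned but not in the gold standard
        "missing": len(gold_set - learned_set),  # in the gold standard but not learned
    }


# Hypothetical example: one arc recovered, one reversed, one misplaced.
gold = [("A", "B"), ("B", "C"), ("C", "D")]
learned = [("A", "B"), ("C", "B"), ("A", "D")]
print(compare_arcs(learned, gold))  # {'correct': 1, 'extra': 2, 'missing': 2}
```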

Processing Time
The processing time tests were performed on a computer with an Intel Core i7 processor and 8 GB of RAM, running Mac OS X 10.9. Table 5 presents the processing times of the algorithms. MMHC is known for short processing times and is widely used with large databases. The processing times show that the improvement in the structure came at a cost, which is still acceptable, since structure learning only needs to be redone when the data change. In addition, the number of iterations of the MCMC methods directly influences their processing times.
K2 had the best results regarding processing times. However, the algorithm depends on prior knowledge of the database, since the order of variables must be provided as an input parameter. The MCMC-Fuzzy method proved effective, yielding the best structure learning results among the algorithms that learn Bayesian networks only from data. Furthermore, its processing time could be improved through code optimizations.

Conclusions and Future Work
In this study, we proposed a fuzzy metric that combines distinct common metrics for Bayesian network structure learning. This metric was applied to an MCMC heuristic search algorithm, called MCMC-Fuzzy, and evaluated using two synthetic databases.
The combination of different metrics resulted in a fuzzy system capable of evaluating different properties of Bayesian networks simultaneously. The modeled metric is also generic enough to be applied to other heuristic-based search algorithms.
The proposed algorithm, MCMC-Fuzzy, has results comparable to those of the K2 algorithm for the Asia dataset. K2 has the best results for the Alarm dataset, with MCMC-Fuzzy having the second-best results. However, the K2 algorithm requires the order of the variables as a parameter, which is often complex to determine. Considering algorithms that do not require additional parameters, the proposed MCMC-Fuzzy algorithm produced the most accurate Bayesian network structure for both datasets. However, MCMC-Fuzzy also identified many extra arcs in comparison with the other methods and hence obtained a more complex network structure. In relation to processing time, the MMHC algorithm had the best results. However, code optimizations can be applied to MCMC-Fuzzy in order to improve its time performance. We believe that the MCMC-Fuzzy algorithm can be used when little information is available about the dataset and when the extra processing time is not an issue.
Future work includes more experimentation with the rule system of the fuzzy metric, which can affect the accuracy of the learned structures as well as the running time. Future work also includes applying the proposed fuzzy metric to other heuristic search algorithms.