Fuzzy Method for Online Learning of Bayesian Network Parameters

In learning problems, there are situations where the training data are not fully available at learning time. They are generated incrementally over time, defining a type of domain called online, whose characteristics include the possibility of data failure or missing data. In Bayesian networks, learning is divided into two categories: structure (related to the graph of conditional relations) and parameters (related to the strength of conditional relations). In this work we present an online parameter learning method that quickly adapts to changes in the environment, aiming not only to reproduce the probability distribution (generative learning) but also to increase the accuracy of the network (discriminative learning). Our approach is compared with the Adaptive Voting EM method under two simulation conditions: when distributions are unknown and when distributions undergo abrupt changes. The proposed method achieves good results in both situations by adjusting to environment changes more quickly and by simplifying the parameterization of the traditional approach.


Introduction
Bayesian Networks (BN) have become extremely popular in recent decades because they are able to map the relationships between variables (Friedman et al., 1997). In addition, they are an appropriate language with efficient resources for representing the joint probability distribution over a set of random variables. The technique is even more attractive because it can model real-world problems and because the network can be interpreted by non-specialists (Zhou, 2015).
The learning process in Bayesian Networks is divided into structure learning and parameter learning (Kurihara et al., 2001; Chen et al., 2001; Zhang and Liu, 2008). While the first aims to build the network graph, the second focuses on updating the conditional probabilities among the variables.
Parameter learning algorithms are divided into two main categories: generative and discriminative (Su et al., 2008). The first creates conditional probabilities considering the data distribution, and the second does so with the objective of increasing the accuracy of the network. Among the most used generative algorithms are Maximum Likelihood Estimation (MLE), obtained directly from the dataset, and the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for the case of missing data.
One of the difficulties in parameter learning is the computational complexity of the algorithms, since the problem is NP-hard in the worst case (Ratnapinda and Druzdzel, 2015). There is also the risk of the algorithm stopping at a local maximum (Myers et al., 1999).
Online parameter learning is usually accomplished by adapting generative methods so that they weight the influence of new data against past data. The goal of those methods is model convergence, that is, to reproduce the distribution of the data in the Conditional Probability Tables (CPT).
Although Bayesian reasoning is probabilistic, it is possible to combine complementary techniques and reasoning in BN. Take, for example, the Fuzzy-Bayesian model (Brignoli, 2013), which combines fuzzy reasoning with probabilistic reasoning.
In a previous work e Lima (2014), a discretization method was developed for Bayesian networks through rules of data-based cuts and the overall optimization of them using genetic algorithm.
By combining different techniques for reasoning under uncertainty, it is possible to address more than one facet of the same problem. In this work we propose a method that performs online parameter learning as a hybrid between the discriminative approach and the generative approach. The proposed method is based on Voting EM (Cohen et al., 2001b) and inherits some of its characteristics, such as online learning and the tolerance of missing data during learning.
Although there are other parameter learning methods in the literature that combine the discriminative and generative approaches, this is usually done by separating the variables into two distinct sets. The first set is treated in a generative way (where the goal is to reproduce the data distribution) and the second in a discriminative way (where the goal is to increase the accuracy in classification problems). The proposed method carries out this hybridism in an integrated way: the same variable is learned simultaneously in a generative and a discriminative way through a fuzzy learning system.

Related Work
Models in Bayesian networks are built from the graph topology and the conditional probabilities between the variables. These two properties are defined through the process known as learning:
• Structure learning: determines whether or not there is independence between the variables of the BN and assigns a score to each candidate structure
• Parameter learning: estimates the conditional probabilities among the variables
Parameter learning is related to the estimation of Conditional Probability Tables (CPT) and is divided into two approaches: generative and discriminative.
In generative learning the conditional probabilities are computed directly from data. In the discriminative approach, on the other hand, learning considers the conditional probability of the variables in order to increase the accuracy of the BN (Su et al., 2008).
Generative learning is based on the distribution of the data and seeks to maximize the likelihood (Zhou, 2015). The most common method is Maximum Likelihood Estimation (MLE), for cases where there is no missing data (Friedman et al., 1997).
When missing data occur, the EM method is the most used (Dempster et al., 1977). It estimates the parameters through a loop that alternates between two steps, the E-step and the M-step, until it reaches convergence. Some variations of the EM method have been proposed in the literature, for example EM($\eta$) (Bauer et al., 1997). This algorithm defines the concept of learning rate in EM and the update rules for a Bayesian network. The Voting EM method is an online version of EM($\eta$) (Cohen et al., 2001b; Cohen et al., 2001a). The main features of Voting EM are:
• Adaptation to changes in the data distribution
• Ability to escape from local maxima of the likelihood function
• Faster convergence than the MLE method
• Faster adaptation than MLE when the data distribution changes
Another method related to EM is the EM-like method proposed by Salojärvi et al. (2005). It is a discriminative version of EM and aims to maximize conditional probabilities rather than the likelihood, as the classical EM method does.
The pioneering method in the discriminative approach is the Extension to Logistic Regression (ELR) proposed by Greiner and Zhou (2002), where CPTs are estimated by a gradient-based process that maximizes the conditional probability. The authors show that discriminative learning requires fewer training instances than generative learning to converge and that it usually leads to a more efficient classifier. However, the computational cost can be significantly higher.
Raina et al. (2003) propose a hybrid method between the generative and discriminative approaches. The method divides the variables into two groups: discriminative and generative. If a variable has a direct influence on classification, it is learned in a discriminative way and, if not, in a generative way. The method obtained a high accuracy rate and a low error when compared to ELR.
Kang and Tian (2006) propose HBayes-NB, a hybrid approach to learning parameters and structure. HBayes-NB relaxes the naïve Bayes topology by creating additional arcs in the graph. The variables are separated into two sets: discriminative and generative. Discriminative learning is done by the ELR method and generative learning by MLE. The method obtained good results when tested on public databases and compared with state-of-the-art methods in classification problems.
Liu and Liao (2008) propose an online learning method that combines MLE and VotingEM. The method changes the VotingEM learning rate in proportion to the arrival time of the data, in a similar way to the MLE method. It obtained results similar to VotingEM but proved less sensitive to the parameter configuration.
Su et al. (2008) propose the Discriminative Frequency Estimate (DFE), which learns parameters in a discriminative way considering the data frequency. DFE is a variation of the MLE method and uses the error (loss) as a penalty in learning. The method was compared with MLE, ELR and an ensemble method on several public databases of the UCI repository. DFE obtained good results and the authors conclude that the method is computationally efficient, converges quickly and has results similar to the state-of-the-art methods.
Pernkopf and Wohlmayr (2009) propose three discriminative methods of parameter learning. The first is an extension to BN of the Baum-Welch algorithm (Bridle, 1990). The other two methods are based on EM-like (Salojärvi et al., 2005): the Exact Conditional Likelihood (ECL) method and the Approximate Conditional Likelihood (ACL) method. The methods were tested on public databases and compared with the MLE method, obtaining superior results in classification problems.
Xue and Titterington (2010) propose Joint Discriminative Generative Modelling (JoDiG). The method performs parameter learning by dividing the variables into two sets: discriminative and generative. A variable is treated in a discriminative manner if the process or function that originates the data is not found, that is, if it does not adhere well to some probability distribution function. The method was tested on public databases of the UCI repository and obtained similar or better results than methods that are only discriminative or only generative.
Jing et al. (2011) propose a parameter learning method based on the theory of iterative learning control. The proposed algorithm provides the dynamic system and the rules for updating the CPT. The authors analyzed the convergence of the algorithm and concluded that the conditional probabilities obtained accurately reflected the desired ones. In addition, the convergence rate was significantly improved when compared to other learning algorithms in the literature.
Carvalho et al. (2011) propose a data-based, parameter-free score metric through the factorization of the Conditional Log-Likelihood (CLL). The technique is used both for structure learning and for parameter learning, aiming to increase the classification accuracy of the BN. The authors obtained good results by comparing the proposed method with other classifiers considered state of the art on public databases. In addition, the authors concluded that the computational time of the technique is significantly lower.
Broeck et al. (2014) propose a new family of algorithms for parameter learning with missing data. The main features are: parameters are computed in a non-iterative way, estimates are obtained without the need for Bayesian inference, and the estimation of parameters is consistent for large databases. The authors conclude that the algorithms are faster than EM and avoid local minima.
This paper aims to explore the hybridism between fuzzy logic, basic statistics and Bayesian inference to compose an online parameter learning method that is able to combine elements of generative and discriminative learning in Bayesian Networks.

Bayesian Networks
A Bayesian Network (BN) (Pearl, 1988) is a model for representing and reasoning about uncertainty that uses the conditional probability between variables of a specific domain, expressed by a Directed Acyclic Graph (DAG). Its graphical structure can handle correlations between variables effectively, with an appropriate language and efficient resources to represent the joint probability distribution over a set of random variables (Friedman and Goldszmidt, 1996).
Formally, a BN is a pair $(S, P)$, where $S = (X, E)$ is a DAG. The nodes $X = \{X_1, \dots, X_n\}$ represent the variables and the edges $E = \{e_1, \dots, e_m\}$ represent direct dependencies between nodes in $X$.
$P$ is defined as a set of probabilistic parameters expressed through tables. Given a particular variable, a conditional probability distribution is defined for each of its classes/values. With that configuration, the network establishes that a variable is independent of all other variables except its descendants in the graph, given the state of its parents. Inference inside the network is done by Bayes' theorem, for $P(X_j) > 0$:

$$P(X_i \mid X_j) = \frac{P(X_j \mid X_i)\, P(X_i)}{P(X_j)}$$

The joint probability is determined by the so-called chain rule and assumes conditional independence between the variables:

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_i)$$

where $Pa_i$ denotes the set of parent nodes of $X_i$. BN reasoning is established in two distinct scenarios:
• if "input" then "output"
• if "output" then "input"

  
Considering all the possible topologies for a Bayesian network, the well-known naïve Bayes structure is the simplest one. It assumes that all variables are mutually independent given the class. Although this model does not reflect reality in most real-world tasks, it is very effective, because the parameters of each attribute can be learned separately, which facilitates the learning process (McCallum and Nigam, 1998). The naïve Bayes topology is therefore a set of mutually independent variables that work as the input and collectively have a single parent (the output node).

Parameters Learning
Parameter learning is related to filling the CPT in a fixed structure $S^*$. That is, it is assumed that there is a joint probability distribution $P(\cdot)$ that represents a domain.

Generative Parameters Learning
Generative learning is based on the data set and seeks to maximize the likelihood (Zhou, 2015); it is known as Maximum Likelihood Estimation (MLE) (Friedman et al., 1997). The MLE estimate for each CPT entry after $T$ samples, without missing data, is given by:

$$\theta_{ijk}^{T} = \frac{N_{ijk}^{T}}{N_{ij}^{T}}$$

where $N_{ijk}^{T}$ is the number of times that the data was observed in the configuration $x_i^k$ for the parent set $pa_i^j$, and $N_{ij}^{T}$ is the total number of observations of $X_i$ with that parent set.
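The counting estimate above can be illustrated with a minimal Python sketch. The function name `mle_cpt`, the data layout (a list of dictionaries) and the four toy samples are assumptions made for illustration only; the node names Parent and Child1 are borrowed from the network used later in the experiments.

```python
from collections import Counter

def mle_cpt(samples, child, parents):
    """Estimate P(child | parents) as N_ijk / N_ij from complete samples."""
    n_ijk = Counter()  # counts of (parent configuration, child value)
    n_ij = Counter()   # counts of the parent configuration alone
    for s in samples:
        pa = tuple(s[p] for p in parents)
        n_ijk[(pa, s[child])] += 1
        n_ij[pa] += 1
    return {(pa, k): n / n_ij[pa] for (pa, k), n in n_ijk.items()}

# Hypothetical complete data (no missing values).
samples = [
    {"Parent": 0, "Child1": 0},
    {"Parent": 0, "Child1": 1},
    {"Parent": 0, "Child1": 1},
    {"Parent": 1, "Child1": 0},
]
cpt = mle_cpt(samples, "Child1", ["Parent"])
```

Configurations that never occur in the data simply do not appear in the returned dictionary; a practical implementation would need a policy for them, such as a prior or smoothing.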

Parameters Learning with Missing data
Missing data can be divided into three categories (Rubin, 1976):
• MCAR: Missing completely at random
• MAR: Missing at random
• NMAR: Missing not at random
Missing data of type MCAR are those that have the highest degree of randomness and occur when the likelihood of finding a missing value is the same for all variables in any dataset. For example, in a network of sensors, some of them randomly fail to capture data at certain times.
Data of type MAR occur when a variable $X_j$ of the dataset influences the existence of missing data in a different variable $X_i$. For example, imagine a network of security sensors that capture the temperature and the existence of movement in a particular environment. Also imagine that some motion sensors have environment-sensitive hardware: at higher temperatures they cannot always capture the existence of movement. In this case a variable other than the one observed changes the likelihood of missing data.
Missing data are considered NMAR when they are related to unobserved events or to the attribute itself. For example, when the ambient temperature influences the ability of the sensor to capture the temperature itself, or when the factor influencing the occurrence of missing data is unknown.
Parameter learning with missing data can be summed up in three different approaches:
• Ignore/Discard data: the simplest way to deal with missing data, because it removes a data entry or even a variable.
A variable with missing data is not a hidden variable: there is data for the variable, but not in all cases.
The EM algorithm (Dempster et al., 1977) enables parameter estimation in models with missing data and is the most used algorithm in the literature (Zhou, 2015). This algorithm uses an iterative scheme that alternates between two steps (E step and M step) until it reaches convergence.
In a given instance $y_l$ it is possible to have missing data ($Z_l = \{z_{l1}, \dots, z_{lo}\}$) and observed variables ($\Gamma_l = \{\gamma_{l1}, \dots, \gamma_{lh}\}$), where $o + h = n$. The steps for convergence are given by:
• E step (expectation step): starts from the current parameter setting $\theta^{(t)}$, where the first iteration is given by $\theta^{(0)}$, initialized with random values. The expectation is calculated through the likelihood function considering the data set $D$:

$$l(\theta \mid \theta^{(t)}) = \sum_{l} E_{Z_l \mid \Gamma_l, \theta^{(t)}} \left[ \log P(\Gamma_l, Z_l \mid \theta) \right]$$

• M step (maximization step): calculates the new parameter estimate $\theta^{(t+1)}$ by maximizing the result of the first step:

$$\theta^{(t+1)} = \arg\max_{\theta} \; l(\theta \mid \theta^{(t)})$$

Algorithm 1 describes the computational approach of EM.

Algorithm 1 Expectation Maximization (EM)
    $\theta \leftarrow$ random values
    while not converged do
        Step E: use $\gamma_l$ to calculate $l(\theta \mid \theta^{(t)})$
        Step M: replace $\theta$ by $\arg\max_{\theta} l(\theta \mid \theta^{(t)})$
    end while
    return $\theta$

Another type of approach to learning with missing data uses gradient methods, which are an alternative for cases where the BN has continuous variables (Binder et al., 1997; Buntine, 1994).
Other forms of learning with missing data have been developed in the literature, using Monte Carlo methods or Gaussian approximation (Barber, 2012). In addition, there are mixed approaches, such as that of Mohan et al. (2013), which proposes a learning method focused on data of type MCAR and MAR through a BN that represents the relationship between these variables.
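To make the E and M steps of Algorithm 1 concrete, the following is a minimal sketch for a two-node binary network $X \rightarrow Y$ in which $X$ may be missing and $Y$ is always observed. The function name `em_two_node`, the fixed iteration count used in place of a convergence test, the starting values and the toy data are all assumptions for illustration, not part of the methods discussed above.

```python
def em_two_node(data, px=0.5, py=(0.4, 0.6), iters=50):
    """EM for a binary network X -> Y; x is None when X is missing.

    px is P(X = 1) and py[v] is P(Y = 1 | X = v). Starting values must be
    strictly between 0 and 1 so that every record has positive likelihood.
    """
    py = list(py)
    for _ in range(iters):
        # E step: accumulate expected counts under the current parameters.
        n_x = [0.0, 0.0]    # expected count of X = v
        n_xy1 = [0.0, 0.0]  # expected count of (X = v, Y = 1)
        for x, y in data:
            if x is None:
                # Posterior of the missing X given the observed Y.
                lik = [
                    (px if v else 1 - px) * (py[v] if y else 1 - py[v])
                    for v in (0, 1)
                ]
                total = sum(lik)
                w = [value / total for value in lik]
            else:
                w = [1.0 - x, float(x)]
            for v in (0, 1):
                n_x[v] += w[v]
                n_xy1[v] += w[v] * y
        # M step: re-estimate the parameters from the expected counts.
        px = n_x[1] / len(data)
        py = [n_xy1[v] / n_x[v] if n_x[v] > 0 else py[v] for v in (0, 1)]
    return px, py

# Hypothetical data: complete records first, then records with X missing.
complete = [(0, 0), (0, 1), (1, 1), (1, 1)]
px_c, py_c = em_two_node(complete)
partial = [(0, 0), (0, 0), (1, 1), (1, 1), (None, 1), (None, 0)]
px_p, py_p = em_two_node(partial)
```

With complete data the expected counts are ordinary counts, so the result coincides with the MLE estimate after the first iteration.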

Discriminative Learning of Parameters
Discriminative learning is characterized by having the increase of accuracy in the BN as its main objective. However, discriminative learning has a greater computational complexity than generative learning and is considered an NP-hard problem (Greiner and Zhou, 2002).
In this type of learning the goal is to find parameters that maximize the conditional log-likelihood, as opposed to simply maximizing the likelihood. However, there is no closed formula to find the best parameters of the network, since the conditional likelihood cannot be decomposed (Friedman et al., 1997). One consequence is that discriminative learning generally uses heuristic search methods to establish the conditional probabilities (Su et al., 2008), or hybrid approaches between discriminative and generative algorithms, as in Raina et al. (2003), Xue and Titterington (2010) and Kang and Tian (2006).
Among the studies in this area, it is possible to cite those with a purely discriminative approach (Greiner and Zhou, 2002).

Online Parameter Learning
In machine learning, online learning methods are those that learn from data made available in a sequential or iterative way. It is a type of adaptive learning and considers that the domain changes with time. This is the opposite of batch learning, in which all data are available at training time.
Some of the algorithms most commonly used in the BN context rely on generative learning, such as Cohen et al. (2001b), who propose the VotingEM method based on the rules defined by Bauer et al. (1997) using concepts of maximum likelihood.

VotingEM
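The Voting EM algorithm (Cohen et al., 2001b) is a direct adaptation of EM($\eta$) for online use. The update rule is given by:

$$\theta_{ijk}^{t} = \begin{cases} \theta_{ijk}^{t-1} + \eta \left[ \dfrac{P(x_i^k, pa_i^j \mid d_t)}{P(pa_i^j \mid d_t)} - \theta_{ijk}^{t-1} \right] & \text{if } P(pa_i^j \mid d_t) \neq 0 \\ \theta_{ijk}^{t-1} & \text{otherwise} \end{cases}$$

where $d_t = (y_t, \theta^{t-1})$, $t \in T = \{0, \dots, t, \dots\}$ is the current temporal unit and $\theta_{ijk}^{0}$ is populated by random or pretrained values.
The learning rate $\eta$ expresses how much the past is trusted relative to the present data. When $\eta$ approaches 1, the present data are considered more reliable and past knowledge is gradually discarded. The rate can be fixed for the whole learning process or change over time (Section 3.4.2).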
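The role of $\eta$ can be illustrated with a minimal sketch of one update step for a fully observed sample. It assumes the usual form of the rule, in which each entry moves toward the posterior of the observed value by a fraction $\eta$; with complete evidence that posterior is 1 for the observed value and 0 for the others, and only the CPT row of the observed parent configuration changes. The function name `voting_em_update` and the toy stream of observations are hypothetical.

```python
def voting_em_update(theta_row, observed_k, eta):
    """One Voting EM style step on a CPT row for a fully observed sample.

    theta_row holds the probabilities of the values of X_i for the observed
    parent configuration; observed_k is the index of the observed value.
    """
    return [
        p + eta * ((1.0 if k == observed_k else 0.0) - p)
        for k, p in enumerate(theta_row)
    ]

row = [0.5, 0.5]
for obs in [1, 1, 0, 1]:  # hypothetical stream of observed values
    row = voting_em_update(row, obs, eta=0.1)

# With eta = 1 the past is discarded and the row becomes one-hot.
forgetful = voting_em_update([0.5, 0.5], 0, eta=1.0)
```

Because every entry is shifted toward a distribution that sums to one, the row remains a valid probability distribution after each step.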

Adaptive VotingEM
One of the critical points of Voting EM is determining the learning rate $\eta$, because the choice of this parameter varies according to the application domain. In addition, a specific case $x_i^k$ with parent configuration $pa_i^j$ can be very frequent or rarely appear in the database. With a fixed $\eta$, the influence of the data on the CPT is always the same for all variables, which makes the algorithm generic.
As a way to deal with the problem, Cohen et al. (2001a) propose the Adaptive Voting EM. It is based on the following principles:
• The learning rate $\eta$ should be reduced when approaching convergence
• $\eta$ should be increased when there is a large error between the average value of $\theta_{ijk}$ and $\theta_{ijk}^{t}$
• A value of $\eta$ is defined for each $pa_i^j$, named $\eta_{ij}$
The method is based on the traditional VotingEM, but the value of $\eta_{ij}$ is updated at each time iteration and it uses three input parameters:
• $q$: defines how many standard deviations of error are acceptable before increasing $\eta_{ij}$
• $\alpha$: defines what is considered convergence in order to decrease $\eta_{ij}$
• $m$: defines in which proportion $\eta_{ij}$ will be increased or decreased
The method variance is calculated by:
Cohen et al. (2001a) prove that $\eta_{ij}$ decreases proportionally to $1/t_n$, where $t_n$ is the number of times that $Pa_i = pa_i^j$, which leads to an optimal asymptotic convergence at some local maximum.

Our Proposal
In this work we propose a hybrid method (discriminative and generative) that addresses incremental or online learning of parameters in Bayesian networks through a fuzzy system. The method is based on the VotingEM algorithm (Cohen et al., 2001b), which is an incremental version of EM($\eta$) (Bauer et al., 1997). Figure 1 summarizes the proposed fuzzy system, which has two types of input variables: trend and classification error. The output variable is the adjustment level $m$, which determines the variation of the learning rate $\eta_{ij}$. The method is described in Algorithm 2.

Definition of Variables
The following definitions are made:
• Trend: long-term behavior of the time series, which can be constant, growing or declining. Here, the term trend is adopted for the cases of growth or decline
• Convergence: constant trend behavior, when the series concentrates around a certain point
• Precision: degree of variation of a set of measures, in this case the variation of $\theta_{ij}$. A high precision in $\theta_{ij}$ demonstrates its convergence
• Accuracy: hit rate of the network in a classification problem. In the discriminative approach it is portrayed by the a posteriori conditional probability of the output variable
• Classification error: the complementary measure to that conditional probability, $class\_error = 1 - P(\cdot \mid \cdot)$

Trend
In online learning, the data $D = \{y_1, \dots, y_N\}$ are not fully available during network training and are made available incrementally in time, forming a time series.
Online learning methods are usually approached in a generative way and determine the influence of an evidence $y_t$ on the CPT set $\theta$ that defines the BN. The purpose of these methods is to reproduce the distribution of the data in $\theta$, and convergence is achieved by successively decreasing the influence of $y_t$ in proportion to $1/t$ (Cohen et al., 2001b).
The influence of the data is controlled by the learning rate $\eta_{ij}$, and the proposed method determines its value at each time iteration. Similarly to VotingEM, the rate $\eta_{ij}$ is increased when the error in $\theta_{ij}$ is considered relevant and decreased when $\theta_{ij}$ is converging. The difference between VotingEM and the proposed method lies in the approach and in the concepts of error and convergence.
The proposed method is based on the concept of trend: considering that $D$ is a time series, we define $\theta^{t}$ as the CPT set that composes the BN at time $t$. Determining whether a series has a trend is usually carried out through statistical tests at a significance level $\alpha$, with two hypotheses: the null hypothesis that the series has no trend and the alternative hypothesis that it has one. In this work two tests from the literature are used to determine the trend of the series: the Mann-Kendall and Cox-Stuart tests. The tests assess whether or not there is a trend by calculating the p-value and comparing it with $\alpha$.
The trend tests are performed for each $\theta_{ijk}$, that is, for each parent configuration $pa_i^j$ of $X_i$. However, since the learning rate is defined as $\eta_{ij}$, the trend should be calculated for the whole set $\theta_{ij} = \{\theta_{ij1}, \dots, \theta_{ijk}, \dots\}$ in a global way.
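As an illustration of how a trend p-value can be obtained for a window of $\theta_{ijk}$ values, the following sketch computes the two-sided Mann-Kendall p-value with the standard normal approximation. It is a simplified version that omits the correction for tied values, and the function name `mann_kendall_p_value` and the two toy series are assumptions for illustration.

```python
import math

def mann_kendall_p_value(series):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(series)
    if n < 3:
        return 1.0  # too few points to say anything about a trend
    # S statistic: number of increasing pairs minus number of decreasing pairs.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected Z score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # p = 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

p_trend = mann_kendall_p_value([0.10 + 0.02 * t for t in range(20)])
p_flat = mann_kendall_p_value([0.5, 0.6] * 10)
```

A small p-value, as for the steadily growing series, indicates a trend, while an oscillating series around a fixed level yields a large p-value and is compatible with convergence.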
In addition to determining whether or not there is a trend in $\theta_{ij}$, other issues are raised:
• How to quantify the trend?
• What would be a statistically significant trend? And a statistically non-significant trend?
• How to use the quantification of the trend as a precision measure for $\theta_{ij}$?
In this work, a fuzzy subsystem is proposed that uses the p-values of the Mann-Kendall and Cox-Stuart tests in order to answer these questions. The use of fuzzy functions to represent a statistical test was based on Costa (1999), who divides the p-value into three fuzzy sets: highly significant, significant and non-significant (Fig. 2). The Trend fuzzy subsystem is shown in Fig. 3. A new temporal window $\Theta_{ij}^{T+1}$ is created whenever the value of $\eta_{ij}$ is increased, that is, when convergence in the series is not detected.
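The division of the p-value into three fuzzy sets can be sketched as follows. The breakpoints 0.01, 0.05 and 0.10 and the piecewise-linear membership shapes are assumptions chosen only for illustration; they are not the membership functions of Fig. 2, and the function name `fuzzify_p_value` is hypothetical.

```python
def fuzzify_p_value(p, low=0.01, mid=0.05, high=0.10):
    """Membership degrees of a p-value in three fuzzy sets.

    The breakpoints (low, mid, high) and the linear shapes are hypothetical.
    """
    def falling(x, a, b):
        # Equal to 1 up to a, 0 from b onward, linear in between.
        if x <= a:
            return 1.0
        if x >= b:
            return 0.0
        return (b - x) / (b - a)

    highly = falling(p, low, mid)
    non = 1.0 - falling(p, mid, high)
    # The middle set is a triangle peaking at mid; the three degrees sum to 1.
    significant = max(0.0, 1.0 - highly - non)
    return {
        "highly_significant": highly,
        "significant": significant,
        "non_significant": non,
    }

memberships = fuzzify_p_value(0.03)
```

Under these assumed shapes, a p-value of 0.03 belongs partly to the highly significant set and partly to the significant set, which is the kind of graded answer a crisp comparison with $\alpha$ cannot give.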

Classification Error
The proposed method, in addition to seeking precision in $\theta_{ij}$, seeks to increase the accuracy of the network considering an output variable $X_s$. That is, it seeks to decrease the classification error, calculated as the complement of the posterior probability of the output class ($class\_error = 1 - P(\cdot \mid \cdot)$), where $X_s$ is the output node in a classification problem that assumes a value $x_s^k$ at time $t$, $\rho_t$ is the set with the evidence of $X_s$ at time $t$ and $\tau_t$ is the evidence of the input data, such that $y_t = \rho_t \cup \tau_t$ and $\rho_t \cap \tau_t = \emptyset$. When there is no missing data we can reduce the Equation 7 to:
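A minimal sketch of this error measure, assuming a naïve Bayes style network in which the output node is the single parent of the input nodes, is shown below. The function name `classification_error`, the class labels and all CPT values are hypothetical. Inputs that are missing are simply left out of the evidence, which marginalizes them out in this topology.

```python
def classification_error(prior, likelihoods, evidence, true_class):
    """Return 1 - P(X_s = true_class | evidence) for a naive Bayes style BN.

    prior[c] is P(X_s = c); likelihoods[var][c][value] is P(var = value | c);
    evidence maps each observed input variable to its value.
    """
    scores = {}
    for c, p in prior.items():
        for var, value in evidence.items():
            p *= likelihoods[var][c][value]
        scores[c] = p
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under the current CPTs")
    return 1.0 - scores[true_class] / total

# Hypothetical CPTs for an output node with classes "a" and "b".
prior = {"a": 0.5, "b": 0.5}
likelihoods = {
    "Child1": {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}},
    "Child2": {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.4, 1: 0.6}},
}
err_partial = classification_error(prior, likelihoods, {"Child1": 1}, "b")
err_no_evidence = classification_error(prior, likelihoods, {}, "b")
```

With no input evidence the error falls back to one minus the prior of the true class, and it shrinks as the evidence supports that class.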

Fuzzy Inference
The change in the learning rate $\eta_{ij}$ is driven by the combination of the error rate of $X_s$ and the quantification of the trend in $\theta_{ij}$. This combination is performed by a fuzzy system that has a set of input variables and an output variable called the adjustment level $m$, such that: The inputs in the proposed subsystem are defined by:

Adjustment level m
The value of $m$ is obtained by a fuzzy inference system of the Mamdani type and is defined by two fuzzy sets that model its behavior by increasing or decreasing $\eta_{ij}$ (Fig. 4).

Preliminary Results
The initial evaluation of the method performance was done by simulating an online environment considering the Bayesian network provided by Cohen et al. (2001b).
The proposed method was compared with two other methods from the literature: VotingEM and MLE Online. These methods were chosen for the initial comparison because of their popularity, efficiency and low computational cost (Liu and Liao, 2008; Zhou, 2015).
The initial experiments seek to evaluate the method in two distinct conditions:
• Condition 1: considering a totally random BN
• Condition 2: considering a pre-trained BN that has undergone abrupt changes in its probability distribution
Condition 1 is simulated through 2000 independent and identically distributed (i.i.d.) samples generated from the CPT. To simulate learning, a random BN is created and the samples generated from the original BN are sent sequentially to the learning methods. The resulting BN after the simulation of Condition 1 is used as input in Condition 2.
Condition 2 is simulated by an abrupt change in the conditional probability distribution of some values in $\theta_{ij}$ of the network obtained at the end of Condition 1. After the change, in a similar way to Condition 1, 2000 i.i.d. samples were generated from this new BN. The objective of Condition 2 is to analyze the capacity of the proposed method to adapt to changes in the environment. The evaluation of the results considers both Condition 1 and Condition 2. The Bayesian network (BN) used for the initial evaluation was proposed by Cohen et al. (2001b) and has three nodes: Parent, Child1 and Child2 (adopted as gold standard). The CPT set that composes it is shown in Table 1. The gold standard BN was used for the generation of 2000 samples for Condition 1.
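The two simulation conditions can be sketched as follows. The sketch assumes binary nodes and that Parent is the only parent of both Child1 and Child2; all probability values are hypothetical placeholders and do not reproduce Table 1, and the function name `sample_network` and the seed are assumptions.

```python
import random

# Hypothetical CPTs; the values of Table 1 are not reproduced here.
P_PARENT = 0.3               # P(Parent = 1)
P_CHILD1 = {0: 0.2, 1: 0.9}  # P(Child1 = 1 | Parent)
P_CHILD2 = {0: 0.6, 1: 0.1}  # P(Child2 = 1 | Parent)

def sample_network(n, p_parent, p_child1, p_child2, rng):
    """Draw n i.i.d. samples by ancestral sampling (parent before children)."""
    samples = []
    for _ in range(n):
        parent = int(rng.random() < p_parent)
        child1 = int(rng.random() < p_child1[parent])
        child2 = int(rng.random() < p_child2[parent])
        samples.append({"Parent": parent, "Child1": child1, "Child2": child2})
    return samples

rng = random.Random(0)
# Condition 1: samples from the gold standard network.
condition1 = sample_network(2000, P_PARENT, P_CHILD1, P_CHILD2, rng)
# Condition 2: one entry per node is changed abruptly before sampling again.
condition2 = sample_network(2000, 0.8, {0: 0.2, 1: 0.3}, {0: 0.9, 1: 0.1}, rng)
stream = condition1 + condition2
```

Feeding `stream` one sample at a time to a learner reproduces the online setting, with the distribution change occurring at sample 2001.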
Condition 2 is simulated by changing three $\theta_{ij}$ values in the gold standard BN, one in each node $X_i$. Similarly to Condition 1, 2000 samples were generated from the altered network. The experiment was performed using the parameters defined in Table 2, obtained empirically through experimentation. The MLE Online method does not require an initial parameterization. Figure 5 shows the convergence of the methods. Figures 5(a), (b) and (c) demonstrate convergence in three $\theta_{ijk}$ parameters and the following observations are made:
• The proposed method and VotingEM have similar convergence in Condition 1
• The proposed method perceives changes in the environment faster (Condition 2)
• The variability of the proposed method is greater
• Because it does not allow the learning rate to increase, the MLE Online method does not converge well
Figure 6 shows the overall convergence, measured by the log-likelihood of the trained BN at each new sample. The proposed method converges faster in Condition 2 than the other two methods.

Log-likelihood
The discriminative aspect is analyzed in Fig. 7, which reinforces the ability of the proposed method to decrease the error probability and its rapid convergence to lower errors. The same figure also demonstrates the abrupt drop in error obtained with the proposed approach, which reaches stability at approximately sample 400. The VotingEM method only reaches this stability around sample 600.

Conclusion
The initial analysis of the results shows that the proposed method achieved good performance both in the convergence of $\theta_{ijk}$ and in the decrease of the probability of classification error when compared to the other methods.
The proposed approach also simplifies parameterization by using only one configuration parameter while VotingEM uses three.
The following observations can be made:
• The proposed method is more sensitive to the environment, which results in a greater variability in the estimation of $\theta_{ijk}$
• MLE Online is not able to increase the learning rate during the simulation
• The proposed method perceives the distribution change faster than VotingEM and increases $\eta_{ijk}$ more significantly
The results reinforce the main characteristics of the proposed method: it perceives changes of distribution rapidly and unites the generative and discriminative approaches during learning. For this reason, there is a greater variation in the probability distributions, which is used as a resource to decrease the probability of classification error.