Bayesian Network Inference in Binary Logistic Regression: A Case Study of Salmonella sp Bacterial Contamination on Vannamei Shrimp

Corresponding Author: Pratnya Paramitha Oktaviana, Department of Statistics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia. Email: paramitha.oktaviana@gmail.com

Abstract: Binary logistic regression was recently used to identify which of four candidate predictor variables influence the response variable, namely the test result for Salmonella sp bacterial contamination on vannamei shrimp. That analysis found two predictor variables that significantly affect the test result: the test result for Salmonella sp bacterial contamination on the farmers' hand swabs and the subdistrict of the vannamei shrimp ponds. These significant predictor variables were modelled with a binary logit model. This paper studies the statistical associations between the two significant predictor variables and Salmonella sp bacterial contamination on vannamei shrimp, and builds a numerical simulation of the parameters of the two significant predictor variables using Bayesian network inference. A Directed Acyclic Graph (DAG) is used to represent the binary logit model of the significant factors in the Bayesian network inference.


Introduction
According to Hosmer and Lemeshow (2000), if there are $p$ predictor variables, denoted by the vector $\mathbf{x} = (x_1, x_2, \dots, x_p)$, and each of these variables is measured on at least an interval scale, the conditional probability of the outcome is denoted by $P(Y = 1 \mid \mathbf{x}) = \pi(\mathbf{x})$. The logistic regression model is:

$$\pi(\mathbf{x}) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)} \tag{1}$$

The logit of that model can then be written as:

$$g(\mathbf{x}) = \ln\left(\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

If any of the $p$ predictor variables are discrete or measured on a nominal scale, the method of choice is to use dummy variables. If a nominal scaled variable has $m$ possible values, then $m - 1$ dummy variables are needed. Suppose that the $j$th predictor variable $x_j$ has $m_j$ levels. The $m_j - 1$ dummy variables are denoted by $D_{jk}$ and their coefficients by $\beta_{jk}$, $k = 1, 2, \dots, m_j - 1$. The logit in this case can be written as:

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \sum_{k=1}^{m_j - 1} \beta_{jk} D_{jk} + \dots + \beta_p x_p$$

Binary logistic regression is a logistic regression in which the response variable is dichotomous (qualitative data with two categories) and the predictor variables may be polychotomous (qualitative or quantitative data).
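The dummy-variable coding and the logit described above can be illustrated with a minimal Python sketch. It is not the fitted model from this study: the level names (`district_a` and so on) and all coefficient values are hypothetical placeholders, and the first level of each nominal variable is assumed to be the reference category.

```python
import math

def dummy_code(level, levels):
    """Return the m-1 indicators for a nominal value; levels[0] is the reference."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1.0 if level == other else 0.0 for other in levels[1:]]

def logit(beta0, betas, design_row):
    """Linear predictor g(x) = beta0 + sum of coefficient * design value."""
    return beta0 + sum(b * d for b, d in zip(betas, design_row))

def inv_logit(g):
    """pi(x) = exp(g) / (1 + exp(g)), evaluated in a numerically stable form."""
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

# Hypothetical levels and coefficients, for illustration only.
HAND_SWAB = ["negative", "positive"]                      # m = 2 -> 1 dummy
SUBDISTRICT = ["district_a", "district_b", "district_c"]  # m = 3 -> 2 dummies
BETA0 = -1.0
BETAS = [0.8, 0.5, -0.3]  # one coefficient per dummy column, in design order

def predict(hand_swab, subdistrict):
    """Probability of a positive result for one combination of levels."""
    row = dummy_code(hand_swab, HAND_SWAB) + dummy_code(subdistrict, SUBDISTRICT)
    return inv_logit(logit(BETA0, BETAS, row))
```

Calling `predict("negative", "district_a")` evaluates the model at the reference categories, where every dummy is zero and the logit reduces to the intercept.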
Recently, the researchers used binary logistic regression to identify which of four predictor variables ($X_1$, $X_2$, $X_3$, $X_4$) influence the response variable ($Y$), the test result for Salmonella sp bacterial contamination on vannamei shrimp. The response variable $Y$ has two categories: 0 if the test indicates that there is no Salmonella sp on the vannamei shrimp, and 1 if the test indicates that there is Salmonella sp on the vannamei shrimp. The four predictor variables are: $X_1$, the test result for Salmonella sp bacterial contamination on the farmers' hand swabs (nominal scaled variable); $X_2$, the subdistrict of the vannamei shrimp ponds (nominal scaled variable); $X_3$, the fish processing unit that is supplied (nominal scaled variable); and $X_4$, the pond area in hectares (ratio scaled variable).
This method found two significant predictor variables, $X_1$ and $X_2$. These significant predictor variables were modelled with a binary logit model. This paper studies the statistical associations between the two significant predictor variables and Salmonella sp bacterial contamination on vannamei shrimp, and builds a numerical simulation of the parameters of the two significant predictor variables using Bayesian network inference. A DAG is used to represent the binary logit model of the significant factors in the Bayesian network inference. Neapolitan (1989), as cited in Stephenson (2000), explains that a Bayesian network is a specific type of graphical model, namely a DAG. All of the edges in the graph are directed (the edges point in a particular direction) and there are no cycles (there is no way to start from any node, travel along a set of directed edges in the correct direction and arrive back at the starting node). The edges in a Bayesian network encode the joint distribution of all variables. The joint probability indicated by one set of edges can equally be indicated by another set. Chen et al. (2015) explain that a Bayesian network is a set of variables, X and Y, that represents a joint probability distribution, for i = 1, 2, ..., n:
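As a point of reference, the factorization that a DAG usually implies for a joint distribution can be sketched in standard textbook notation. This is not necessarily the exact expression given by Chen et al. (2015); the symbols $V_i$ (the $i$th node) and $\mathrm{pa}(V_i)$ (the set of parents of $V_i$) are assumptions introduced here.

```latex
% Chain-rule factorization of a Bayesian network over nodes V_1, ..., V_n
P(V_1, V_2, \dots, V_n) = \prod_{i=1}^{n} P\bigl(V_i \mid \mathrm{pa}(V_i)\bigr)

% Illustrative special case, assuming X_1 and X_2 are parentless nodes
% and Y is their only child:
P(X_1, X_2, Y) = P(X_1)\, P(X_2)\, P(Y \mid X_1, X_2)
```

Under that assumed structure, the term $P(Y \mid X_1, X_2)$ is the part that a binary logit model would supply.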

Bayesian Network Inference
A DAG is used to depict all of the parameters and variables in the model and to connect them with edges (Liu, 2012). The node icons used in the DAG are shown in Fig. 1. There are three types of node:
1. Stochastic Node: represents a random variable, for example $x_i \sim N(\mu, \sigma^2)$.
2. Logical Node: represents a variable that is defined by other variables, generally used for prediction.
3. Constant Node: represents an observed value, a hyper-parameter or a constant.
The link functions available in the DAG are presented in Table 1.
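The three node types can be made concrete with a small forward simulation, given here as a sketch only. It assumes a single 0/1 covariate, Normal priors on two coefficients and a prior standard deviation of 2.0; none of these choices come from the study, and the function names are hypothetical.

```python
import math
import random

def inv_logit(g):
    """Inverse of the logit link, evaluated in a numerically stable form."""
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

def simulate_once(x, rng, prior_sd=2.0):
    """One forward pass through a small DAG for a logistic model.

    x            : constant nodes (fixed 0/1 covariate values)
    beta0, beta1 : stochastic nodes, drawn from Normal(0, prior_sd^2)
    p[i]         : logical nodes, deterministic functions of their parents
    y[i]         : stochastic nodes, drawn from Bernoulli(p[i])
    """
    beta0 = rng.gauss(0.0, prior_sd)
    beta1 = rng.gauss(0.0, prior_sd)
    p = [inv_logit(beta0 + beta1 * xi) for xi in x]
    y = [1 if rng.random() < pi else 0 for pi in p]
    return {"beta0": beta0, "beta1": beta1, "p": p, "y": y}

# Example call with an assumed covariate vector and a fixed seed.
draw = simulate_once([0, 1, 1, 0], random.Random(0))
```

Because `p[i]` is a logical node, two observations with the same covariate value always share the same probability within a draw, whereas the stochastic nodes `y[i]` can still differ.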

Analysis and Result
The binary logistic regression analysis found two predictor variables that significantly affect the test result for Salmonella sp bacterial contamination on vannamei shrimp ($Y$): the test result for Salmonella sp bacterial contamination on the farmers' hand swabs ($X_1$) and the subdistrict of the vannamei shrimp ponds ($X_2$). All of the research variables are shown in Table 2.
The parameter estimates obtained from the Bayesian network are expected to show the statistical associations between $X_1$ and $X_2$ clearly. This Bayesian network for binary logistic regression also uses the first category as the reference category, as in the previous binary logistic regression. The DAG of this Bayesian network is shown in Fig. 2. The model of the DAG is given in Fig. 3.
In this Bayesian network analysis, three Markov chains are run in the simulation process. Two conditions must hold before the Bayesian analysis can continue: the posterior distributions of the parameters must be stationary, and the parameters must have converged. Time series plots of the chain histories are used to check the stationarity of the posterior distributions. Figure 4 shows that the posterior distributions of the parameters are stationary. Figure 5 shows the Gelman-Rubin statistics of the parameters, which indicate that the parameters have converged. Therefore, the Bayesian network analysis can proceed.

Table 3.
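The convergence check on three chains can be sketched with the classic variance-based form of the Gelman-Rubin statistic (the potential scale reduction factor). This is an assumption about the general technique, not a reproduction of the study's computation: the software behind Fig. 5 may use a different variant, and the chains below are synthetic draws rather than posterior samples from the model.

```python
import random
import statistics

def gelman_rubin(chains):
    """Variance-based potential scale reduction factor for one parameter.

    chains: list of m equal-length lists, each holding n draws.
    Values close to 1 suggest that the chains agree with each other.
    """
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two equal-length chains with two or more draws")
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Synthetic example: three chains drawn from the same distribution versus
# three chains centred on different values.
rng = random.Random(42)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0, 6.0)]
r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

Here `r_mixed` comes out close to 1, while `r_stuck` is well above 1 because the between-chain variance dominates the within-chain variance.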

Conclusion
The Bayesian network analysis of the binary logistic regression yields the statistical associations between the significant predictor variables and Salmonella sp bacterial contamination on vannamei shrimp, expressed as the following probabilities: