The Tilted Beta Binomial Linear Regression Model: a Bayesian Approach

This paper proposes new linear regression models for overdispersed binomial datasets. These models, called tilted beta binomial regression models, are built on the tilted beta binomial distribution, which is obtained by assuming that the probability parameter of the binomial distribution follows a tilted beta distribution. As a particular case of these regression models, we propose the beta rectangular binomial regression models, obtained by assuming that the probability parameter of the binomial distribution follows a beta rectangular distribution. The regression models are defined by assuming that the parameters of these new distributions follow regression structures, and they are fitted with Bayesian methods using the OpenBUGS software. The proposed models are fitted to an overdispersed binomial dataset on the number of seeds that germinate depending on the type of seed and root.


1 Departamento Estadística, Universidad Nacional de Colombia. Email: ecepedac@unal.edu.co
2 Departamento Matemáticas, Universidad Nacional Colombia. Email: mvcifuentesa@unal.edu.co

Introduction
The binomial distribution is normally used to model the number of successes obtained in a finite number of experiments. However, in these cases, it is often found that the variance of the response variable Y exceeds the theoretical variance of the binomial distribution. This phenomenon, known as extra-binomial variation (overdispersion), can lead to loss of efficiency of the estimates and underestimation of the variance, which in turn can generate incorrect inferences about the regression parameters or the credible intervals (Collet, 1991; Cox, 1983; Williams, 1982).
There are several approaches to the study of overdispersed binomial datasets. Hinde and Demétrio (1998) categorized the majority of overdispersed binomial models in two classes: (1) those in which a more general form of the variance function is assumed by adding parameters; and (2) those in which the parameter of the distribution of the response variable is itself assumed to be a random variable. In the first class, the double exponential family of distributions yields double binomial models, which include a second parameter that controls the variance of the response variable independently of the mean and can be modeled as a function of a subset of explanatory variables (Efron, 1986). In the second class, the beta binomial distribution results from assuming that the response variable follows a binomial distribution whose probability parameter follows a beta distribution. Based on the parameterization of the beta distribution in terms of its mean and dispersion parameter (Jorgensen, 1997), a parameterization of the beta binomial distribution in terms of its mean and dispersion parameters is presented in Cepeda-Cuervo and Cifuentes-Amado (2017).
Despite the versatility of the beta distribution, Hahn (2008) proposed the rectangular beta distribution as a combination of the beta distribution and the uniform distribution, in order to admit heavier tails than those admitted by the beta distribution. Subsequently, Hahn and López Martín (2015) introduced the tilted beta distribution, which has the beta rectangular and the beta distributions as particular cases.
In this article, we generalize the beta binomial regression models for fitting overdispersed binomial count datasets (Cepeda-Cuervo and Cifuentes-Amado, 2017) by introducing the tilted beta binomial linear regression model. For this, the tilted beta binomial probability function is defined by assuming that the probability parameter of the binomial distribution follows the mean-parameterized tilted beta distribution. In addition, the beta rectangular binomial models are presented as particular cases of the proposed model, obtained by assuming that the probability parameter of the binomial distribution follows a beta rectangular distribution. The proposed models are fitted using Bayesian methods. Finally, to illustrate the use of the tilted beta binomial model, we fit it to a seed germination count dataset and compare it with the rectangular beta binomial model and the binomial model using their DIC values.
This paper is organized as follows. After the introduction, in Section 2, the tilted and the reparameterized tilted beta distributions are presented. In Section 3, the tilted beta binomial distribution is introduced and the rectangular beta binomial distribution is presented as a particular case. In Section 4, the tilted beta binomial linear regression model is defined. Finally, in Section 5, we analyze how the proportion of seeds that germinated on each of 21 dishes is influenced by the type of seed and root, by fitting a tilted beta binomial linear regression model using the OpenBUGS software. The performance of the proposed model is compared with that of the binomial and beta binomial regression models.

The Tilted Beta Distribution
In different fields there is often a need to model continuous random variables that assume values in a bounded interval as a function of a set of explanatory variables. Cepeda-Cuervo (2001) proposed the beta regression models, where the mean and dispersion parameters follow regression structures (see also Cepeda and Gamerman (2005), Cepeda-Cuervo and Garrido (2015)). If the continuous variable Y assumes values in a bounded open interval (a, b), a beta regression model can be proposed using the basic transformation (y − a)/(b − a). However, in order to admit heavier tails than is possible with the beta distribution, Hahn (2008) proposed the rectangular beta distribution as a new distribution that, like the beta distribution, has the open interval (0, 1) as its domain. The rectangular beta distribution is a convex combination of the beta distribution and the uniform distribution U(0, 1). Subsequently, Hahn and López Martín (2015) proposed the tilted beta distribution, a mixture of the beta distribution and the tilted distribution, which has the beta rectangular distribution and the beta distribution as particular cases. This section presents a reparameterization of the tilted beta distribution proposed by Hahn and López Martín (2015), in terms of the mean and the dispersion parameters of the beta distribution, µ_b and φ, respectively, and the mean of the tilted distribution, µ_t. The (µ_t, µ_b, φ, θ)-tilted beta binomial distribution then results from assuming that the probability parameter of the binomial distribution follows the reparameterized tilted beta distribution.

The Tilted Distribution
A random variable Y follows a tilted distribution with parameter ν (Hahn and López Martín, 2015) if its density is given by equation (1). In terms of the mean µ_t, the density function is given by equation (2), where 1/3 ≤ µ_t ≤ 2/3. The moments E_t(Y^n) and the variance V_t(Y) of a random variable Y with density function (2) follow directly from this expression.
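As a worked check of the statements above, the following sketch derives the mean-parameterized density, its moments and its variance under one assumption: that the tilted density is a straight line on (0, 1). The coefficients c_0 and c_1 are symbols introduced here for illustration only; they are fixed by requiring that the density integrates to one and has mean µ_t.

```latex
% Assumption: the tilted density is linear on (0,1), f_t(y) = c_0 + c_1 y.
% Normalisation: c_0 + c_1/2 = 1.   Mean: c_0/2 + c_1/3 = \mu_t.
\begin{align*}
f_t(y \mid \mu_t) &= (4 - 6\mu_t) + (12\mu_t - 6)\,y, \qquad 0 < y < 1,\\
E_t(Y^n) &= \int_0^1 y^n f_t(y \mid \mu_t)\,dy
          = \frac{4 - 6\mu_t}{n+1} + \frac{12\mu_t - 6}{n+2},\\
V_t(Y) &= E_t(Y^2) - \mu_t^2 = \mu_t - \mu_t^2 - \frac{1}{6}.
\end{align*}
% f_t is non-negative on (0,1) exactly when f_t(0) >= 0 and f_t(1) >= 0,
% that is, when 1/3 <= \mu_t <= 2/3; \mu_t = 1/2 gives the uniform density
% with variance 1/12.
```

Under this assumption the bound 1/3 ≤ µ_t ≤ 2/3 is simply the condition for the line to stay non-negative at both endpoints of the interval.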

Reparameterized Tilted Beta Distribution
The tilted beta distribution was introduced by Hahn and López Martín (2015) as the convex combination of the tilted distribution and the beta distribution. When it is built from the mean-parameterized tilted distribution (2) and the beta distribution parameterized by its mean and dispersion, Beta(µ_b, φ), its density function is given by equation (4), together with the corresponding mean and variance. The rectangular beta distribution (Hahn, 2008) is the particular case of (4) with µ_t = 0.5, for which the slope of the tilted distribution is zero. Replacing µ_t = 0.5 in (4) gives the density function of the rectangular beta distribution, equation (8).
Let Y | p ∼ Bin(m, p) be a random variable that follows the binomial distribution, where p follows the tilted beta distribution, p ∼ BI(µ_t, µ_b, φ, θ). Then Y follows a tilted beta binomial distribution with parameters µ_t, µ_b, φ and θ, denoted by Y ∼ BIB(µ_t, µ_b, φ, θ).
The probability function of this distribution is given by equation (9), where B(·, ·) denotes the beta function and f_BB(µ_b, φ)(·) denotes the probability function of the beta binomial distribution, parameterized in terms of the mean and the dispersion parameters.
The behavior of the (µ_t, µ_b, φ, θ)-tilted beta binomial probability function is illustrated in the accompanying figure. The mean and variance of a random variable Y that follows the (µ_t, µ_b, φ, θ)-tilted beta binomial probability function are expressed in terms of µ_b and V_b, the mean and variance of the beta distribution, and µ_t and V_t, the mean and variance of the tilted distribution.
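The probability function, mean and variance described above can be sketched as follows. The sketch rests on three assumptions that should be checked against the original equations: (i) the tilted density is linear on (0, 1) with mean µ_t; (ii) θ is the weight of the tilted component and 1 − θ the weight of the beta component; (iii) Beta(µ_b, φ) stands for the beta distribution with shape parameters µ_b φ and (1 − µ_b) φ. The symbol µ_p is introduced here for the mean of p.

```latex
% Sketch under assumptions (i)-(iii) stated in the lead-in.
\begin{align*}
P(Y = y) &= \theta \binom{m}{y}\Big[(4 - 6\mu_t)\,B(y+1,\,m-y+1)
            + (12\mu_t - 6)\,B(y+2,\,m-y+1)\Big]
            + (1-\theta)\, f_{BB(\mu_b,\phi)}(y),\\
f_{BB(\mu_b,\phi)}(y) &= \binom{m}{y}
   \frac{B\big(y + \mu_b\phi,\; m - y + (1-\mu_b)\phi\big)}
        {B\big(\mu_b\phi,\; (1-\mu_b)\phi\big)}, \qquad y = 0, 1, \dots, m,\\
\mu_p &= \theta\mu_t + (1-\theta)\mu_b,\\
E(Y) &= m\,\mu_p,\\
Var(Y) &= m\,\mu_p(1-\mu_p)
        + m(m-1)\Big[\theta V_t + (1-\theta)V_b
        + \theta(1-\theta)(\mu_t - \mu_b)^2\Big],\\
V_t &= \mu_t - \mu_t^2 - \tfrac{1}{6}, \qquad
V_b = \frac{\mu_b(1-\mu_b)}{\phi + 1}.
\end{align*}
```

The variance expression comes from the law of total variance, Var(Y) = E[Var(Y | p)] + Var(E[Y | p]), and the term m(m − 1) Var(p) is the source of the extra-binomial variation.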

(µ_b, φ, θ)-Beta Rectangular Binomial Distribution
Let Y | p ∼ Bin(m, p) be a random variable that follows the binomial distribution, where p follows the beta rectangular distribution (8). Then Y follows the (µ_b, φ, θ)-beta rectangular binomial distribution. Its probability function can be obtained as a particular case of the tilted beta binomial distribution (9) by replacing µ_t with 0.5. Likewise, setting µ_t = 0.5 in the equations of the mean (6) and variance (7) of the tilted beta binomial distribution gives the mean and variance of the beta rectangular binomial distribution.
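The reduction to the beta rectangular binomial case can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the tilted density is linear on (0, 1) with mean µ_t, that θ weights the tilted component, and that Beta(µ_b, φ) has shape parameters µ_b φ and (1 − µ_b) φ. All function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, m, mu_b, phi):
    """Beta binomial pmf; assumes shapes a = mu_b*phi, b = (1 - mu_b)*phi."""
    a, b = mu_b * phi, (1.0 - mu_b) * phi
    return comb(m, y) * exp(log_beta(y + a, m - y + b) - log_beta(a, b))


def tilted_binomial_pmf(y, m, mu_t):
    """Binomial pmf averaged over the assumed linear density c0 + c1*p."""
    c0, c1 = 4.0 - 6.0 * mu_t, 12.0 * mu_t - 6.0
    flat = 1.0 / (m + 1)                    # C(m, y) * B(y + 1, m - y + 1)
    slope = (y + 1) / ((m + 1) * (m + 2))   # C(m, y) * B(y + 2, m - y + 1)
    return c0 * flat + c1 * slope


def tbb_pmf(y, m, mu_t, mu_b, phi, theta):
    """Tilted beta binomial pmf; theta is assumed to weight the tilted part."""
    return (theta * tilted_binomial_pmf(y, m, mu_t)
            + (1.0 - theta) * beta_binomial_pmf(y, m, mu_b, phi))


def tbb_mean_var(m, mu_t, mu_b, phi, theta):
    """Closed-form mean and variance obtained from the law of total variance."""
    v_t = mu_t - mu_t ** 2 - 1.0 / 6.0
    v_b = mu_b * (1.0 - mu_b) / (phi + 1.0)
    mean_p = theta * mu_t + (1.0 - theta) * mu_b
    var_p = (theta * v_t + (1.0 - theta) * v_b
             + theta * (1.0 - theta) * (mu_t - mu_b) ** 2)
    return m * mean_p, m * mean_p * (1.0 - mean_p) + m * (m - 1) * var_p


# With mu_t = 0.5 the tilted part becomes uniform, so the pmf reduces to
# theta / (m + 1) + (1 - theta) * beta binomial pmf.
m = 10
rectangular = [tbb_pmf(y, m, 0.5, 0.3, 5.0, 0.2) for y in range(m + 1)]
```

With µ_t = 0.5 the linear coefficient vanishes and the first component places probability 1/(m + 1) on every count from 0 to m, which is what makes the beta rectangular binomial model heavier in both tails than the beta binomial model.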
In order to define the Bayesian tilted beta binomial regression model, the following prior distributions are assumed for β, γ, δ and µ_t:

Seed Germination Regression Models
The dataset analyzed in this section is available in Spiegelhalter et al. (2003) and corresponds to the number of seeds that germinated out of an initial quantity placed on each of 21 dishes, organized according to a 2 by 2 factorial design (2 seed types and 2 root types). These data were initially reported by Crowder (1978). The variables involved in the experiment are described below:
• y: number of seeds germinated in each dish.
• n: number of seeds initially arranged in each dish.
where i = 1, ..., 21. The TBB(µ_t, µ_b, φ, θ) model was fitted to the data using OpenBUGS, a free program for Bayesian regression based on the Gibbs sampling algorithm (Spiegelhalter et al., 2003). The posterior parameter inferences, obtained from a sample of size 100000 after discarding the first 10000 iterations as burn-in and keeping one draw every 10 iterations to reduce autocorrelation, are summarized in Table 1.
Table 1: Posterior parameter estimates of the TBB(µ_t, µ_b, φ, θ) model

Chain convergence in the tilted beta binomial model
In the parameter estimation process, three chains were generated from different starting values. In all chains, the autocorrelation is close to zero for lags greater than or equal to 10, with a burn-in larger than 10000 iterations.
To check the convergence of the chains, two convergence diagnostics were applied: the Geweke diagnostic (Geweke, 1992) and the Brooks-Gelman-Rubin convergence diagnostic (Brooks and Gelman, 1998). The Geweke-Brooks plot for the chains of the regression parameters is shown in Figure 3, where the value of the Z statistic is plotted against the number of iterations in order to determine the burn-in of the chains. This figure shows that the statistic remains within the acceptance zone even for a burn-in period equal to zero. The Brooks-Gelman-Rubin diagnostic compares within-chain and between-chain variances through the estimation of the scale reduction factor R. Values of R well above 1 indicate that the chains have not converged. Figure 4 shows that, for the regression parameters of this example, the R factor is very close to 1 after 1000 iterations.
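The comparison of within-chain and between-chain variances can be sketched as follows. This is a minimal illustration of the basic scale reduction factor, without the sampling-variability correction of Brooks and Gelman (1998) and without the interval-based version plotted by OpenBUGS. The function name and the simulated chains are hypothetical and are not the chains of this paper.

```python
import random
from statistics import fmean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [fmean(c) for c in chains]
    within = fmean([variance(c) for c in chains])    # W: mean within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Three simulated chains drawn from the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Three simulated chains stuck around different values (not converged).
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 5.0, 10.0)]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

When the chains explore the same distribution, the between-chain term contributes almost nothing and R stays near 1; when the chains sit in different regions, the pooled variance is much larger than the within-chain variance and R moves well above 1.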

Model comparison
In order to assess the performance of the proposed model, the following models were also fitted to the seed germination dataset: binomial Bin(n, p), beta binomial BB(µ, φ) and beta rectangular binomial BRB(µ, φ, θ). The deviance and the deviance information criterion (DIC) for each of these models are given in Table 2, which shows that the lowest mean deviance and the lowest DIC values correspond to the tilted beta binomial and the beta rectangular binomial models. Of these two, the tilted beta binomial model has the lowest DIC and is therefore the preferred model.
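For reference, the DIC can be computed from the sampled deviances as the posterior mean deviance plus the effective number of parameters. The following is a minimal sketch; the function name is hypothetical and the numbers in the usage line are made up for illustration, not taken from Table 2.

```python
def dic_summary(deviance_draws, deviance_at_posterior_mean):
    """Return (mean deviance, effective number of parameters pD, DIC)."""
    d_bar = sum(deviance_draws) / len(deviance_draws)   # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean            # pD = Dbar - D(theta_bar)
    return d_bar, p_d, d_bar + p_d                      # DIC = Dbar + pD


# Hypothetical deviance draws and plug-in deviance, for illustration only.
d_bar, p_d, dic = dic_summary([100.0, 102.0, 104.0], 99.0)
```

A lower DIC indicates a better trade-off between fit, measured by the mean deviance, and complexity, measured by pD, which is why the models are ranked by this value.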

Conclusion
In this paper two new distributions are proposed: the tilted beta binomial distribution and the beta rectangular binomial distribution. From these distributions, assuming that their parameters follow regression structures, new overdispersion regression models for count data are proposed: the tilted beta binomial regression model and the beta rectangular binomial regression model. These models are fitted using Bayesian methods, and in the application, show better performance than the beta binomial regression models for statistical analysis of the seed germination dataset.
Because the tilted beta distribution is flexible and can assign greater likelihood than the beta distribution to extreme tail-area events, it can accommodate different relative likelihoods of high versus low extreme tail-area events.
Thus, the proposed tilted beta binomial regression model, which defines a more general overdispersion regression model than the beta binomial regression model, allows for count events with high or low likelihood of occurrence and provides better estimation of the regression parameters, credible (or confidence) intervals and statistical inferences in the analysis of overdispersed binomial-type data.