Zero Inflated Poisson and Geographically Weighted Zero- Inflated Poisson Regression Model: Application to Elephantiasis (Filariasis) Counts Data

Corresponding Author: Purhadi Department of Statistics, Faculty of MIPA, Sepuluh Nopember Institute of Technology Jl. Arief Rahman Hakim, Surabaya 60111 Indonesia Email: purhadi@statistika.its.ac.id Abstract: Poisson regression has been widely used for modeling counts data. Violation of equidispersion assumption can occur when there are excess of zeros of the data. For that condition we can use Zero-Inflated Poisson (ZIP) to analyze such data, resulting global parameter estimates. However spatial data from various locations have their own characteristics depend on their socio-cultural, geographical and economic conditions. In this paper, we first review the theoretical framework of Zero-Inflated Poisson (ZIP) and Geographically Weighted Zero Inflated Poisson (GWZIP) regression. We use Maximum Likelihood (MLE) method and EM algorithm to estimate the model parameters. The F test is used to compare the two models. Second, we fit these models to the number of filariasis case of East Java. In our case, there is the preponderance of zeros in the data set (65.79%). The results prove that the spatial dependence is absent, but there is weak spatial heterogeneity of the data (significance level α = 0.1). Based on F test, ZIP and GWZIP regression are not significantly different.


Introduction
Poisson regression is the well known method for modelling counts data. However this method assumes the equidispersion of the data (Bohning et al., 1999). Unfortunately this assumption is often violated in the observed data because data are often overdispersed. Generally, two sources of overdispersion are determined: heterogeneity of the population and excess of zeros (Khoshgoftaar et al., 2004;Mouatassim and Ezzahid, 2012). When the source of overdispersion is the excess zeros, the Zero Inflated Poisson Regression model fits counts data well (Lambert, 1992;Mouatassim and Ezzahid, 2012). One of method to estimate parameters of ZIP Regression is Maximum Likelihood Estimation (MLE) method. The log likelihood function can be maximized using EM algorithm.
Many research themes of ZIP Regression have been developed. Some of them are Lestari (2008) modeled counts data of commercial sex worker at Clinic for human reproduction, Putat Jaya, Surabaya; Bohning et al. (1999) used ZIP to analyze counts data on prevention of dental caries in children; and Mouatassim and Ezzahid (2012) fitted models to the number of claims in a private health insurance scheme. Those research themes showed that ZIP Regression fitted counts data better than Poisson Regression.
ZIP Regression estimates the parameters globally. However, data from various locations show the different conditions of them. Those are influenced by different socio-cultural, geographical and economic between them. Those conditions indicate spatial factors. Until now, the research topics of ZIP Regression have not taken into account spatial factors yet. In this study we review ZIP Regression and its development, Geographically Weighted Zero-Inflated Poisson (GWZIP) Regression to analyze excess zero counts data by considering spatial factor.
As illustration we use East Java elephantiasis (filariasis) poisson counts data 2012, with the proportion of zero counts data is 65.79%. Filariasis is an infectious tropical disease. Someone can get it from a bite by an infected mosquito, like malaria, leprous and dengue (Wulandari et al., 2010). The counts of Filariasis in East Java can be affected by spatial heterogeneity.
Some research themes about filariasis have been developed. For more details, one can refer to (Wulandari et al., 2010;Nasrin, 2008;Juriastuti et al., 2010). Most of them used techniques and tests to find influenced significant factors to filariasis case without taking into account spatial factors.
Different from ZIP Regression to filariasis case before, in this study we develop ZIP Regression by taking into account spatial factors.

Poisson Regression
Poisson Regression is special case of Generalized Linear Model (GLM). Standart GLM for counts data is Poisson Regression model with log link function. Poisson Regression is given by the equation below (Myers et al., 1990;Greene, 2003;Cameron and Trivedi, 2005): ‫݅ݕ‬ is respon variabel that follows a Poisson distribution, ߤ݅ is the average of counts of events during a specified period.
The vector x T i = ‫݅ݔ[‬ ,1 , ‫݅ݔ‬ ,2 ‫݅ݔ,…,‬ ,݇ ] contains the covariates and ߚ ܶ = [ߚ 1 ,ߚ 2 ,…,ߚ݇] is the vector of unknown parameters. The number k defines the dimension of the covariates vector incorporated in the model. The link function is ln (ߤ݅). Where: Maximum likelihood techniques may be used to estimate the parameters of the Poisson regression, using Iteratively Reweighted Least Square (IRLS) method (Myers et al., 1990) or Newton-Raphson method (Cameron and Trivedi, 2005;Greene, 2003). Given the assumption that the observations (y i |x i ) are independent, the ln-likelihood function is given by: The null and alternative simultaneous parameters hypotheses are given below: And the deviance statistic is (Myers et al., 1990;Greene, 2003;Cameron and Trivedi, 2005): We reject H 0 when D ( ) β ⌢ > χ 2 (α,n−p) , where (n-p) is the number of degree of freedom. The deviance value declines when the number of parameters in the model increases (McCullagh and Nelder, 1989).
Test for partial parameters is written by following hypotheses: The statistic test is: We reject H 0 when ܼ݆ >ܼߙ/ 2 , var ( )
The Y݅ variable is redefined by latent variable Z݅, where Z݅~Binomial (1, ߨ݅). MLE method is used to estimate parameters by using Expectation-Maximization (EM) algorithm. This algorithm is iterative method to maximize likelihood function with missing data or latent variable. The function of ln likelihood is given by the equation: ln L β, γ | y,z = z x γ -ln 1+ e -1-z ln y ! + (1-z ) y x β -e ∑ ∑ ∑ Estimation of the parameters of ZIP Regression is carried out by two step, expectation and maximization in EM algorithm.
Simultaneous parameters hypotheses testing of ߚ and ߛ are given by: And deviance statistic is written below: We reject H 0 when ‫ܩ‬ > χ 2 (α,n−p) , where (n-p) is the number of parameters under population minus the number of parameters under H 0 true. By the same way, we carry out testing of each ߚ and ߛ. Then partial testing of parameters ߚ and ߛ are given below: The test statistic that is used for the hypotheses testing is deviance as written on Equation (7).

Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. Some authors have suggested a formal detection-tolerance or the Variance Inflation Factor (VIF) for multicollinearity. The formula of VIF is given below (Gujarati, 2004): R is the coefficient of determination of X j explanatory variable on all the other X variables. VIF above 10 indicates multicollinearity problem. The multicollinearity can be handled by dropping predictor variables that are highly correlated, increasing the number of sample, ridge regression analysis or Principal Component Analysis (PCA).

Spatial Effects
The problems of spatial data consist of spatial dependence and spatial heterogeneity. Spatial data modeling include spatial weight matrix which its elements are the function of Euclid distance between locations. Weight matrix is built by using kernel functions, one of them is Gauss function, that is: is Euclid distance between location i and location l, while h is optimum bandwidth that is gotten from minimum CV.
Spatial dependence indicates the dependence of observations between locations. The observation on one location influences the observation on other locations. We use Moran's index to test spatial dependence. The value of Moran's index is between -1 and 1. The formula is written below (Anselin, 1988): Moran's I test with the hypotheses H 0 : ߣ = 0 (dependence spatial is present) and H 1 : ߣ ≠ 0 (There is not spatial dependence), is carried out by using Z test: where, we reject H 0 when Z‫ܫ‬ > Zߙ 2 . Spatial heterogeneity effects can be identified by using Breusch-Pagan testing (Anselin, 1988). However BreuschPagan testing is sensitive to normality assumption. Because of that in this study we use Koenker-Basset test (Gujarati, 2004). The testing is carried out by regressing square error ( ) 2 i ε and square of the result of estimation. Then we test significance of the parameter. The hypotheses are H 0 : represents the absence of the spatial heterogeneity Vs H 1 : H 0 is not true. We test the hypothesis using Z test with criterion rejecting H 0 when ZI> Zߙ/ 2 or p-value < α.
The involvement of factor of geographical location in GWZIPR is expressed by a ‫,݅ݑ(‬ ‫)݅ݒ‬ coordinate. Geographical factor is the weight on GWZIPR model expressing local characteristic of parameters for each location.

Parameter Estimation of GWZIPR Model
By involving geographical factors on ZIPR and using MLE and Expectation-Maximization (EM) algorithm, the likelihood function and ln likelihood of GWZIPR model is given by the equation below: And: where, ‫݈݅ݓ‬ ‫,݅ݑ(‬ ‫)݅ݒ‬ is the weight for location ݅; ݅=1,…,݊.
Likelihood function on (15) is named incomplete likelihood because the first term is not known whether ‫݅ݕ‬ = 0 comes from zero state or poisson state. Because of that Yi is redefined by using Z݅ latent variable.
Simultaneous parameter hypotheses testing of GWZIPR model is given below: The parameters testing of β ‫,݅ݑ(‬ ‫)݅ݒ‬ and γ(‫,݅ݑ‬ ‫:)݅ݒ‬ We reject H 0 when G > χ 2 (α, p-q), where p is the number of parameters under population and q is the number of parameters under H 0 true.

Application
Filariasis or elephantiasis is an infectious disease caused by filaria worm and transmited by mosquitos bites. There are three species of worm can cause filariasis disease, those are Wuchereria bancroft, Brugiatimori and Brugiamalayi. Filariasis can be infected by all of species of mosquito like Anopheles, Aedes, etc.
Preventive efforts of filariasis case are conducted by health counseling activities, physical construction of a healthy house (DKRI, 2006), spraying, using wire netting, mosquito nets, mosquito coils and profilaksis.
Moreover some handling efforts of filariasis are reporting to the local health department, protecting the patients from mosquito bites, finding the sources of infection, special treatment and controlling vector (mosquito) in the endemic area (Mardesni, 2006).
In this study, we fit ZIP and GWZIP regression, the method that we develop to filariasis counts data, by taking into account spatial factors. The data contains information about filariasis counts and factors which are thought likely to influence the filariasis case.
The response variable is the number of filariasis per regency in East Java.
The covariate matrix contains the variables assosiated with health (including the age). Those variables are written at Table 1. Notice Y: The case of filariasis. X 1 : The percentage of households having healthy lifestyle behavior (Uloli et al., 2008), X 2 : The percentage of households having healthy outhouse (Rahayu, 2005), X 3 : The percentage of households having healthy trash can (Rahayu, 2005), X 4 : The percentage of households having healthy wastewater management (Soedarto, 1990). X 5 : The percentage of the residents 20-39 years of age (Wulandari et al., 2010;Juriastuti et al., 2010), X 6 : The percentage of healthy counseling activities (Nasrin, 2008;Juriastuti et al., 2010). u i and v i represent altitude and longitude

Descriptive Analysis of Filariasis Counts Data
Based on descriptive analysis as shown in Table 2, elephantiasis is rare case at regencies in East Java, minimum zero case and maximum only 4 cases. There are more than 50% in average of households in study area have healthy life facilities (outhouse, trash can, wastewater management). There is a tiny percentage (average 0.305%) of residents 20-39 years of age, minimum 0.26% and maximum 0.37%. The healthy counseling activities are rarely conducted, only 1.217% in average, minimum 0% and maximum 4.05%. It means there are regencies which not conduct healthy counseling activities at all.
Identification of multicollinearity is presented in Table 3. From the table, all VIF values are less than 10. This means there is no multicollinearity between predictor variables.

ZIP Model to Filariasis Counts Data
We estimate the parameters of filariasis counts data using R program. The results are presented in Table 4. Table 4 shows that by using significance level α = 0.05, there is one variable which is significant to ln model (parameterβ) and there is not variable significant to logit model (parameterγ). So we can state that the ZIP model is appropriate to model filariasis counts data.

GWZIP Model to Filariasis Counts Data
Spatial factors on the GWZIPR model are identified by testing spatial effect, dependence and heterogeneity. We use Moran's I to test spatial dependence. By using the significance level (α = 0.1), the result shows that there is not spatial dependence effect (p-value 0.7405 >α). The testing by using Koenker-Basse test indicates the existence of weak spatial heterogeneity, p-value 0.0933 < α. Then we involve spatial factor in the model by carrying out GWZIPR model.
In this study, the coordinate of altitude and longitude represent geographical factor of regencies/towns. We use euclid distance to measure the distance between regencies/towns. We get optimum bandwidth by choosing minimum CV. Then Euclid distance and optimum bandwidth are used into kernel function to obtain spatial weight matrix. We use Gauss Kernel function in this step. Table 5 is an example containing euclid distances and the weights on GWZIPR model of Pacitan Regency (Regency 1).
We use R program to estimate parameters of GWZIPR model using EM algorithm. The summary of parameter estimates of all regencies/towns is given in the Table 6. Table 6 shows that parameter estimates 5 β and 5 γ have high standard error. They are different from the standard error of other parameters, as the effect, interval confidence of those parameters are very width. This is associated with the result of testing of the parameter of each regency/town.
The result of the hypothesis testing of equality between GWZIPR and ZIP regression model shows that there is no significant difference between two models, F = 0.2935< F (0.1;14,24) =1.7974, we cannot reject H 0 .
Then the result of the simultaneous parameter hypothesis testing of GWZIPR model shows that G=278.5349 > χ 2 (12) =18.549, so we reject H 0 . It means GWZIPR model is appropriate to model East Java filariasis counts data. Base on those results, we can use ZIP or GWZIPR to build the model of East Java filariasis counts data. We carry out partial testing of the GWZIPR model to know which variables influence model significantly. The result shows that all variables except the percentage of the residents 20-39 years of age (X 5 ) are significant in all regencies/towns. This is in accordance with the summary Table 6 showing standard error of parameter estimates 5 β and 5 γ are large on all regencies/towns. An example of local parameter estimates of GWNBR model, Pacitan Regency is given in the Table 7.
By comparing Z (0.05) =1.64486 and ‫|݁ݑ݈ܽݒ‪|Z‬‬ from Table  7 we can conclude that five variables (X 1 , X 2 , X 3 , X 4 and X 6 ) are significant and the model for Pacitan regency can be written below: ( ) Equation (24) describes that if percentage of households having healthy lifestyle behavior (X 1 ) increase 1%, then it will increase the average filariasis counts 0.0045 and vice versa. By the same way we can interpret other variables. Unfortunately those are contrary to reality that the increasing of percentage of households having healthy lifestyle behavior (X 1 ), trash can (X 3 ) and wastewater management (X 4 ) should reduce filariasis case counts, not conversely. Those cases are probably caused by small observation (only 38 observations) and there is weak spatial heterogeneity of the data (significance level α = 0.1).
Logit model (parameter γ on the Table 7) shows that probability of no filariasis case (y i = 0) in each regency/town is influenced by X 1 , X 2 , X 3 , X 4 and X 6 variables. Based on both models in Equation (24) and (25) and Table 7, predictor variables which influence poisson state and zero state are the same.

Concluding Remarks
In this study, we have introduced two regression models for counts data: Zero-Inflated Poisson Regression (ZIPR) and Geographically Weighted Zero-Inflated Poisson Regression (GWZIPR). Maximum likelihood techniques are used to estimate the parameters of both models. The EM algorithm is used to maximize the likelihood by using Newton Raphson method. Moran's I and Koenker-Basset test are used to know the presence of spatial dependence and heterogeneity. Euclid distance and Gauss Kernel Function is used to obtain spatial weight matrix. Comparing ZIPR and GWZIPR are tested by using F Test.
The tests have proved the spatial independence of the number of East Java filariasis cases 2012. However, there is weak spatial heterogeneity of the data (significance level α = 0.1). In such data we have shown that ZIP and GWZIP regression models are not significantly different (based on F test). Nevertheless, the two models have different significant variables. The significant variable of ZIP regression models is X 6 . Then X 1 , X 2 , X 3 , X 4 and X 6 variables are significant on GWZIP Regression Model. Which one should we use? In such situation, probably it is a good idea if we choose the model that can give logical interpretation. In this condition the model of ZIP Regression gives more logical meaning than GWZIP Regression.