Review of Zero-Inflated Models with Missing Data

Corresponding Author: Chin-Shang Li Division of Biostatistics, Department of Public Health Sciences, University of California, Davis, USA Email: csli2003@gmail.com Abstract: The literature of count regression models covers a large scope of studies and applications that implemented simple and standard models for count response variables by using Poisson regression models, binomial regression models, negative binomial regression models, geometric regression models, or generalized Poisson regression models. These regression models have received considerable attention in various situations. Nevertheless in many fields, the distribution of the count response variable may display a feature of excess zeros for which the aforementioned regression models may fail to provide an adequate fit. To remedy this handicap, a class of distributions known as zero-inflated models is considered as the most appropriate approach for dealing properly with this issue of excess zeros. In addition to the zero-inflated problem, it happens quite often that the sample data sets under investigation are not completely observed. This refers to the missing data problem. In this study, our primary interest is in reviewing studies that considered simultaneously the missing data problem and the zero-inflated feature in modeling zero-inflated data. Moreover, we discuss their methodologies and results and some potential directions of the future research.


Introduction to Zero-Inflated Data
A regression model fit is generally a statistical methodology that helps estimate the strength and direction of the relationship between two or more variables. It is considered as one of the most powerful and popular tools used for making important decisions or investigating some assertions in many statistical studies and across various domains of science. Similarly, a regression approach to count data is one of the most important statistical techniques, which plays a big role in decision making and investigation. This simple but powerful tool may become frustrating even misleading if less sufficient attention is paid to some important aspects of statistical modeling, such as the assumptions of models, the specific features or patterns displayed by the data set, or the presence of missing data. Many computer software programs have made the implementations of estimation of regression models, e.g., for count data, easier than before, but there is still a high chance to obtain a bad fit, especially when fewer attention is paid to the underlining assumptions of models or the complexity of the data. For instance, the presence of excess zeros in the response count variable requires some precautions prior to proceeding with a model fit.
In this review work two main issues are of great concern, including the presence of a Zero-Inflated (ZI) feature in response data and the presence of the missing response or covariate data. Note that count data cover a considerable portion of data in statistical inference and they arise from various fields to include the social sciences, medicine and industry among others. For instance, for the number of new friends added on a user's facebook account a week, the number of customers' mails a day that a business company receives regarding goods lost or damaged, or the number of doctor and hospital visits occurring throughout the weekday or weekend; regular Poisson regression models, negative binomial regression models or generalized Poisson regression models may be appropriate to fit this kind of data. Among these standard regression models, Poisson regression models are the most popular tool used to fit count data (Cameron and Trivedi, 2013) because of their simplicity in application and interpretation of the results. Notwithstanding those advantages, a regular Poisson regression model cannot capture the ZI feature because it has only one parameter that is its mean. In the presence of the ZI feature, fitting a regular Poisson regression model has a tendency to overstate the significance level or underestimate the standard errors of the estimators of the model parameters. Consequently, inference based on the regular Poisson regression model fit is misleading and not credible in this situation. On the other hand, the negative binomial regression models (Cameron and Trivedi, 2013) and the generalized Poisson regression models (Consul and Famoye, 1992) are mostly seen as the backup solutions in case the regular Poisson regression model fit is not adequate. Contrary to the regular Poisson regression model, the negative binomial regression models and the generalized Poisson regression models and the generalized Poisson regression models have an extra parameter that can capture an additional effect, such as the ZI feature. Nevertheless, in many analyses of count data with excess zeros, these two regressions models may fail to adequately fit the data under study. In this case, ZI regression models or other mixture regression models (Mullahy, 1986) are better options (Allison, 2012). Ridout et al. (1998) and Ismail and Jemain (2007) provided a comprehensive introduction to the class of the ZI regression models. ZI models provide a wide and intensive area of research (Tu and Liu, 2016). Interestingly, the Scopus search engine developed by Elsevier reveals that in the last ten years. ZI models have been mentioned over 1,410 times as titles, abstracts or keywords among all articles. Compared to standard regression models, ZI models are considered to be more advanced methods and are required in order to account properly for the feature of excess zeros. For instance, there are Zero-Inflated Poisson (ZIP) models, Zero-Inflated Negative Binomial (ZINB) models, Zero-Inflated Binomial (ZIB) models and Zero-Inflated Generalized Poisson (ZIGP) models. Other models closely related to ZI models are hurdle models (Mullahy, 1986) and two-part models (Heilbron, 1994).
In general, a ZI model can be thought of as a mixture distribution of two components, including a count distribution, such as Poisson, binomial, negative binomial, or geometric and the degenerated distribution at zero. These ZI regression models differ from others in terms of the nature of the count distribution used for the probability mass function as given in expression (1). The ZI feature is generated by both sources (processes), including the count distribution component (random zeros) and the component of excess zeros (structural zeros). To the best of our knowledge, among the most used ZI models, the ZIP regression models proposed by Lambert (1992) are the most used in many applications.
Besides that, the ZINB regression models (Ridout et al., 2001), ZIB regression models (Hall, 2000), Zero-Inflated Geometric (ZIG) regression models (Nagesh et al., 2015) and ZIGP regression models (Famoye and Singh, 2006) have been proposed in some situations to account for the feature of excess zeros, where a ZIP regression model could not fit the data well. Note that the ZIB regression models and the ZIG regression models have received very little attention compared to the most used ZI models. Besides using expression (1) as a generic form, the zero-inflated power series regression models as given in Gupta et al. (1995) can be seen as another form of presenting the count data with excess zeros (see Section 3.4). Up to this day, different orientations have been taken under the ZI models and many interesting results are found in the literature. But most of these works have left aside the potential question of missing data.
Besides the issue of excess zeros in count data, another important issue that has been addressed in the ZI data analysis literature is the missing data problem. ZI data are very active in many statistical studies or applications in practice. Therefore, the response count variable or some covariates involved in a regression model are likely to have missing data. There are many reasons behind the missing data appearance. Some missings are intentionally created for technical or confidential reasons, while others are due to happenstance. In these past decades, many researchers have proved that missing data were not avoidable in statistical studies; see, e.g., Little and Rubin (2002) and Schafer and Graham (2002). Consequently, the problem of missing response or covariate data attracts great attention. Any failures in addressing properly the presence of missing data while analyzing a ZI data set possibly yield inaccurate estimates. Little (1992) pointed out that the missing process and the missing pattern needed to be well understood in order to apply appropriate methods in response to missing data. Therefore, methods summarized in Table 1 are very useful in dealing with missing data. Due to the importance of these matters, we review only those works that simultaneously studied the ZI feature and the missing data problem. We introduce briefly the ZI model framework and some important concepts related to missing data in Section 2. Section 3 presents only the most popular ZI models and their related missing data treatments. A conclusion is given in Section

Zero-Inflated Distributions
Prior to describing some popular ZI models and their applications, we first define a generic form for all ZI models. Let Y be a count response variable. The probability mass function of a ZI distribution can then be expressed as follows: Here p ∈ [0, 1] is a mixing weight for the accommodation of extra zeros. f(y; η, d) represents a regular count distribution; for instance, Poisson distribution, binomial distribution, geometric distribution and negative binomial distribution. In general, f(y;η, d) possesses two parameters η and d, where η and d represent its expected value and dispersion parameter, respectively. In practice p is linked to a set of covariates (χ 1 ) via a logit-linear predictor such that p = H(u) = H(β T χ 1 ), where H(u) = [1+exp(−u)] −1 , whereas η is linked to another set of covariates (χ 2 ) via a log-linear predictor η = exp(γ T χ 2 ) for unbounded count data. In many applications, the parameter d is neither modeled as a function of χ 1 nor χ 2 . Naturally, χ 1 and χ 2 do not have to be identical. For instance, Lambert (1992) assumed that χ 1 ≠ χ 2 , whereas Lukusa et al. (2016) assumed that χ 1 = χ 2 = χ, where χ = (1, X T ,Z T ) T for X and Z being vectors of categorical or continuous covariates. A special case is when p is a constant not depending on covariates (Li, 2011). To have a comprehensive review, we define p = H(β T χ 1 ) and η = exp(γ T χ 2 ). Other appropriate linear predictors can be used to model p; for instance, the probit-linear predictor given by p = Φ(β T χ 1 ) can be used instead of p = H(β T χ 1 ), where Φ is the cumulative distribution function of the standard normal distribution.
One of the most interesting features about the ZI models is that they are related to each other based on the behaviors of parameters p, d and η in expression (1). For instance, when d → ∞, the zero-inflated negative binomial distribution reduces to a zero-inflated Poisson distribution. When d = 0, the zero-inflated generalized Poisson distribution reduces to a zero-inflated Poisson distribution. But when p = 0, the zero-inflated negative binomial distribution, the zero-inflated generalized Poisson distribution and the zero-inflated Poisson distribution reduce to the negative binomial distribution, the generalized Poisson distribution and the Poisson distribution, respectively. Various relations can be established for the entire family of ZI distributions. The ZI regression models aim at estimating the unknown parameter vector θ = (β T , γ T , d) T by means of different optimization techniques, such as Newton-Raphson method and expectation-maximization algorithm (Dempster et al., 1977).

Some Important Concepts of Missing Data
Missing data are described as various codes indicating lack of response (Schafer and Graham, 2002).
Missing data are generally caused by technical problems or designs. But in some specific cases, e.g., privacy, missing data are deliberately created. The missing data should not be overlooked without a specific reason. Before applying any appropriate methods to deal with missing data, as listed in the taxonomy (Little, 1992), a data set needs to be described by means of descriptive statistics in order to obtain the information related to the missing data. If it is revealed that there are missing data, then the first important step should be to understand the missing patterns and the missing mechanisms. Let n be the sample size, Y the non-negative count outcome variable and X and Z covariate vectors, where Z is always observed. Assume that X is partially observed and W is a surrogate variable able to provide enough information about the missing variable. To account for missingness, an indicator variable, δ = 1 if X is observed and δ = 0 otherwise, is included. Similarly, the idea of X having missing can be extended to a situation where the response variable Y is incomplete. For the sake of illustration, when X is missing at random, the basic data structure is as follows: where n v denotes the number of validation data. The data set structure is often arranged in arrays, which is allowed to visualize clearly the different patterns of missing values. There are three main missing patterns (Rubin, 1976), including (i) the univariate pattern where missing data occur only on a single item (single variable) or group of variables of the same nature, while others are completely observed, (ii) the monotone pattern where missing values on items can be arranged in an increasing proportion from items with least missing values to items with more missing values and (iii) the general pattern where missing values scattered everywhere. Compared with the general pattern, the univariate and monotone patterns are not hard to handle in practice.
Let V = (Z,W) and the data set The missing mechanism plays an important role in dealing with missing data problems. Rubin (1976) distinguished among three processes of missing mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). Under the MCAR, the selection probability is expressed as P(δ = 1|Y,X, V) = π. When missing is MAR, the selection probability is expressed as P(δ = 1|Y,X, V) = π(Y, V). Note that the MCAR and the MAR mechanisms are ignorable missing mechanisms. Under the MNAR, the selection probability is given by P(δ = 1|Y,X, V) = π(Y,X, V). The MNAR mechanism is a nonignorable missing mechanism. Note that in survey studies, clinical studies or other statistical techniques for data collection, it is difficult to distinguish between the MAR and MNAR, even if the MAR is from the MCAR unless additional information is available. Therefore, it is important to understand clearly the whole process and circumstances during the data collection stage. Due to the importance of the missing mechanisms, the estimation of the selection probability has been of great concern. For instance, Rosenbaum and Rubin (1983) and Robins et al. (1994) proposed a parametric estimation method, whereas Wang et al. (1997) and Wang and Wang (2001) proposed a nonparametric estimation method. Many techniques can be applied to estimate the selection probability provided that the condition that the estimate of selection probability π∈ [0, 1] holds.

Taxonomy of Techniques for Handling Missing
Although some methods for dealing with missing data are seen as more powerful than others, they all have some limitations. This happens when the model assumptions in the presence of missing data are not well understood or when the proportion of missing increases considerably. Based on Little (1992), Pigot (2001) and Ibrahim et al. (2012), the most popular methods for handling the missing data are summarized in Table 1. Note that contents in Table 1 are more technical and general than specific.
The introduction of ZI models and the missing data problem help understand various orientations authors have taken regarding the zero excess and the missing data issues. Table 1 will serve as a guidance of methods potential to be applied under the ZI regression models. Next, we review the most popular ZI regression models and the missing data problem.

Zero-Inflated Negative Binomial Models
A ZINB distribution can be seen as a mixture of two distributions, including a Negative Binomial (NB) distribution and a degenerated distribution at zero (Ridout et al., 1998;2001). Therefore, the ZINB distribution can be derived from expression (1) such that the function f(y; η, d) is a NB distribution, expressed as follows: where η = µ and p and d are identical to those in expression (1). Then, the probability mass function of the ZINB distribution is expressed as follows: where Γ(·) is the gamma function. Note that when p = 0, the ZINB distribution reduces to a NB distribution and when p = 0 and d → ∞, the ZINB distribution reduces to a regular Poisson distribution.
The likelihood-based method can be used to obtain estimates of β, γ and d via the Expectation Maximization (EM) algorithm (Dempster et al., 1977).
The ZINB distribution has been used in some interesting studies; for instance, Preisser et al. (2012) provided a review of some ZI models for data of dental caries indices in epidemiology. There are other interesting works; nevertheless, they do not tackle the missing data issue. We now turn our attention to the missing data problem under the ZINB regression model framework.
To the best of our knowledge, neither the missing response nor the missing covariates has been fully explored under the ZINB regression model framework except for Chen and Fu (2011)

Zero-Inflated Generalized Poisson Models
Extended from the Generalized Poisson (GP) regression model developed by Consul and Famoye (1992), the ZIGP regression model (Famoye and Singh, 2006) is a competitor of the ZINB regression model. It has the flexibility to handle any inflation or deflation in count data. Alike to the ZINB distribution, the ZIGP distribution is a mixture of two distributions, including the GP distribution that can be represented as f(y, η, d) in expression (1), where η = µ and the degenerated distribution at zero. In the past decades, the ZIGP regression models received significant interest and attention due to its flexibility to handle some unusual features of count data. Consequently, different variants of the ZIGP regression models have been developed and applied. The probability mass function of the ZIGP distribution is expressed as follows: When d = 0, the ZIGP distribution reduces to the ZIP distribution. When p = 0, the ZIGP distribution reduces to the GP distribution. For more details about the likelihood function for the ZIGP model and its optimization, see, e.g., Famoye and Singh (2006) and Ismail and Zamani (2013). Among the ZI models, the ZIGP model is becoming more attractive to researchers and its scope takes various directions. For example, Gupta et al. (1995) developed the zero-inflated modified power series distributions that cover many distributions, e.g., the ZIGP regression model. Famoye and Singh (2006) applied the ZIGP regression model under the frequentist context to fit domestic violence data with excess zeros. They found that it converged in all situations when fitting the ZIGP regression model, whereas it converged only in some cases when fitting the ZINB regression model. That supported the view that for this kind of data, the ZINB and ZIP models could not provide an adequate fit. Angers and Biswas (2003) investigated the fit of ZIGP regression model under a Bayesian framework where they discussed the use of noninformative priors to obtain the posteriors and to compare the performance of the ZIGP model with that of the Poisson and ZIP models used for the fetal movement data. Regarding the missing data problem, researchers have not yet shown much interest in studying the missing data problem under the ZIGP model framework, despite the indication that there is a growing tendency of work related to ZIGP models. Thus, there are no ZIGP models with missing data work to study, unlike the ZIP model with missing data or the ZINB model with missing data that are both illustrated under the ZIPS model (Samani et al., 2012).

Zero-Inflated Poisson Models
Among the ZI models, the ZIP regression model (Lambert, 1992) is the most popular. The ZIP distribution can be thought of as a population that includes two latent groups of subjects: The non-susceptible group consisting of those who are not at risk of an event of interest and the susceptible group consisting of those who are at risk of the event and may have experienced the event several times during a specific time period (Dietz and Böhning, 1997). From expression (1), the ZIP distribution is a mixture distribution that includes the Poisson distribution denoted as f(y; λ), where λ = η and the degenerated function at zero (Singh, 1963;Johnson et al., 2005). Alternatively, if d → ∞, then f(y; λ, d) → f(y; λ), where f(y; λ) = e −λ λ y /y! and f(y; λ, d) is a NB distribution function as given in expression (2). The probability mass function of the ZIP distribution is then expressed as follows: where λ is the Poisson mean. The ZIP distribution reduces to a regular Poisson distribution when p = 0. The likelihood of the ZIP model can be expressed as follows:  Lambert (1992). The ZIP regression models have been further disseminated and used successfully by some authors, e.g., Böhning et al. (1999), Yau and Lee (2001), Cheung (2002), Lu et al. (2004). Hall and Shen (2010) proposed a robust expectation-solution estimation method for ZIP regression models to overcome the case where the Maximum Likelihood Estimator (MLE) is highly sensitive to the presence of outliers. In addition, Li (2011) proposed a semiparametric ZIP regression model that can be used to assess the lack of fit of a postulated parametric ZIP model. Jansakul and Hinde (2002) proposed a score test for a ZIP model against a Poisson model. Li (2012) proposed a score test for a semiparametric ZIP regression model versus a semiparametric Poisson regression model. Similar to the ZINB and ZIGP models, researchers have not yet shown enough interest in exploring the missing data problem under the missing data framework. To the best of our knowledge, little has been done so far. At this level, we present the work of Lukusa et al. (2016).
By assuming that the missing covariates were MAR (Rubin, 1976), Lukusa et al. (2016) proposed a semiparametric Inverse Probability Weighting (IPW) estimator of a ZIP regression model in the spirit of Zhao and Lipsitz (1992) and Flander and Greenland (1991). The proposed estimating method was a Horvitz and Thompson (1952)-type weighted estimating method where the selection probability was π(Y, V) = P(δ = 1|Y,X, V). Following Wang et al. (1997) and Reilly and Pepe (1995), Lukusa et al. (2016)   S θ = E(S 1 (θ)|Y 1 , V 1 ). A simulation study was conducted to compare the semiparametric IPW estimator, true weight IPW estimator, the CC estimator and the MLE that was considered as the benchmark. Comparisons were made based on the average bias, standard deviation, standard error and the 95% coverage probability. Overall, the semiparametric IPW estimator was found to be asymptotically unbiased and more efficient than the CC estimator that was seriously biased and the true weight estimator π(Y, V) that had a bigger standard error. It means that even if π(Y, V) is known, which is not always the case, it should be substituted by ˆ( , ) in the estimating function using the true weight. Moreover, they illustrated the practical use of the proposed methodology with a data set from a survey study conducted in Taiwan in 2007 that consists of 7,386 respondents. The response count variable was the number of speed regulations that a motorcycle rider violated in a year (about 90% of motorcycle riders not violating speed regulations). Only the covariate related to the distance covered in kilometers had 15% of data missing while data of other covariates were fully observed. The analysis results overall showed that the performance of the semiparametric IPW estimator was very close to that of the CC estimator.
Besides the work of Lukusa et al. (2016), Pahel et al. (2011) used expression (1) to predict the missing dental caries data, particularly under the ZI regression model framework. Similar to Lambert (1992) and Lukusa et al. (2016), they used the ZIP regression model where process 1 and process 2 were generated by p = H(β T χ 1 ) and λ = exp(γ T χ 2 ), respectively, for χ 1 and χ 2 as sets of covariates. Similar to Lukusa et al. (2016), Pahel et al. (2011) assumed χ 1 = χ 2 = χ. In order to impute the missing dental caries data, they considered the complete case and some additional information. In Step 1, estimate ZIP model with non-missing caries data. In Step 2, generate predictions for levels of caries based on estimated coefficients. In addition, if the predicted probability in process 1 is less than say t, a uniform (0, 1) random variate, then the missing is filled in with 0 (meaning no dental caries); otherwise process 2 is used to fill in the missing values. The final result is a summary of all imputations based on the formula of Rubin (1987). To illustrate the performance of the proposed multiple imputation techniques for missing dental caries data, they used a real example where they compared the three model imputations: The Poisson model, the ZIP model and the ZIP model with random effects, respectively. Under the missing not MCAR for the response variable, they imputed the three models and computed their Akaike Information Criterion (AIC) value (Akaike, 1974). Although Pahel et al. (2011) and Böhning et al. (1999) studied the ZI data for the dental caries, the main difference is that Pahel et al. (2011) considered the missing data problem, whereas Böhning et al. (1999) focused on the problem of missing teeth.

Zero-Inflated Power Series Distributions
The ZIPS model is a two-component mixture model that consists of a Power Series (PS) distribution, such as Poisson, binomial, negative binomial and geometric distributions and a degenerated distribution at zero. The probability mass function of the PS distribution is expressed as follows: More interestingly, the ZIPS models include most of the ZI models except for the ZIGP models. Let y i be a realization of a random variable Y i that has a ZIPS distribution. Let χ 1i and χ 2i be covariate sets for the ith subject and define θ = (β T , γ T ) T a vector of parameters to be estimated. The likelihood function of the ZIPS model is then expressed as follows: where p i = H(β T χ 1i ) and λ i = exp(γ T χ 2i ), i = 1,..., n. The MLE of θ is obtained by optimizing the log-likelihood function ℓ(θ) = log L(θ). The ZIPS model framework seems to become more attractive for many researchers. For example, Bhattacharya et al. (2008) provided a general Bayesian setup to test for the ZI feature in a ZIPS distribution. Samani et al. (2012) used a likelihood-based approach. Under the ZIPS regression model framework, Samani et al. (2012) proposed the mixed Stochastic EM (SEM) and EM algorithms (M-SEM-EM algorithm) for parameter estimation in the likelihood-based approach. Unfortunately, so far, that was the unique work that addressed the missing data problem. We briefly present their approach. To capture the ZI feature, Samani et al. (2012) extended the idea of Samani (2011) known as the missing inflated power series distribution model. Assuming the response Y to be MNAR (Rubin, 1976), Samani et al. (2012) expressed the joint incomplete data model as follows: Here logit(p i ) = β T χ 1i . log(λ i ) = γ T χ 2i for χ 1i ≠ χ 2i . δ i is a binary missing indicator variable defined as δ i = 1 if y i is observed and δ i = 0 otherwise. χ 3i is another covariate vector. * 1 i Y and * 2 i Y are defined, respectively, as * To estimate θ, they maximized ℓ(θ, y) by a variant of the EM algorithm, known as a MSEM-EM algorithm. Furthermore, they computed the AIC value (Akaike, 1974) in order to compare different regression models under the ZIPS model framework. Overall, their simulation study results showed that the larger the sample, the better the estimate of θ. They applied the proposed methodology to the data set from the British Household Panel Survey regarding the number of visits to a hospital during the year by using the AIC to compare the Poisson, NB, ZIP and ZINB regression models. They found that the ZIP regression model was the best model because it had the smallest AIC value. They proceeded to fit the ZIP regression model and the results confirmed that it was MNAR because the number of visits showed a significant Poisson status on the probability of the nonresponse; Samani et al. (2012) for more details. Besides Samani et al. (2012), Chen and Fu (2011) proposed a parametric model selection when some covariates were MAR (Rubin, 1976). Considering the ZIPS distribution, they proposed a new model selection criterion in place of the classical AIC (Akaike, 1974) in order to account for the missing covariates that were assumed to be MAR. In effect, in the presence of missing data, the classical AIC misleads the conclusion of model selection. Interestingly, the proposed method can be implemented in the presence of missing data or without missing data. Chen and Fu (2011) showed that their developed method is a modified version of Monte Carlo EM (MCEM) algorithm that is based on the data augmentation scheme. We briefly present their idea. Let x = (x obs , x mis ) be the vector of covariates partially observed. In order to reduce the number of nuisance parameters that need to be estimated via the MCEM algorithm and allow a more convenient model specification for the distribution of covariates, following Ibrahim et al. (2005), Chen and Fu (2011) modeled the vector of covariates x i = (x obs,i , x mis,i ) by developing the following probability model: where α j is a vector of indexing parameters for the jth Under the assumption and excluding the missing data indicator from the model, the complete data probability function of subject i from the ZIPS regression model is given by P(y i , x i , r i |θ) ∝ P(y i , x i |θ) = P(y i |x i ,β,γ)P(x i |α) that leads to the complete data log-likelihood: where θ = (β T , γ T , α T ) T . In order to obtain the estimate of θ, Chen and Fu (2011) used the data augmentation techniques (Ghosh et al., 2006) and a modified version of the MCEM algorithm to maximize the log-likelihood ℓ ad (θ) that was obtained by including the latent variable into ℓ c (θ). Following Claeskens and Consentino (2008) who derived a version of AIC (Akaike, 1974) that is suitable for the situation of missing covariates, they proposed the new criterion for ZIPS regression models with missing covariates. See Chen and Fu (2011) for more details regarding their methodology. They conducted a simulation study to illustrate the application of the proposed method in selecting the best model among the four candidate models: Poisson, NB, ZIP and ZINB regression models. They illustrated the practical use of the proposed methodologies by using a real data set from the Female Consumer Lifestyle Study in which the whole data set was collected in six cities of China in 2003 on broad topics, such as lifestyle, the frequency of buying goods for slim and the average amount of purchases.

Hurdle Models
Closely related to the ZI models, the hurdle models were developed by Mullahy (1986) and were popularized by Cameron and Trivedi (2013) in order to deal with count data sets having more zero counts than allowed for by the Poisson and NB models. The difference between the hurdle and ZI count models is that the later can separately model the zero and non-zero counts. The hurdle models are two-component models: A hurdle component for zeros versus non-zeros and a truncated count component for positive counts. For the hurdle component, either a binomial model or a censored count distribution, such as a censored Poisson, geometric, or NB distribution, can be used to model zeros versus non-zeros. For a truncated count component, a Poisson, geometric or NB model can be used for positive counts. More specifically, the hurdle model combines a zero hurdle model P zero (Y = y) (right-censored at 1) and a count data model P count (Y = y) (left-truncated at 1), expressed as follows: For example, Mullahy (1986) used a Poisson distribution to model the zeros versus non-zeros and a zero-truncated Poisson distribution to model the positive counts. The hurdle models have been intensively applied in many studies. In the last decade, the hurdle models have been mentioned for about 2,000 times as titles or keywords; see the Scopus engine from Elsevier. Nevertheless, none of them addressed the missing data problem.

Multivariate Zero-Inflated Models
Besides the univariate ZI count model frame, it is possible to have several outcomes measured on each individual. For instance, Li et al. (1999) studied multivariate zero-inflated Poisson models in order to model outcomes of manufacturing processes producing numerous defect-free products while Wang (2003) studied the bivariate zero-inflated negative binomial models for bivariate count data with excess zeros and applied the proposed model to analyze the data of health-care utilization with a sample of 5,190 single-person households from the 1977-1978 Australian Health Survey. Yang et al. (2016) proposed a flexible MCEM algorithm for estimation of the Bivariate Zero-Inflated Poisson (BZIP) regression model when the response count is MAR. They applied the proposed methodology to a bivariate data set regarding the demand for health care in Australia. More details can be found in Yang et al. (2016).  Table 2 provides the existing references for ZIP models with missing data, the purpose of study and the methodology used to deal with missing data.

Discussion
Based on the missing mechanisms, Pahel et al. (2011) considered the missing in the count response variable as MAR. Samani et al. (2012) assumed the missing in the count response variable was MNAR. Chen and Fu (2011) and Lukusa et al. (2016) considered the missing in covariates as MAR. Yang et al. (2016) proposed a joint MAR mechanism for the bivariate count response variable. Regarding the purpose and methodology, Chen and Fu (2011) and Samani et al. (2012) considered the ZIPS regression model framework and implemented different variants of the EM algorithm to estimate the model parameters and to compute the AIC values for model selection in the presence of missing data. Lukusa et al. (2016) developed the semiparametric inverse probability weighting method for estimation of the ZIP regression model that used a nonparametric selection probability and showed that the proposed estimator had good asymptotic properties. In addition, they showed that their estimator was more efficient than the estimator that uses the true weight and the CC estimator that was seriously biased. Pahel et al. (2011) developed a multiple imputation method for missing dental caries data under the ZIP regression model. Column 2, the number represents the frequency of citations as title, keywords or abstracts. The number in bracket represents the frequency of the article titles related to the corresponding zero inflated regression model. −− means there is not yet an article of the corresponding zero-inflated model with missing data. MCEM, Monte Carlo expectation-maximization; M-SEM-EM, mixed stochastic expectation maximization and expectationmaximization.
Based on the non-missing caries data, they imputed missing caries data by using the Poisson model, ZIP model and ZIP model with random effects. Indeed, the ZIP model with random effects yielded the best result because this model accounted for the cluster effects. Yang et al. (2016) applied a straightforward likelihood approach to estimate parameters of a BZIP model when the bivariate count variable is missing at random. The EM algorithm and MCEM algorithm are the most used estimation methods in the literature of ZI regressions models with missing data. Moreover, in simulation studies from Chen and Fu (2011), the proportion of zeros generated is not clearly presented. Therefore, the ZI feature may not be clearly perceived. Chen and Fu (2011) and Samani et al. (2012) also generated the missing data where the missing rate was less than 25%. Regarding the work of Lukusa et al. (2016), their simulation study used only moderate and large samples. It can be interesting to see how the proposed method performed under small or moderate samples. Pahel et al. (2011) assumed the missing mechanism to be not MCAR. Nevertheless, it is not clear whether it was a MAR or MNAR mechanism. Yang et al. (2016), who fit a BZIP regression model, pointed out that in some cases, the estimator based on the CC method was closer to the MLE that was obtained by using the MCEM method. However, often under the MAR the CC estimate is expected to be biased. Further investigations are needed. In general, most of the methods developed in the literature of ZI models with missing data agree with the summary of the most used methods given in Table 1.

Conclusion
The missing data problem has been intensively studied from various angles in the regression model literature. Some studies investigated the missing data under specific distribution models or vice versa; particularly, we have reviewed the literature of ZI models with missing data. It is crystal clear that fewer works related to the ZI models dealt with the missing data problem and the ZI feature simultaneously. Chen and Fu (2011), Pahel et al. (2011), Samani et al. (2012 Lukusa et al. (2016) and Yang et al. (2016) seem to be the only appealing works; see Table 2. Surprisingly, the ZINB, ZIGP and hurdle regressions models are among the most used models for ZI count data. However, these three regression models with missing data, exclusively, have not yet been investigated. On the other hand, the ZIP, ZIPS and BZIP regression models have less than three works each on the missing data problems. Table 2 gives the whole picture of the ZI data literature in terms of the regression model appearance, the missing data mechanisms considered, the references and the methodology used to handle the missing data. We wish to inspire researchers to discover the research regarding ZI models with missing data. There are many extensions or future studies to be carried on. For instance, Chen and Fu (2011) and Samani et al. (2012) could include the asymptotic behavior of the proposed AIC. Lukusa et al. (2016) could assume a MNAR mechanism. Yang et al. (2016) might also consider the case where covariates in the BZIP regression model are MNAR. Finally, ZI data with missing values still have plenty of orientations yet to be investigated. ZI data are important in many studies and sectors of life. A relationship between Table 1 and 2 shows many potential studies that could be done. Thus with the information from Table 1 and 2, researchers are invited to come out with some comprehensive and intensive studies of ZI data with missing values.