Comparison of Probit and Logistic Regression Models in the Analysis of Dichotomous Outcomes

Corresponding Author: Amrutha Jose, National Institute of Mental Health and Neurosciences (NIMHANS), Bengaluru, India. Email: amruthaannjose@gmail.com

Abstract: Probit and logistic regression models are members of the family of generalized linear models and are used for estimating the functional relationship between a dichotomous dependent variable and a set of independent variables. The current study was designed to assess the performance of logistic and probit regression models under different conditions assuming multivariate normality. The objective of the study was to compare the performance of probit and logistic regression models under multivariate normality. A Monte Carlo simulation study was conducted in which artificial datasets were generated under multivariate normality. Datasets were generated by employing the latent variable approach under different variance-covariance matrices, varying sample sizes and prevalences. For each of the combinations, 1000 simulations were carried out. Probit and logistic regression analyses were performed and compared using parameter estimates, standard errors, the Likelihood Ratio test, RMSEs, null and residual deviances, different pseudo-R² measures, AIC, BIC and Correct Percent Prediction. A live dataset was also used to compare the efficiency of the models. It was evident from the AIC, BIC and RMSE values that the logit and probit models fit the dataset equally well in all combinations of sample size, correlation structure and proportion of outcome. However, sensitivity, specificity and CPP values showed that the logit model predicts the outcome better than the probit model in most situations. The results showed that the probit and logit models perform equally well under multivariate normality.


Introduction
Most of the outcome variables in biomedical research are categorical in nature. Many times continuous variables are categorized, a process known as discretization. Even though studies indicate that dichotomization of variables is a bad practice and results in loss of statistical power (Altman and Royston, 2006; Naggara et al., 2011), it is quite common. For example, the Hamilton Rating Scale for Depression (HRSD) contains 17 items to be rated (Hamilton, 1960). It is used to diagnose whether a subject has depression. The items in the questionnaire are scored on 3-point to 5-point scales. The subject is considered normal if the HRSD score is between 0 and 7, and there is evidence of depression otherwise (Hamilton, 1960). In such a scenario, where the outcome variable is dichotomous, binary logistic and probit regression models are the most frequently used statistical methods for predicting the outcome variable based on a set of independent variables (Chai and Draxler, 2014).
The underlying dependent variable, Y*, in binary logit and probit models can be defined as:

Y*_i = β_0 + β_1 x_1i + ... + β_K x_Ki + ε_i

where K is the number of parameters involved in the model. In practice, Y* is an unobserved variable ranging from -∞ to +∞, which makes the observed Y a dichotomous outcome variable (Cakmakyapan and Goktas, 2013; Alsoruji et al., 2018).
Logit and probit models are members of the family of Generalized Linear Models (GLM) and are commonly used to predict a categorical dependent variable based on a number of covariates or independent variables using the link functions logit and probit, respectively. Both models have been addressed in the literature and are used for the same purpose. However, studies evaluating the efficiency of the models under different conditions, such as varying sample sizes and different proportions (prevalences) of the outcome with a range of correlations between dependent and independent variables, are seldom available; such studies would provide evidence for proposing appropriate methods under various situations.
In this paper, an effort is made to summarize the methodology of the binary logit model, the binary probit model, fit indices for both models with an application to a live data situation. A Monte Carlo simulation study was also conducted in which artificial datasets were generated under multivariate normality using a latent variable approach. Various conditions were imposed in terms of variance-covariance matrices assuming specific correlations between dependent and independent variables. Datasets were generated under different sample sizes as well as varying prevalences. For each of the combinations, 1000 simulations were carried out and the results of the simulation study were also documented.

Binary Outcome Models -Logit and Probit
Suppose that there is a variable y* ranging from -∞ to +∞ that generates the observed outcome variable, y. y* is assumed to be linearly related to the observed x's through the structural model:

y*_i = x_i β + ε_i

The variable y* is linked to the observed binary variable y by the equation:

y_i = 1 if y*_i > τ, and y_i = 0 if y*_i ≤ τ

where τ is the threshold or cut point value.
If y*_i crosses the threshold τ (i.e., y*_i > τ), then y = 1; otherwise y = 0. The link between y* and the observed y is illustrated in Figure 1 for the model y* = α + βx + ε.
In Figure 1, y* is taken along the vertical axis, with the threshold τ indicated by a horizontal dashed line. The distribution of y* is shown by the bell-shaped curves, which should be thought of as coming out of the figure into a third dimension (Long, 1997; Cakmakyapan and Goktas, 2013). Since the outcome variable is dichotomous, linear regression methods that use ordinary least squares for estimation cannot be applied. Maximum likelihood estimation, which assumes some specific distribution of the errors, should be used instead. The most commonly used statistical models for predicting a dichotomous outcome variable are the logit model, which assumes logistic errors, and the probit model, which assumes normal errors (Long, 1997).
In both probit and logit models, the measurement equation is Y = Xβ + ε. The logit model uses the standard logistic probability distribution function, and the event probability can be estimated as:

π(x) = Pr(y = 1 | x) = exp(xβ) / [1 + exp(xβ)]

The link function in the logit model is:

logit(π(x)) = ln[π(x) / (1 − π(x))] = xβ

The error distribution in the logit model is assumed to be a standard logistic distribution with mean E(ε|x) = 0 and variance Var(ε|x) = π²/3 ≈ 3.29 (Long, 1997). Similarly, the event probability in the probit model can be estimated as:

π(x) = Pr(y = 1 | x) = Φ(xβ)

where Φ is the cumulative standard normal distribution. The link function in the probit model is Φ⁻¹(π(x)) and the error distribution is assumed to be normally distributed with E(ε|x) = 0 and Var(ε|x) = 1.
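As an illustration, the two event probabilities can be computed side by side; the sketch below uses only the standard library and reflects the variance note above (the specific η values are illustrative):

```python
import math

def logit_prob(eta):
    # standard logistic CDF: P(y=1|x) = exp(eta) / (1 + exp(eta))
    return 1.0 / (1.0 + math.exp(-eta))

def probit_prob(eta):
    # standard normal CDF, written via the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# Both links give 0.5 at eta = 0. Because the logistic error variance is
# pi^2/3 ~ 3.29 while the probit error variance is 1, logit coefficients
# are roughly sqrt(3.29) ~ 1.8 times larger than probit coefficients for
# comparable fitted probabilities.
print(logit_prob(0.0), probit_prob(0.0))
print(logit_prob(1.8), probit_prob(1.0))  # close to each other
```

This rescaling is why the raw coefficients of the two models are not directly comparable, while their marginal effects and fitted probabilities usually are.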

Measures of Model Fit
Measures of model fit are used to compare competing models and to select a final model. In the linear regression model, the coefficient of determination (R²) is the standard measure of fit. When the outcome variable is dichotomous, there are numerous fit indices, such as the pseudo-R² measures, but none of them gives a clear interpretation in terms of explained variation (Long, 1997). Another way of selecting the final model is the Likelihood Ratio (LR) test, which tests whether all of the slope coefficients are simultaneously equal to zero. Information criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are also popular for comparing models, especially when the models are non-nested. Root Mean Square Errors (RMSEs), null and residual deviances, sensitivity, specificity and accuracy or Correct Percentage Prediction (CPP) values are also commonly used measures to compare binary outcome models.

Pseudo-R 2 Measures
In logistic regression, there are a number of pseudo-R² measures that are used to indicate how well a model fits a set of data. But there is no consensus on which measure is best. Also, different pseudo-R² measures have different values for the same model, and most of them are not bounded between 0 and 1. The pseudo-R² measures considered in the present study were as follows:

• R²_McF (McFadden's pseudo-R²): defined as R²_McF = 1 − ln L(M_β)/ln L(M_α), where M_β is the fitted model and M_α is the intercept-only model. Its adjusted version, R²_McF,adj = 1 − [ln L(M_β) − k]/ln L(M_α), where k is the number of predictors in the model, penalizes additional predictors.

• R²_ML (Maximum Likelihood pseudo-R² or Cox-Snell R²): also called the geometric mean squared improvement per observation, defined as R²_ML = 1 − [L(M_α)/L(M_β)]^(2/N).

• R²_C&U (Cragg and Uhler's pseudo-R² or Nagelkerke R²): as the fit of M_β approaches M_α, R²_ML approaches 0. Maddala (1986) showed that R²_ML reaches a maximum of only 1 − [L(M_α)]^(2/N), so Cragg and Uhler (1970) suggested the normed measure R²_C&U = R²_ML / (1 − [L(M_α)]^(2/N)), an adjustment to R²_ML that makes a maximum value of 1 possible.

These measures are defined in terms of the likelihood function and can be applied to any model estimated by ML (Long, 1997). All else being equal, models with a larger value of the log-likelihood are preferred, and R²_McF provides a convenient way to compare log-likelihoods across different models (Long, 1997).
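Given the log-likelihoods of the fitted model and the intercept-only model, all of these pseudo-R² measures can be computed directly. A minimal sketch following the definitions above (standard library only; the log-likelihood values at the bottom are illustrative, not from the study):

```python
import math

def pseudo_r2(lnL_full, lnL_null, n, k):
    """Pseudo-R-squared measures from log-likelihoods.

    lnL_full: log-likelihood of the fitted model M_beta
    lnL_null: log-likelihood of the intercept-only model M_alpha
    n: sample size; k: number of predictors
    """
    r2_mcf = 1.0 - lnL_full / lnL_null                       # McFadden
    r2_mcf_adj = 1.0 - (lnL_full - k) / lnL_null             # adjusted McFadden
    r2_ml = 1.0 - math.exp(2.0 * (lnL_null - lnL_full) / n)  # Cox-Snell / ML
    r2_max = 1.0 - math.exp(2.0 * lnL_null / n)              # Maddala's maximum
    r2_cu = r2_ml / r2_max                                   # Nagelkerke / Cragg-Uhler
    return r2_mcf, r2_mcf_adj, r2_ml, r2_cu

# illustrative (hypothetical) values
mcf, mcf_adj, ml, cu = pseudo_r2(lnL_full=-80.0, lnL_null=-100.0, n=150, k=3)
```

Note that R²_C&U is always at least as large as R²_ML, since it divides by Maddala's maximum, which is below 1.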

Scaled Deviance
Scaled deviance, or simply deviance, is a fit index analogous to the residual sum of squares within the framework known as the generalized linear model. The deviance compares a given model to the full model M_F. The full model has one parameter for each observation and can reproduce the observed data perfectly. Since the observed data are perfectly predicted, the likelihood of M_F is 1 and its log-likelihood is 0 (Long, 1997).
To test whether M_F significantly improves the fit over M_β, the deviance is defined as:

D(M_β) = −2[ln L(M_β) − ln L(M_F)] = −2 ln L(M_β)

since ln L(M_F) = 0. The deviance function is very useful for comparing two models when one model's parameters are a subset of the second model's. The null deviance shows how well the response variable is predicted by a model that includes only the intercept, whereas the residual deviance shows how well the response variable is predicted by the model with the independent variables included.

Likelihood Ratio (LR) Test
The LR test is used to test the hypothesis that all of the slope coefficients are simultaneously equal to zero; its test statistic is defined as:

LR = 2[ln L(M_β) − ln L(M_α)]

Under H_0, LR follows a Chi-square distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models (Long, 1997).
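The LR statistic and its p-value follow directly from the two log-likelihoods; a small sketch using scipy's chi-square survival function (the log-likelihood values here are illustrative):

```python
from scipy.stats import chi2

def lr_test(lnL_null, lnL_full, df):
    # LR = 2[lnL(M_beta) - lnL(M_alpha)], referred to a chi-square with
    # df = difference in the number of parameters between the two models
    stat = 2.0 * (lnL_full - lnL_null)
    p_value = chi2.sf(stat, df)
    return stat, p_value

stat, p = lr_test(lnL_null=-100.0, lnL_full=-80.0, df=3)  # stat = 40.0
```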

AIC
Another way to assess the performance of a model is by calculating fit indices such as AIC and BIC. AIC is a way of selecting a model from a set of models and is defined as:

AIC = −2 ln L(M_β) + 2k

where L(M_β) is the likelihood of the model and k is the number of parameters in the model. The term −2 ln L(M_β) ranges from 0 to +∞, with smaller values indicating a better fit. As the number of parameters in the model increases, −2 ln L(M_β) becomes smaller, since more parameters make what is observed more likely. Hence, the constant 2k is added to −2 ln L(M_β) as a penalty. A smaller AIC value indicates a better fit; i.e., the chosen model is the one that minimizes the Kullback-Leibler distance between the model and the truth (Burnham and Anderson, 2003). AIC is often used to compare models across different samples or to compare non-nested models that cannot be compared with the LR test.

Corrected AIC (AICc)
AIC can severely violate the principle of parsimony in extreme circumstances. The failure of AIC to select an adequately parsimonious model can be a problem when the number of parameters in the model under consideration is more than (roughly) 30% of the sample size. It could be argued that a good model selection criterion should work even if the researcher tries a 'bad' (e.g., over-parameterized) model. In order to overcome this deficiency, Hurvich and Tsai (1989) introduced a corrected version of AIC, which includes a correction for small sample sizes:

AICc = AIC + 2k(k + 1) / (n − k − 1)

where AIC is the standard AIC, k is the number of parameters in the model and n is the number of observations (Hurvich and Tsai, 1989).

BIC and sBIC
BIC is also called the Schwarz Information Criterion (SIC) or Schwarz BIC. It was proposed by Raftery (1996) as a measure to assess the overall fit of a model and to allow comparison of both nested and non-nested models (Raftery, 1996). BIC can be defined as:

BIC = −2 ln L(M_β) + k·ln(n)

where k is the number of regressors and n is the sample size. The sample size adjusted BIC (sBIC) replaces n in the penalty with (n + 2)/24:

sBIC = −2 ln L(M_β) + k·ln[(n + 2)/24]

which can be used as a better measure to evaluate model fit between models. Similar to AIC, a smaller BIC value indicates a better fit, and the difference in the BICs from two models indicates which model is more likely to have generated the observed data.
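All four information criteria reduce to one-line functions of the fitted log-likelihood. A sketch with illustrative inputs; the sBIC line uses the (n + 2)/24 rescaling described above, which should be read as one common convention for the sample-size adjusted BIC rather than the only one:

```python
import math

def aic(lnL, k):
    return -2.0 * lnL + 2.0 * k

def aicc(lnL, k, n):
    # small-sample correction of Hurvich and Tsai (1989)
    return aic(lnL, k) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(lnL, k, n):
    return -2.0 * lnL + k * math.log(n)

def sbic(lnL, k, n):
    # sample-size adjusted BIC: ln(n) replaced by ln((n + 2)/24)
    return -2.0 * lnL + k * math.log((n + 2) / 24.0)

# illustrative values: lnL = -80, k = 3 parameters, n = 150 observations
print(aic(-80.0, 3), aicc(-80.0, 3, 150), bic(-80.0, 3, 150), sbic(-80.0, 3, 150))
```

For n above about e² ≈ 7.4 the BIC penalty k·ln(n) exceeds AIC's 2k, so BIC favors smaller models than AIC as the sample grows.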

Root Mean Square Error (RMSE)
The RMSE is frequently used as a standard statistical metric to measure model performance by calculating the differences between predicted and observed values. These individual differences are also called residuals, and the RMSE serves to aggregate them into a single measure of predictive power. It can be defined as:

RMSE = √[(1/n) Σ (y_i − ŷ_i)²]

where y_i and ŷ_i are the observed and predicted values, respectively. That is, RMSE is the square root of the mean squared error, which makes it an excellent general-purpose error metric for numerical predictions.
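The definition translates directly into code (standard library only; the observed/predicted values below are illustrative):

```python
import math

def rmse(observed, predicted):
    # square root of the mean of squared residuals (y_i - yhat_i)
    n = len(observed)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) / n)

# e.g., binary outcomes against predicted probabilities
print(rmse([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.4]))
```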

Sensitivity, Specificity and Correct Percent Prediction (CPP/Count R 2 )
In order to check model adequacy, various evaluation measures based on the estimated probabilities, such as accuracy, precision (positive predictive value), recall (sensitivity), specificity and negative predictive value, can be used. Each of these measures captures an aspect of the model's effectiveness in assigning observations to the correct category.
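These measures all derive from the 2×2 cross-classification of predicted versus observed outcomes. A sketch with a hypothetical 0.5 probability cut-off and illustrative data:

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    # classify each observation using the probability cut-off
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    cpp = (tp + tn) / len(y_true)         # accuracy / Correct Percent Prediction
    return sensitivity, specificity, cpp

sens, spec, cpp = classification_metrics([1, 1, 0, 0, 1], [0.9, 0.4, 0.2, 0.6, 0.7])
```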

AUC
A Receiver Operating Characteristic (ROC) curve is a standard technique for indicating the performance of a classification model by plotting sensitivity against 1 − specificity over many possible thresholds. A higher Area Under the ROC Curve (AUC) indicates better discriminatory power of the model. The AUC is also referred to as the Index of Accuracy or the Concordance Index.
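The concordance interpretation of the AUC — the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case, with ties counted as half — gives a direct way to compute it without plotting the curve. A sketch with illustrative data:

```python
def auc(y_true, y_prob):
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    # compare every case against every non-case; ties count as half a win
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```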

Residuals
There are many types of residuals such as ordinary residual, Pearson residual and Studentized residual. They all reflect the differences between fitted and observed values and are the basis of varieties of diagnostic methods (Zhang, 2016). The default residual for the generalized linear model is Pearson residual.

Live dataset
In order to compare the logit and probit models, data from the archives of the Department of Biostatistics, NIMHANS, Bengaluru were used. The study is a descriptive study involving 175 adolescents pursuing high school (8th to 10th standards) or pre-university courses (I and II PUC) in schools and colleges in Bengaluru. A two-stage sampling procedure was used. In the first stage, schools/colleges were randomly selected from the list of schools/colleges that gave permission for the study. From the selected schools and colleges, classes were chosen randomly from the different sections available, and the students were chosen by cluster sampling: each randomly chosen class was included as a whole, and those students who gave assent and whose parents gave consent finally formed the sample. The inclusion criteria were: (1) students pursuing high school or pre-university courses, (2) students having knowledge of English and (3) students of Indian nationality. The questionnaires were administered in groups, one whole class at a time.

Description of Tools
Socio-demographic datasheet: It included information about the socio-demographic characteristics, such as age, gender, birth date, place of living, family type, siblings, education, stream (for PUC students) and help sought with any mental health professional.
Difficulties in Emotion Regulation Scale (DERS; Gratz and Roemer, 2004): The DERS is a brief 36-item questionnaire consisting of 6 subscales. Two subscales, viz. impulsivity and level of emotional awareness, were used for the study.
Cognitive Emotion Regulation Questionnaire (CERQ): The CERQ is a 36-item questionnaire consisting of nine subscales (Garnefski et al., 2002). CERQ-positive contains positive refocusing, planning, positive reappraisal and putting into perspective, while CERQ-negative contains self-blame, other-blame, rumination and catastrophizing. The CERQ has good factorial validity and high reliability and can be used for both adolescents and young adults (anyone above the age of 12 years).
Bar-On Emotional Quotient Inventory: Youth Version: The scale has been considered one of the best measures of emotional competence among youth (Bar-On and Parker, 2000). It provides a combined social-emotional and personality attribute assessment of self-reported emotion-related functioning. The five subscales are interpersonal, intrapersonal, stress management, adaptability and general mood.
Distress Tolerance Scale (DTS): This scale measures the participants' perceived ability to experience and endure negative emotional states. The scale consists of 4 subdomains, namely, appraisal, tolerance, absorption and regulation (Sinha et al., 2007). Higher scores on the DTS indicate greater positive affect and less affective distress and lability.
Interpersonal Reactivity Index: This is a 28-item, five-point Likert scale with response categories ranging from "not at all like me" to "very much like me". The scale comprises 4 subscales, namely, (a) empathic concern, (b) perspective-taking, (c) personal distress and (d) fantasy/imaginal involvement; the last two subscales were not of concern to the current study.
Self-Efficacy Questionnaire for Children: This scale contains 24 items representative of 3 domains, namely: (a) academic self-efficacy, (b) emotional self-efficacy and (c) social self-efficacy (Muris, 2001). It is a 5-point Likert scale with responses ranging from "not at all" to "very well". The scale is considered a psychometrically sound measure of self-efficacy among adolescents.
Scale for Assessing Academic Stress (SAAS): This is a 30-item self-report measure developed for students of grades 8 to 12 in English-medium schools, belonging to middle and higher socioeconomic status (Sinha et al., 2007). It assesses five major indicators of academic stress (cognitive, affective, physical, social/interpersonal and motivational), which finally yield a total stress score.

Statistical Analysis
The self-efficacy score among the study subjects was normally distributed. It was categorized into a dichotomous variable by considering the probability density function of the variable as well as by seeking expert opinion. Stepwise logistic and probit regression analyses were conducted to predict self-efficacy among the 175 adolescents using type of education (high school/PUC), appraisal, Distress Tolerance Scale (DTS) score, empathic concern, perspective-taking, school stress (in terms of cognitive, affective, social and motivational stress), CERQ-positive and emotional intelligence as predictors, all of which contributed significantly to the prediction of self-efficacy in the univariate models.
The LR test and the pseudo-R² measures also showed more preference for the logit model than the probit model in the current dataset. The marginal effects were also computed and were almost similar in both models, indicating an equal fit of both models to the current dataset.

Simulation Study
A computer simulation is an attempt to establish a real or theoretical situation on a computer, which can be utilized to study how the system functions and to verify the models and assumptions in the data. By changing various parameters in the simulation, predictions may be made about the characteristics of the system by recapturing the parameters used to generate the dataset. Hence, a simulation study was designed to determine whether there exists any precedence of, or difference between, logistic and probit regression models under certain situations. The objective of the study was to compare the performance of probit and logistic regression models under multivariate normality.

Methodology
The study was a simulation study in which the data were generated artificially under the assumption of multivariate normality. The models used in the study are generalized linear models, namely binary logistic and probit regression models. The efficiency of both models was tested under various situations: different correlation structures between the dependent and independent variables, varying sample sizes and different proportions/prevalences of the outcome. For this, artificial datasets of one dependent and three independent variables were randomly generated from a multivariate normal distribution with mean vector µ = (12, 10, 0, 0).
Various correlation structures of High (R²_ols = 0.90), Moderate (R²_ols = 0.40), Low (R²_ols = 0.15) and No correlation (R²_ols = 0.01) were imposed in terms of different variance-covariance matrices assuming specific bivariate correlations between the dependent and independent variables. One of the variance-covariance matrices used to generate the specified correlation structures was:

    | 25.0   9.0   2.5   3.5 |
    |  9.0  16.0   0.0   0.0 |
    |  2.5   0.0   1.0   0.0 |
    |  3.5   0.0   0.0   1.0 |

These matrices were determined in such a way that they were positive definite and the correlations between the independent variables were zero, in order to avoid multicollinearity in the datasets. The covariance values between the independent and dependent variables were chosen to produce the pre-specified correlations among them. The covariances between the variables would be identical to their correlations if the data were generated under the multivariate standard normal distribution (Cakmakyapan and Goktas, 2013; Alsoruji et al., 2018).
After the data generation, two of the independent variables were converted into categorical variables using an appropriate cut-off value (p = 0.4; i.e., the 40% of subjects with high values on the variable were classified into the exposed group, keeping in mind that the baseline distribution is normal). The dependent variable was then transformed into a dichotomous variable for 17 different proportions of the outcome variable, i.e., prevalences of 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% and 90%, assuming that the underlying distribution is normal. In order to examine the effect of sample size, the data were generated under 10 different sample sizes: 20, 30, 50, 100, 150, 200, 500, 1000, 2000 and 5000. Thus, a total of 680 different combinations were considered in the study, and a summary of the data generation process is given in Table 2. The data generation process was repeated 1000 times for each of the combinations, which resulted in a total of 680,000 datasets.
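The generation scheme described above can be sketched as follows for a single combination. The mean vector and the variance-covariance matrix are taken from the text; the seed, sample size of 500 and 30% prevalence are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([12.0, 10.0, 0.0, 0.0])           # (Y*, X1, X2, X3)
sigma = np.array([[25.0,  9.0, 2.5, 3.5],
                  [ 9.0, 16.0, 0.0, 0.0],
                  [ 2.5,  0.0, 1.0, 0.0],
                  [ 3.5,  0.0, 0.0, 1.0]])       # positive definite
n, prevalence = 500, 0.30

data = rng.multivariate_normal(mu, sigma, size=n)
y_star, x1, x2, x3 = data.T

# latent variable approach: dichotomize Y* so ~30% of subjects are cases
y = (y_star > np.quantile(y_star, 1.0 - prevalence)).astype(int)

# two covariates dichotomized at p = 0.4 (top 40% classified as exposed)
x2_bin = (x2 > np.quantile(x2, 0.6)).astype(int)
x3_bin = (x3 > np.quantile(x3, 0.6)).astype(int)
```

Repeating this over the grid of correlation structures, prevalences and sample sizes, 1000 times per combination, reproduces the design described above.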
Probit and logistic regression analyses were performed and the parameter, standard error and probability estimates were obtained. In order to compare the models, different estimation and significance-testing procedures were carried out. The performances of the models were computed and compared using the AIC, corrected AIC (AICc), BIC and sBIC statistics, along with the Likelihood Ratio (LR) test. The null and residual deviances and the different pseudo-R² measures (R²_McF, R²_ML and R²_C&U) were computed and the models were compared. In addition, the RMSEs of estimation as well as the Correct Percentage Prediction (CPP) were used to compare the performance of the regression methods. Student's t-tests were used to test whether there is any statistically significant difference between the logit and probit models in terms of goodness-of-fit under various situations. Also, R²_ols values were calculated from linear regression models of the dependent variable on the independent variables, in order to cross-check that the pre-specified correlation structure existed in the dataset.

Simulation Study Results

Figure 2 gives the AIC values of the probit and logit models. The results of all 680 combinations are available. In order to accommodate the information in the table, only 9 of the 17 prevalence measures (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%) and 6 of the 10 sample sizes (50, 100, 150, 200, 500 and 1000) are displayed. Both probit and logit models showed higher AIC values when the prevalence was 0.5, decreasing towards both lower and higher prevalences. For a higher prevalence rate with a smaller sample size, the logit model fitted better than the probit model, even though the underlying distribution is normal. This scenario can be seen with all the correlation structures.
It can be seen from Figure 2 that as the degree of correlation between the DV and IVs increases, the AIC values show a decreasing trend, indicating a better fit for models with higher correlation levels. However, it is evident that both models fitted equally well in all situations. Figure 3 gives an idea of the performance of the models based on sample size in terms of AICc values. It is evident from the study results that both models perform equally well with respect to varying sample sizes. The estimates of the models, i.e., the intercept (B_0) and the regression coefficients (B_1, B_2 and B_3), are compared in Tables 3 to 6, respectively. As the logit and probit models are not directly comparable in terms of their original estimates, the marginal estimates of the coefficients were computed and are displayed in Tables 7 to 10. The study results from Tables 11 and 12 showed that for smaller sample sizes, prediction of the outcome in terms of both sensitivity and specificity is higher in the logit model compared to the probit model, even though the underlying distribution is normal. Also, it is clear that the logit model fits better in most situations, especially when the prevalence is closer to 50%. It is evident from the results in Table 13 that the logit model has higher overall prediction (in terms of accuracy) than the probit model in most situations (see Tables 3 to 13).
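The core comparison in the simulation — fitting both models by maximum likelihood on the same generated data and contrasting their AIC values — can be sketched as below. The data, seed and coefficients are illustrative, and the study's own implementation may differ; the point is only that both links are fit to identical data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
# probit-type latent data: y = 1 when the latent variable crosses 0
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

def negloglik(beta, link):
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta)) if link == "logit" else norm.cdf(eta)
    p = np.clip(p, 1e-10, 1.0 - 1e-10)   # guard the log against 0 and 1
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

aics = {}
for link in ("logit", "probit"):
    fit = minimize(negloglik, x0=np.zeros(2), args=(link,), method="BFGS")
    aics[link] = 2.0 * fit.fun + 2 * 2   # AIC = -2 lnL + 2k, with k = 2 here

print(aics)  # the two AIC values are typically very close
```

Repeating this over many replications and tabulating the fit indices mirrors the comparison summarized in Figures 2 and 3.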

Conclusion
The logit and probit models were considered in the study under various conditions of varying correlation structures, different prevalences and sample sizes. The models were compared using the indices AIC/AICc, BIC, sensitivity, specificity, correct prediction percentage, RMSEs, null deviance, residual deviance, LR tests, etc. The AIC/AICc, BIC, LR test and pseudo-R² measures showed that both models fitted equally well in all situations. Clinicians, however, are more interested in the correct prediction of the outcome among the study participants than in any of the model fit indices. From that point of view, based on sensitivity, specificity, correct prediction percentage and RMSEs, it can be concluded that the logit model fits better than the probit model in most situations, even though the underlying distribution is normal.

Strengths and Limitations of the Study
Even though simulation studies are available in the literature, this is the first study to consider this many combinations in the simulation in order to assess the efficiency of the models: 680 different combinations of simulation parameters in terms of correlation, sample size and prevalence of the outcome were considered in the present study. In comparison with other similar studies available in the literature, a live dataset was also considered and the models were compared on it in the current study. Above all, in order to compare the performance of the models, more measures were estimated than in other studies, in terms of fit indices, prediction, discrimination and model fitting.
The inferences made in this study are only based on the data in which the underlying distribution is normal. Further studies have to be conducted for getting concrete evidence about which model fits better under different situations such as various underlying skewed distributions namely lognormal distribution, exponential distribution, etc. and also by introducing different types and levels of skewness and kurtosis.

Financial Support and Sponsorship
This research received no specific grant from any funding agency.

Authors' Contributions
Amrutha Jose: Study conception and design, generation of simulated datasets, analysis and interpretation of data, writing and drafting the manuscript.
Mariyamma Philip: Study design, Generation of simulated datasets, analysis and interpretation of data, helped to evaluate and edit the manuscript, drafting of the manuscript and Critical revision.
Lavanya Tumkur Prasanna: Acquisition of Real life data, Helped to evaluate and edit the manuscript and Critical revision.
M. Manjula: Acquisition of real life data, Interpretation of data, helped to evaluate and edit the manuscript, drafting of the manuscript and critical revision.

Conflicts of Interest
There are no conflicts of interest.

Supplementary Material
Supplementary material includes Tables 3 to 21.