Measures of Explained Variation and the Base-Rate Problem for Logistic Regression

Problem statement: Logistic regression, perhaps the most frequently used regression model after the General Linear Model (GLM), is extensively used in medical science to analyze prognostic factors in studies of dichotomous outcomes. Unlike the GLM, many different proposals have been made to measure the explained variation in logistic regression analysis. One of the limitations of these measures is their dependency on the incidence of the event of interest in the population. This is a clear disadvantage, especially when one seeks to compare the predictive ability of a set of prognostic factors in two subgroups of a population. Approach: The purpose of this article is to study the base-rate sensitivity of several R² measures that have been proposed for use in logistic regression. We compared the base-rate sensitivity of thirteen R²-type parametric and nonparametric statistics. Since a theoretical comparison was not possible, a simulation study was conducted for this purpose. We used results from an existing dataset to simulate populations with different base-rates. Logistic models were generated using the covariate values from the dataset. Results: We found nonparametric R² measures to be less sensitive to the base-rate than their parametric counterparts. Logistic regression is, however, a parametric tool, and use of a nonparametric R² may produce inconsistent results. Among the parametric R² measures, the likelihood ratio R² appears to be least dependent on the base-rate and has relatively superior interpretability as a measure of explained variation. Conclusion/Recommendations: Some potential measures of explained variation are identified which tolerate fluctuations in base-rate reasonably well and at the same time provide a good estimate of the explained variation on an underlying continuous variable. It would, however, be misleading to draw strong conclusions from this research alone.


INTRODUCTION
The search for R² analogs in logistic regression: Prediction of future outcomes based on a given set of covariates is a key component of regression analysis. In Ordinary Least Squares (OLS) regression analysis, the predictive accuracy of a linear model is often judged using the R² statistic. This statistic has several mathematically equivalent definitions and multiple interpretations, such as the proportion of variation in the dependent variable explained by the regressors, a measure of the strength of relationship between the covariate(s) and the response, and the squared correlation between the observed and the predicted response. This statistic is usually not used as a measure of goodness-of-fit, as other tools are better suited to that purpose (Hosmer et al., 2011). When the outcome variable is dichotomous, the logistic regression model is the most popular choice. In most instances, interest lies in determining how well the model predicts the probability of group membership with respect to the dependent variable. Unlike OLS regression, more than a dozen R² measures have been suggested for the logistic regression model (Mittlbock and Schemper, 1996; Menard, 2000; DeMaris, 2002; Liao and McGee, 2003), but the best form of R² is not yet clear. Mittlbock and Schemper (1996) reviewed 12 measures of explained variation for logistic regression, Menard (2000) six, and DeMaris (2002) seven, with some overlap. Other authors have proposed adjusted R² analogs (see, for example, Liao and McGee, 2003; Mittlbock and Schemper, 2002). Recommendations from these studies differed because different criteria were used to evaluate the R² analogs.
Kvalseth's sixth criterion for a "good" R² statistic for the linear model (Kvalseth, 1985) requires an R² measure to be comparable across different models fitted to the same data. Menard (2000) extended this criterion, requiring an R² measure to be comparable not only across different predictors but also across different dependent variables and different subsets of the dataset. With the help of an empirical example, Menard (2000) demonstrated that R² measures in logistic regression are sensitive to the incidence of the event of interest in the population. Even if the coefficients associating particular variables with the outcome are the same in different populations, the values of R² for populations with different incidence rates tend to differ. This phenomenon is sometimes referred to as the "base-rate" problem (Menard, 2000).
Having an R² measure that depends on the incidence of the response is a disadvantage if one is seeking to compare the predictive ability of two different sets of prognostic factors, or to compare the same set of factors in two subgroups of a population or in two different populations. If an R² measure depends on the underlying incidence of the disease under study, then the R² values for these two cases could differ because of the difference in the underlying incidence and not because of different predictive abilities. This phenomenon is illustrated with an empirical study in the following subsection.
An empirical example: The data used in this example are a subset of the Framingham Heart Study with known values of the covariates (age, systolic blood pressure, serum cholesterol, current cigarette smoking status and diabetic status). A logistic model for the ten-year incidence of Coronary Heart Disease (CHD) was estimated in each sex group and thirteen different R² measures were calculated. Table 1 presents the estimates for each of the thirteen R² measures. With only a few exceptions, the measures are larger for the female group than for the male group. If we had performed OLS regression, we would claim that we are able to predict CHD better in women than in men. However, women developed CHD at only half the rate that men did, and if our measures are affected by the underlying rate of disease, then it would be misleading to make such a claim.
For a more detailed examination of the effect of the base-rate on potential measures of explained variance, we conducted a simulation study.
The purpose of this article is to study the base-rate sensitivity of several R² measures in logistic regression. We use an actual dataset to simulate populations with different base-rates. Logistic models are generated using actually occurring covariate values. The organization of the study is as follows: we introduce the R² measures to be examined, discuss the simulation methods and results, and close with summary and concluding remarks.

R²-measures in logistic regression:
We present some of the R² measures which have been proposed in the literature to estimate explained variation in logistic regression. Consider n observations (y_i, x_i) on a binary response variable y and a covariate vector x = (x_1, …, x_p). The relationship between y and x is modeled by the logistic model (Eq. 1):

π(x) = Pr(Y = 1 | x) = exp(β'x) / [1 + exp(β'x)]     (1)

where β is a (p+1)-dimensional parameter vector (x being augmented by a leading 1 for the intercept). We denote the estimates from a logistic regression by β̂ and the fitted probabilities by π̂. For a logistic model with binary y it can be shown that ȳ = π̄, the mean of the conditional probabilities of success over the observed combinations of the covariate values.
Ordinary least squares R² (R²_OLS): This is a natural extension of the coefficient of determination in OLS regression to the case of a binary y and is given by Eq. 2:

R²_OLS = 1 - Σ_i (y_i - π̂_i)² / Σ_i (y_i - ȳ)²     (2)

Gini's concentration R² (R²_G): The Gini concentration C(π) was proposed as a measure of dispersion of a nominal random variable that assumes the integral values j, 1 ≤ j ≤ s, with probability π_j (Haberman, 1982). If the outcome variable is binary, C(π) reduces to 2π(1-π), where π is the probability that Y = 1. The Gini concentration R² is then given by Eq. 3:

R²_G = 1 - Σ_i 2π̂_i(1 - π̂_i) / [2nȳ(1 - ȳ)]     (3)

Likelihood ratio R² (R²_L): Since the method of maximum likelihood is the primary method of parameter estimation in logistic regression, it seems quite natural to extend this concept of explained variation to the logistic regression setting. Writing L_0 for the likelihood of the intercept-only (null) model and L_M for the likelihood of the fitted model, the likelihood ratio R² is given by Eq. 4:

R²_L = 1 - log L_M / log L_0     (4)

Maximum likelihood R² (R²_CS) and its max-rescaled version (R²_N): Maddala (1983) and Magee (1990) proposed the following R² analog (Eq. 5):

R²_CS = 1 - (L_0 / L_M)^(2/n)     (5)

Because R²_CS cannot attain one even for a perfect model, it is rescaled by its maximum, 1 - L_0^(2/n), to produce Eq. 6:

R²_N = R²_CS / [1 - L_0^(2/n)]     (6)

Contingency coefficient R² (R²_C): Aldrich and Nelson (1984) proposed an R² analog based on the model chi-squared statistic G_M = -2 log(L_0/L_M). It is a variant of the contingency coefficient and is given by Eq. 7:

R²_C = G_M / (G_M + n)     (7)

R²_C has the same mathematical form as the squared contingency coefficient and as such cannot equal one, even for a model that fits the data perfectly, because of the addition of the sample size in the denominator. Because of this limitation, Hagle and Mitchell (1992) proposed to adjust R²_C by its maximum to produce Eq. 8:

R²_C,adj = R²_C / [(-2 log L_0) / (-2 log L_0 + n)], with -2 log L_0 = -2n[p̂ log p̂ + (1 - p̂) log(1 - p̂)]     (8)

where p̂ is the sample proportion of cases for which y = 1.
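For concreteness, the likelihood-based measures above can be computed directly from the observed outcomes and the fitted probabilities. The following Python sketch is illustrative only (the function name parametric_r2 and the toy inputs are not part of the original analysis); it assumes fitted probabilities π̂ have already been obtained from a logistic fit:

```python
import numpy as np

def parametric_r2(y, p):
    """Parametric R^2 analogs (Eq. 2-8) from 0/1 outcomes y and
    fitted probabilities p = pi-hat (illustrative helper, not from the paper)."""
    n = len(y)
    ybar = y.mean()
    # log-likelihoods of the intercept-only (null) and fitted models
    ll0 = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    llm = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    gm = -2 * (ll0 - llm)                       # model chi-squared G_M
    r2 = {}
    r2["OLS"] = 1 - np.sum((y - p) ** 2) / np.sum((y - ybar) ** 2)       # Eq. 2
    r2["G"] = 1 - np.sum(2 * p * (1 - p)) / (2 * n * ybar * (1 - ybar))  # Eq. 3
    r2["L"] = 1 - llm / ll0                                              # Eq. 4
    r2["CS"] = 1 - np.exp(2 * (ll0 - llm) / n)                           # Eq. 5
    r2["N"] = r2["CS"] / (1 - np.exp(2 * ll0 / n))                       # Eq. 6
    r2["C"] = gm / (gm + n)                                              # Eq. 7
    r2["C_adj"] = r2["C"] * (-2 * ll0 + n) / (-2 * ll0)                  # Eq. 8
    return r2

# toy example: a model that separates the two groups reasonably well
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
r2 = parametric_r2(y, p)
```

Note that R²_N exceeds R²_CS and the adjusted contingency coefficient exceeds R²_C, since both rescalings divide by a quantity below one.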
Squared Pearson correlation (R²_P): In linear regression, R² is mathematically equivalent to the squared correlation between y and ŷ, its fitted value according to the model. The same idea extends to logistic regression, and the R² analog is obtained by squaring the correlation coefficient between y and π̂ (Maddala, 1983), Eq. 9:

R²_P = [corr(y, π̂)]²     (9)

Squared Spearman correlation (R²_S): Spearman's rank correlation r_S is simply the Pearson product-moment correlation between the ranks of y and π̂. If we denote the rank of z by R(z) and the mean of the ranks of both variables by R̄ ≡ (n+1)/2, then Spearman's rho is given by Eq. 10:

r_S = Σ_i [R(y_i) - R̄][R(π̂_i) - R̄] / √{Σ_i [R(y_i) - R̄]² · Σ_i [R(π̂_i) - R̄]²}     (10)

Spearman's rho is very close to the Pearson product-moment correlation in normally distributed samples. For notational consistency, we will use R²_S to denote squared r_S hereafter.

Squared Kendall's taus (τ²_a and τ²_b): Kendall (1990) suggested three possible coefficients, designated τ_a, τ_b and τ_c. Only the first two are considered in our simulation study. With S = Σ_{i<j} sign(y_i - y_j) sign(π̂_i - π̂_j), Kendall's τ_a and τ_b are defined respectively as Eq. 11 and 12:

τ_a = S / [n(n-1)/2]     (11)

τ_b = S / √(P_y · P_π̂)     (12)

where P_y and P_π̂ denote the numbers of pairs not tied on y and on π̂, respectively, and sign(z) equals 1 if z > 0, 0 if z = 0 and -1 if z < 0.
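The rank-based measures of Eq. 10-12 can be sketched with SciPy, whose kendalltau computes the tie-corrected τ_b by default; τ_a is computed by hand since it keeps all pairs in the denominator. The outcome and fitted-probability vectors below are hypothetical:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# hypothetical outcomes and fitted probabilities pi-hat
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p = np.array([0.10, 0.25, 0.30, 0.55, 0.40, 0.60, 0.70, 0.90])

r2_S = spearmanr(y, p).correlation ** 2      # squared Spearman's rho (Eq. 10)
tau_b = kendalltau(y, p).correlation         # SciPy's default is tau-b (Eq. 12)

# tau-a keeps all n(n-1)/2 pairs in the denominator (Eq. 11)
n = len(y)
S = sum(np.sign(y[i] - y[j]) * np.sign(p[i] - p[j])
        for i in range(n) for j in range(i + 1, n))
tau_a = S / (n * (n - 1) / 2)
```

Because the binary y produces many tied pairs, |τ_b| exceeds |τ_a| here: τ_b removes the tied pairs from its denominator while τ_a does not.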

Squared Somers' d:
Under the hypothesis that y causes or predicts π̂, Somers (1962) proposed the coefficient d_π̂y, and for the hypothesis that π̂ causes or predicts y, the proposed coefficient is d_yπ̂. With S = Σ_{i<j} sign(y_i - y_j) sign(π̂_i - π̂_j), and P_y and P_π̂ the numbers of pairs not tied on y and on π̂ respectively, the coefficients are given respectively by Eq. 13 and 14:

d_π̂y = S / P_y     (13)

d_yπ̂ = S / P_π̂     (14)

with sign(z) defined as above. Somers' d penalizes for pairs tied on y only, in directional (asymmetric) hypotheses in which y causes or predicts π̂, and for pairs tied on π̂ only, in hypotheses in which π̂ causes or predicts y. Kendall's τ_b is the geometric mean of the two asymmetric Somers' d's, i.e., τ_b = √(d_π̂y · d_yπ̂). Because of this relationship, which is the same as the relationship between the classical regression coefficients and the product-moment correlation (r² = b_xy b_yx), Somers' d is often viewed as an analog of a regression coefficient rather than a correlation coefficient. For notational consistency, we will use R²_D to denote squared d_π̂y hereafter.
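The two asymmetric coefficients and their geometric-mean relationship to τ_b can be verified numerically. The helper somers_d and the toy data below are illustrative, not taken from the paper:

```python
import numpy as np

def somers_d(x, y):
    """Somers' d with x as the "cause"/predictor variable: concordant-minus-
    discordant pairs divided by the number of pairs not tied on x (Eq. 13-14)."""
    n = len(x)
    s = 0.0
    untied_x = 0
    for i in range(n):
        for j in range(i + 1, n):
            sx, sy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
            s += sx * sy
            untied_x += (sx != 0)
    return s / untied_x

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p = np.array([0.10, 0.25, 0.30, 0.55, 0.40, 0.60, 0.70, 0.90])

d_py = somers_d(y, p)            # y predicts pi-hat: ties on y penalized
d_yp = somers_d(p, y)            # pi-hat predicts y: ties on pi-hat penalized
tau_b = np.sqrt(d_py * d_yp)     # geometric-mean relationship to tau-b
```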

Area Under ROC Curve (AUC):
Suppose that the population under study can be divided into two subpopulations based on the status of the outcome variable Y: D (diseased) if Y = 1 and D̄ (not diseased) if Y = 0. Let F_1(·) and F_0(·) be the CDFs of π(x), the conditional probability of the outcome of interest, in D and D̄, respectively, and let c ∈ ℝ be a classification threshold. For a given value of c, the sensitivity and specificity of a classification model are defined as sensitivity = Pr(π(x) ≥ c | Y = 1) = 1 - F_1(c) and specificity = Pr(π(x) < c | Y = 0) = F_0(c), respectively. The ROC curve is then obtained by plotting 1 - F_1(c) against 1 - F_0(c) for all possible values of c. The area under the ROC curve is given by Eq. 15:

AUC = ∫ [1 - F_1(c)] dF_0(c) = Pr(π_1(x) > π_0(x))     (15)

where π_1(x) and π_0(x) denote the conditional probability of disease for a subject drawn from the diseased and the non-diseased group, respectively. The last equality follows from the independence of the conditional probabilities in the two groups. Thus AUC represents the probability that a randomly chosen diseased subject is correctly rated or ranked with greater suspicion than a randomly chosen non-diseased subject.
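The concordance interpretation of Eq. 15 can be checked empirically: given fitted probabilities for diseased and non-diseased subjects (drawn here from arbitrary Beta distributions purely for illustration), the fraction of diseased/non-diseased pairs in which the diseased subject scores higher estimates the AUC:

```python
import numpy as np

rng = np.random.default_rng(0)
# purely illustrative pi-hat values: diseased subjects tend to score higher
p_diseased = rng.beta(4, 2, size=500)
p_healthy = rng.beta(2, 4, size=500)

# Eq. 15: AUC = Pr(pi-hat for diseased > pi-hat for healthy), ties counted half
diff = p_diseased[:, None] - p_healthy[None, :]
auc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
```

With a discriminating model the estimate lands well above 0.5, the value expected under no discrimination.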

MATERIALS AND METHODS
Simulation study: Consider a response variable Y and a covariate vector X = (X_1, X_2, …, X_p)'. Consider further m different populations, or m subsets of the same population, and assume that each of the covariates X_1, X_2, …, X_p has the same effect on the outcome variable Y in all populations (i.e., fixed effects across the populations) but that each population has a different proportion of successes (Y = 1), with base-rate π^(r) = Pr^(r)(Y = 1) in the r-th population. Under the logistic model, the odds of success in the r-th (r = 1, 2, …, m) population are given by Eq. 16:

π^(r)(x) / [1 - π^(r)(x)] = exp(β_0^(r) + β_1 x_1 + … + β_p x_p)     (16)

The odds ratio of the j-th population relative to the k-th population is then given by Eq. 17:

t = exp(β_0^(j) - β_0^(k))     (17)

This gives β_0^(j) = β_0^(k) + log(t). Therefore, by fixing the odds ratio at some constant t > 0, it is possible to find a β_0* which can be used to generate new outcomes Y* with odds of success t times the odds of success in the original data. To design our simulation study, we elected to take advantage of naturally occurring covariate values by employing an existing dataset to generate true logistic regression models. The data were a subset of the Framingham Heart Study data and consisted of 4,123 men and women examined at a baseline examination and followed for 10 years. During those 10 years, 370 (about 9%) developed Coronary Heart Disease (CHD). Males were twice as likely to develop CHD as females (6.0% for females, 12.7% for males). We simulated the logistic models as follows.
(i) Fit the logistic model to the original data to obtain the estimates β̂_0, β̂_1, …, β̂_p. (ii) For a given odds multiplier t, set β_0* = β̂_0 + log(t). (iii) Generate y*_i as a Bernoulli variable with success probability π*(x_i) computed from (β_0*, β̂_1, …, β̂_p). (iv) Fit the logistic model to (y*, x) and compute the thirteen R² measures. (v) Repeat steps ii-iv for t = 3, 4, …, k. We used k = 14 in our simulation, which yielded datasets with base-rates ranging from 8.6-49.6%. (vi) Repeat steps ii-v 10,000 times for each of the sample sizes 500, 1,000, 2,000 and 4,000. Sample size did not affect the average value of any of the R² measures, so we present only the results for the sample size 4,000.
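The intercept adjustment underlying the simulation can be sketched as follows; the coefficient values and the single simulated covariate are hypothetical stand-ins (the actual study used the Framingham covariates and fitted coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=n)      # hypothetical single covariate
b0, b1 = -2.3, 0.5          # hypothetical coefficients giving a low base-rate

def draw_outcomes(b0, b1, x):
    """Draw Bernoulli outcomes from the logistic model of Eq. 16."""
    pi = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return rng.binomial(1, pi)

y = draw_outcomes(b0, b1, x)

# multiplying every subject's odds of success by t shifts only the
# intercept: b0* = b0 + log(t), as implied by Eq. 17
t = 4.0
y_star = draw_outcomes(b0 + np.log(t), b1, x)
```

The shifted intercept raises the realized base-rate while leaving the covariate effect b1 unchanged, which is exactly the fixed-effects-across-populations assumption of the simulation design.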

Simulation results:
Intercorrelations of the different R² measures and their correlations with the base-rate are presented in Table 2. Squared correlations of the R² measures with the base-rate are presented in the last row of the same table. Only two of the 13 R² measures, AUC and R²_D, have very low (0.011) squared correlations with the base-rate. R²_L has some advantage over the other R² measures in the sense of having a lower squared correlation with the base-rate, but that correlation is still substantial.
Means of the parametric and nonparametric measures are plotted against the base-rate in Fig. 1 and 2, respectively. All eight parametric measures exhibit a monotonically increasing tendency with the base-rate (Fig. 1). R²_CS is uniformly dominant over all other parametric measures, followed by R²_N, across the levels of π̄. For small π̄ (less than 0.2), R²_L appears to be the third largest measure, but as π̄ approaches 0.5 other measures come to the fore, making R²_L the smallest R² measure for π̄ > 0.25. The remaining five parametric measures have almost identical means across the levels of π̄.
Among the nonparametric measures, the AUC statistic consistently produced very large mean values irrespective of the base-rate, followed by R²_D (Fig. 2). These two measures, unlike the rest of the nonparametric R² measures, exhibit a negative correlation with π̄, which appears to arise from the decreasing values of AUC at very low base-rates, particularly in the range 0-2%. Otherwise, AUC and R²_D appear to be mostly invariant with respect to the base-rate. All of these measures had very small standard deviations, and we did not find any noticeable difference among their standard deviations (Table 3).
We evaluated the base-rate sensitivity of the R² measures by examining the rate of change in their means associated with small changes in the base-rate in the neighborhood of a given level of π̄. To do so, we numerically computed derivatives of the R² measures with respect to the base-rate using the "dydx" function available in the Stata 9.1 software (Stata Base Reference Manual, 2005). We did not consider the sign of the derivatives, as we were interested in the magnitude rather than the direction of the base-rate sensitivity of these measures. The results are presented in Fig. 3 for the parametric and in Fig. 4 for the nonparametric R² measures. The marked points represent absolute values of the numeric derivatives of the R² measures evaluated at each level of π̄ employed. The large fluctuation observed at the higher end of π̄ is attributed mainly to the error in estimating the derivatives at the end points. It is evident from Fig. 3 that R²_L has a clear advantage over the rest of the parametric measures in the sense of having relatively small base-rate sensitivity. Like the other parametric measures, it exhibits a steady decrease in base-rate sensitivity with increasing π̄, but at a considerably slower rate. The rest of the measures are in fairly good agreement with each other in terms of their sensitivity to the base-rate at all levels of π̄ employed. They exhibit very high base-rate sensitivity at small values of π̄ (<0.25). With increasing π̄, their base-rate sensitivity rapidly decreases, resulting in quite low sensitivity when π̄ is close to 0.5.
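A simple stand-in for Stata's "dydx" is a finite-difference derivative over the grid of base-rates. The sketch below uses numpy.gradient and an artificial monotone curve in place of the simulated R² means (the real study used the simulation output); note that the end points rely on one-sided differences and are therefore less accurate, consistent with the end-point fluctuation discussed above:

```python
import numpy as np

# grid of base-rates and an artificial monotone R^2 curve standing in
# for the simulated mean R^2 values
base_rate = np.linspace(0.05, 0.50, 10)
r2_mean = 0.3 * base_rate / (base_rate + 0.1)

# numeric derivative dR^2/d(base-rate); sensitivity is its absolute value
# (interior points use central differences, end points one-sided ones)
sensitivity = np.abs(np.gradient(r2_mean, base_rate))
```

For this stand-in curve the estimated sensitivity decreases with the base-rate, mirroring the pattern reported for the parametric measures.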
Among the nonparametric measures, AUC and R²_D appear to be the least base-rate sensitive at all levels of π̄ (Fig. 4). In addition, these two measures show almost no fluctuation in their base-rate sensitivity for π̄ > 0.2 (apart from the fluctuation at the higher end of π̄, which, as mentioned earlier, is primarily due to estimation error). τ²_a, unlike the other R² measures, exhibits a convex relationship with π̄. Its base-rate sensitivity remains between that of τ²_b, the second worst measure in terms of base-rate sensitivity, and R²_D.

DISCUSSION
Summary and concluding remarks: The very existence of a plethora of R² measures for logistic regression sometimes creates confusion about which measure to use in conjunction with a logistic regression analysis. Researchers have suggested various criteria for judging these measures (see, for example, Mittlbock and Schemper, 1996; Kvalseth, 1985; Sharma, 2006). Although the base-rate sensitivity of these R² measures has been documented (Menard, 2000; Gordon et al., 1979; Ash and Shwartz, 1999), whether this dependence on π̄ is always a weakness of the R² measures is debated. Ash and Shwartz (1999) used a simple parametric model, applicable to a very specific situation, to clarify the effect of the base-rate on R²_OLS and argued that it was in fact a strength rather than a weakness, because in real-world situations the value of a diagnostic test does, in fact, depend on the prevalence of the problem in the population being tested (Ash and Shwartz, 1999; Hilden, 1991). This argument was further developed by Mittlbock and Schemper (2002), who noted that if the base-rate is close to either 0 or 1, the outcome is already largely determined and there is not much uncertainty left to be explained, whereas if the base-rate is close to 0.5, the total variability in the dependent variable is high and the covariates may explain more of the uncertainty.
However, having an R² measure that depends on the incidence of the response has clear practical disadvantages if one is seeking to compare the predictive ability of a set of predictors in two subgroups of a population. As the empirical example illustrated, the analysis could lead to misleading conclusions: if a model shows better predictability in one population than in another on the basis of a particular R² value, it may be simply because of the difference in the underlying incidence rates and not because of different predictive abilities of the set of predictors used. In this study we have examined the base-rate sensitivity of thirteen R²-type measures that have been reported to have potential as measures of explained variation in logistic regression analysis. Eight of these measures are parametric and the rest are nonparametric in nature. All of the R² measures are sensitive to fluctuations in the base-rate, but the magnitude of the sensitivity varies greatly from one measure to another. The results show that nonparametric measures tend to be less base-rate sensitive than parametric measures. Four of them, τ_a, τ_b, R²_D and R²_S, are measures of ordinal association, and using measures of ordinal association with a logistic regression model may produce inconsistent behavior. For example, if a weak continuous covariate is added to a model with a strong binary covariate, the proportion of explained variance as measured by a parametric R² will increase slightly. But as a consequence of adding the continuous covariate, ranks that were tied in the single-covariate model are forced to slightly different values of the predictor, which may produce a noticeable decrease in the proportion of explained variance as measured by, for example, a squared rank correlation.
Among the parametric measures, R²_L is the most base-rate invariant. In addition, its base-rate sensitivity fluctuates less than that of the other parametric measures across the levels of π̄. Its closest competitors are R²_N and R²_CS; the observed difference between the base-rate sensitivity of these measures and that of R²_L is only marginal.

CONCLUSION
Use of R² in logistic regression has become standard practice and many researchers have recommended it: Stata reports R²_L as part of its logistic regression output; Menard (2000) preferred R²_L over other R² measures because of its interpretability and independence from the base-rate; and Liao and McGee (2003) recommended routine use of R²_L for logistic regression analysis. In spite of its interpretability and relatively superior ability to withstand fluctuations in the base-rate, it is often criticized for having small values (Hosmer et al., 2011). If we consider y to be a binary proxy for a latent continuous variable y* that follows a multiple linear regression model, then the R² analogs can be viewed as estimates of ρ², the R² on the latent scale of y*. Sharma and McGee (2008) found R²_CS to be numerically most consistent with the underlying ρ², with R²_N its nearest competitor. R²_CS is based on the model chi-squared statistic and therefore has the advantage of being based on the quantity the model tries to maximize. These two measures therefore deserve serious consideration, especially when it is reasonable to believe that an underlying latent variable exists: they provide valuable information, which R²_L fails to provide, about the strength of the relationship between the covariates and the underlying latent variable.
There are other potential factors whose effect on the base-rate sensitivity of R² measures was not studied in the current research, so it would be dangerous to draw strong conclusions from this research alone. Nevertheless, some potential measures of explained variation have been identified which tolerate fluctuations in the base-rate reasonably well and at the same time provide a good estimate of the explained variation on an underlying continuous variable.

ACKNOWLEDGEMENT
The data from the Framingham Heart Study were obtained from the National Heart, Lung and Blood Institute. The views expressed in this article are those of the authors and do not necessarily reflect those of this agency.