Truncated Estimate in Log-Binomial Model: Algorithm and Simulation

Problem statement: Relative risk has concrete meanings of comparing t wo groups and measuring the association between exposures and out comes in medical and public health studies. Logbinomial model, using a log link function on binary outcomes, is straightforward to estimate risk rati os, whereas generates boundary problems. When the estim ates are located near the boundary of constrained parameter space, common approaches or p rocedures using software such as R or SAS fail to converge. Approach: In this study we proposed a truncated algorithm to estimate relative risk using the log-binomial model. We used simulation studies on both single and multiple covariates models to investigate its performance and compare with other similar methods. Results: Our algorithm was shown to outperform other methods regarding precisi on, especially in high dimensional predictor space. Conclusion: The truncated IWLS method solves the slow converge nce problem and provides valid estimates when previously proposed methods fa il.


INTRODUCTION
For datasets with binary responses, there has been tremendous research study done by statisticians on prediction and inference. So far, the logistic model has been among the most popular models and widely used in the fields of medical and public health studies due to its pleasant characteristics linking to estimation of the odds ratio (McCullagh and Nelder, 1999). Nevertheless, odds ratio may occasionally not be scientifically appropriate to measure the association between exposures and outcomes. Relative risk, which is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group, has more concrete meanings in the sense of comparing two groups (Holford, 2002;Robbins et al., 2002;Spiegelman and Hertzmark, 2005). If we further explore the literatures, we found out that many studies used odds ratios obtained from logistic models to approximate the relative risks. Neglect of the fact that such approximation merely retains its validity under the rare disease assumption will produce largely biased estimates away from the null hypothesis and consequently lead to false positive conclusions in many scenarios (Zhang and Yu, 1998;McNutt et al., 2003;Chu and Cole, 2010). As an alternative, log-binomial models may be more favorable with direct connection of coefficient estimates to the risk ratio. However, the algorithm of log-binomial models sometimes fails to converge and produces an invalid Maximum Likelihood Estimate (MLE) (Baumgarten et al., 1989;Petersen and Deddens, 2010), attributable to the constrained space of the linear predictors using log link. The problem becomes severe when multiple covariates are considered (Lumley et al., 2006;Williamson, 2011). Many researchers have been working on the logbinomial model and offered several virtuous estimation methods. Lumley et al. (2006) had a comprehensive discussion on the issues of log-binomial model.
Among all, the methods in favor include: Maximum Likelihood Estimation; Nonlinear Least Squares; Scaling by the Average Prevalence; Duplication of Cases. Whereas some of these methods don't fit in the regression scenario. For instance, the Scaling by the average prevalence method requires that the model has the sole binary predictor, which renders the method limited in general. In contrast, MLE based algorithm is prevalent among the regression studies owing to the solid characteristics of inference. Major MLE based algorithms include: Searching the boundary; COPY method; Truncating fitted values (We also name it as "Truncated IWLS Method (T-IWLS)"); (Wacholder, 1986;Deddens and Petersen, 2003) The Searching the boundary algorithm may get to be intractable due to the difficulty in finding the boundaries in high dimensional parameter space. In this article, we focus on the discussions on COPY and Truncated IWLS methods. A modified algorithm of Truncated IWLS method is proposed.

MATERIALS AND METHODS
COPY method: Deddens and Petersen (2003) proposed a COPY method which could possibly obtain estimates close to the maximum likelihood estimator. The idea of COPY is to take C copies of the data and 1 copy with response Y set to 1-Y and fit the relative risk model to these modified data. They did not achieve clarity in several issues: (a) for the example they illustrated, they did not provide detailed proof showing the mechanism of why and how the method works; (b) further simulation results indicated that COPY method may not necessarily study well when C is very large and how to adjust for large C remains unsolved; (c) the performance of COPY method in models with multiple covariates is not yet promising according to our simulation results.

SAS solutions:
Here we use SAS 9.2 PROC GENMOD to explore the illustrative example of Deddens and Petersen (2003). We fit the data using 3 models: (a) log link with binomial distribution; (b) log link with Poisson distribution; (c) log link with no specific distribution.
From the table we observed that estimates from the log-binomial model failed. SAS 9.2 tried to fix the boundary problem by unspecification of the distribution. The estimates of-2.1538 and 0.2351 are not far from the exact MLE, although still outperformed by COPY Method. Moreover, estimates from the model with unspecified distribution lack interpretability in practical use.

Truncated IWLS method:
The log-binomial model belongs to the class of Generalized Linear Models (GLM). It is known that the Iterative Weighted Least Square method (IWLS) (Wolke and Schwetlick, 1988) is frequently used to estimate GLM parameters. We start with illustrating the algorithm of IWLS, particularly for log-binomial model and then introduce the proposed the method in subsequent context.

IWLS method:
Assume Y is Bernoulli distributed, we can write the p.d.f. as: where, p=E(Y). Set the linear predictor as: Log-link or n log(p)or p e η = = is adopted. Based on GLM theory on exponential distribution family, we can write: We can get the MLE of β 's using the following equation interatively until converge Eq. 1: However, the convergence problem occurs occasionally in estimation and leads to un-estimable model parameters using the above IWLS method. In SAS, we will receive warnings indicating the algorithm fails to converge.

Truncated IWLS method:
We lands on the method of truncating fitted values to get MLEs for log-binomial models to solve the boundary problem. The idea is very straightforward. p = E(Y) = e Xβ , which may exceed the parameter space [0, 1] in the iterative steps when values hitting the boundary, i.e., X β > 0. To solve the problem, we propose a constraint on Xβ in each iterative step. As Lumley et al. (2006) pointed out: An approximate MLE for the log-binomial model is to simply truncate the range of ρ (Wacholder, 1986). Usually we pick a threshold near 1, such as 0.999 and ρ is set to min (p, 0.999) after each iteration for the purpose of computing working residuals and working weights for the next iteration. In the i-th iterative step, we update β new using β dd when X β old <0, otherwise, β dd =T is used in (1). It is equivalent to construct a constrain: Here, 0 <T<1. It can be conjectured that T are preferred to be some large value close to 1 when the MLE is on the boundary of parameter space. The estimator improves the robustness although scarifying some degree of unbiasness in these extreme scenarios.
At the same time, we notice that iterations may be stuck within parameter areas which are impossible for convergence. To improve the efficiency of the algorithm, we add another step to assist iterative estimates in jumping out of such area. Meanwhile, this "jumping" step will help to solve the problem of instability and diverged estimates due to starting values. The idea is intuitive: if the steps fail to converge after a significant number of iterations, we replace ρ i by independent sample from the interior of the parameter space. For instance, we can sample ρ i ~ Uniform (0.1, 0. 9) .There is no definitive choice of the range for uniform distribution. The purpose is to start with new values when iterations rush to the wrong direction against convergence. This step makes this method favorable and greatly improves the truncated IWLS algorithm.
We can oversee its benefits when generalizing the application to multiple covariate cases. Consider the following simple example with 2 quantitative covariates.
For the log-binomial model, the parameter space is: Figure 1 illustrates the surface of the boundary which has consists of 4 segments. Using Truncated IWLS algorithm without the jumping step, we experienced cases when iterations kept bouncing back and forth in one particular corner. Adding this step avoids those redundant iterations.
Some issues still remain open to discussion: (1) the choice of T and its sensitivity; (2) the performance in multiple covariates model; (3) comparison with COPY and other methods. These issues will be addressed in the remaining context of this study. We begin with the illustrative example by Deddens and Petersen (2003) and explore the properties of the method using simulated datasets in single and multiple covariate models with dichotomous outcomes. The conclusion will be drawn based upon the comparison in the end.

Illustrative example by Deddens and Petersen (2003) revisited:
In this example, 10 pairs of observations are generated (Table 3). Y is the response variable with possible values 0 and 1. Variable X is the predictor in the range of (1, 10).
We use R to implement the algorithm of the proposed method. We compare it with Exact MLE, COPY method and log-binomial and log-poisson models from SAS PROC GENMOD (Table 1).      Table 4. The results confirm our conjecture that larger T could enhance the estimation. With large enough T, e.g., 0.999, 0.9999, the truncated method provides reasonable estimates compared with Exact MLE. The difference between the choice of T = 0.999 and 0.9999 is almost neglectable. It also outperforms the COPY method regarding precision. We tried different sets of the starting values; all of them converge to the estimates listed in Table 4. However, we cannot neglect the tendency of underestimating standard errors using truncated IWLS. Scale factors in GLM adjust the effect of over-dispersion. Originally, the scale factor for binomial model is 1. However, the distribution achieved by truncated algorithm is not exactly binomial since we add a constraint on the boundary. Thus, some correction factor is required. Observed from later simulation, the magnitude of the impact by truncation can partially measured by the prevalence at median values of covaraites. When the prevalence is low, the underestimation of the standard errors is trivial and almost ignorable. We will leave the discussion of the problem open to further investigation.

Single covariate model:
We used same simulation scheme as Deddens and Petersen (2003) to compare with the proposed method. Data were generated from the log-binomial model. Covariate X ranging from 0-10 is generated uniformly. Given the prevalence at X = 5 varying among {0.1, 0.3, 0.5, 0.7, 0.9), three values are selected for β 1 namely 0, medium and large. True values of β 1 and corresponding β 0 are listed in Table 5. Note that β 0 = log (prevalence) -5×β 1 . Each simulation involves 1,000 replications (same X's, different Y's). Table 6 displays the average slope and average estimated standard errors of slope from the combined method of PROC GENMOD and COPY method (details in Deddens and Petersen, 2003) and the proposed method using T = 0.999 and 0.9999.
Comparing to true population values (Table 5), the truncated IWLS method estimates are all as close as the combined method and indeed verify that truncated method is exact or at least asymptotic MLE. Using different values of T yield very similar precision and MSE (Table 7), indicating the truncation IWLS method estimates is very stable regarding the choice of T. This characteristic can make the method more attractive and realistic in application since T = 0.9999 is sufficient in dealing with various scenarios.
Multiple covariate model: Deddens and Petersen (2003) claimed that the simulations with 2 independent variables are consistent with the results for one dependent variable using COPY method.   However, COPY method also may fail to converge when the dimension of covariates goes higher. The parsimonious strategy is to make more copies, which heavily increase the burden of the computation. Our method is quietly efficient without much adjustment. One remaining concern is the choice of truncation threshold T. The following simulation is mainly designed to address the issue. Two sets of simulations are conducted. The first one is based on 2-covaraite model with covariates X 1 ranging from 0-10 and X 2 among [0,5]. Given the prevalence at (X 1 , X 2 ) = (5, 2.5), we select relatively bigger and smaller values of β 1 respectively. The specified values are listed in Table 8. We then compare the truncated IWLS with T = 0.999 and T = 0.9999 regarding the precision and MSE. Each simulation contains 1,000 replications (same covariates and different Y). Table 8 summarizes the population values of (β 0 , β 1 , β 2 ) in these 2-covariate models.
The results of simulations from 2-covaraite models are summarized in Table 10. The different truncation values yield very close results. Both obtain the estimates of parameters close to population values (Table 8)     Although the results are not listed in this article, results of simulations from 3-covariate models are consistent with those in 2-covaraites models. In fact, we do increase the number of covariates up to 8 in exploratory studies. The results support our conclusion that selection of T is trivial in truncated IWLS method once T is large enough. The suggested value in application is 0.9999.

DISCUSSION
In this study we discussed the importance of applying log-binomial models instead of logistic models to epidemiological studies when the rare disease assumption is invalid. In all, the truncated IWLS method study reasonably in both single and multiple covariates models. Our algorithm solves the slow convergence problem and provides valid estimates when previously proposed methods fail. Simulation results also show that the algorithm is not sensitive when the threshold in the truncation is selected close enough toward 1. Ad hoc methods such as multiple endpoints investigation could be used to obtain a working threshold, but more objective methods are in demand. Another potential research topic is to study the convergence performance in the truncated IWLS algorithm so that we can obtain further information on the convergence rate. Simulation studies to compare the convergence rate between the truncated IWLS algorithm and the algorithm of its corresponding untruncated model is also suggested.

CONCLUSION
The boundary problems in log-binomial models were solved by a newly developed truncated IWLS method. The proposed method outperformed the existing COPY method when multiple covariates co-exist in a log-binomial model and is therefore of practical value. Such models and algorithms can be widely used in highprevalence disease modeling, such as diabetes and cardiovascular diseases.