Parameter Estimation and Determination of Sample Size in Logistic Regression

Determining the sample size is an important and usually difficult step in planning a statistical study. Among the hurdles to be overcome, one must obtain an estimate of one or more error variances and specify an effect size of importance. Although there is a temptation to take shortcuts, this study examines parameter estimation and sample size determination in logistic regression. We looked at two methods of obtaining sample sizes, having obtained the parameter estimates by varying the response probability. The results from the real-life data showed that, when the response probabilities are small, the approximation with a correction term (Equation 12) performs better than the approximation in Equation 8, but it greatly overestimates the sample size when the response probabilities are large.


INTRODUCTION
Logistic regression is a type of regression used when the dependent variable is categorical (Adeleke and Adepoju, 2010). The dependent variable may have two categories (e.g., alive/dead; male/female; Republican/Democrat) or more than two categories. If it has more than two categories, they may be ordered or unordered. Much of statistics is concerned with predicting the value of a continuous variable such as blood pressure, intelligence, oxygen level or wealth. Logistic regression, by contrast, is the method of choice when the response variable is binary. It is highly robust, and the independent variables do not have to be normally distributed or have equal variance in each group. Logistic regression is useful in some situations where the assumptions of linear regression fail. It requires a different type of data and its coefficients have different interpretations. Like linear regression, logistic regression allows results to be graphed with regression lines and predictions to be made given a set of conditions. In this study, our interest is in parameter estimation and the determination of sample size using logistic regression analysis. The literature contains many studies aimed at determining whether a particular variable has an effect on a binary response. Agresti (2007) argued that the study design should determine the sample size needed to provide a good chance of detecting an effect of a given size. He used simple logistic regression as a case study, and his study did not provide many results for multiple logistic regression. This study therefore gives a thorough analysis of the multiple-predictor case, which supports a better approach to sample size determination. We begin by giving background information on related terms such as power analysis.
Power analysis can optimize the resource usage and design of a study, improving the chances of conclusive results with maximum efficiency. Power analysis is most effective when performed at the study planning stage, and as such it encourages early collaboration between researcher and statistician. Muller and Benignus (1992), O'Brien and Muller (1993) and Russell (2001) provide cogent discussions of these and related concepts.
Power analysis is often problematic in practice, being performed infrequently or improperly. There are several reasons for this: it is technically complicated, usually under-represented in statistical curricula and often not performed early enough to be effective. Good software tools for power analysis can alleviate these difficulties and help researchers benefit from these techniques.

JMSS
We propose to develop sample size calculation methods within the proportional odds model structure. Such a sample size is needed to construct a test of hypothesis in Ordinal Logistic Regression (OLR) with the desired power. Logistic regression has been widely accepted in scientific fields (biostatistics, epidemiology, engineering) because it is a simple and effective method of describing the effect of some explanatory variables on a categorical response variable.
Much literature exists on approximations to the power and sample size of different statistical tests within the logistic regression model (Mehta and Tsiatis, 1984; Hilton and Mehta, 1993; Lui, 1993). Whittemore (1989) considered a test for a single parameter with the other parameters treated as nuisance parameters, and gave sample size approximations for standard logistic regression with small response probability. At present, sample size issues in the ordinal logistic regression setting do not appear to have been studied in depth in the literature. Sample size determination in multilevel designs requires attention to the fact that statistical power depends on the total sample sizes for each level. It is usually desirable to have as many units as possible at the top level of the multilevel hierarchy (Snijders, 2005). Russell (2001) offers some suggestions for successful and meaningful sample size determination and also discusses the possibility that sample size may not be the main issue, the real goal being to design a high-quality study. Lin et al. (2010) discussed some crucial issues in the problem formulation, parameter specifications and approaches that are commonly proposed for sample size estimation in microarray experiments. Roy et al. (2007) considered the problem of sample size determination for three-level mixed-effects linear regression models for the analysis of clustered longitudinal data. Three-level designs are used in many areas, in particular in multicenter randomized longitudinal clinical trials in medical or health-related research.
Performing power analysis at the planning stage also focuses attention on effect sizes and variability in the underlying scientific process, concepts that both researcher and statistician should consider carefully at this stage. Muller and Benignus (1992) and O'Brien and Muller (1993) also provide a good general introduction to power analysis. Our focus in this study is therefore to fit a suitable model and check its reliability using logistic regression, and to suggest sample size and power calculation methods for ordinal logistic regression to test statistical hypotheses.

MATERIALS AND METHODS
We consider studies in which a random sample is drawn from the joint distribution of $(Y, X)$, where $Y$ is an ordinal response and $X = (x_1, x_2, x_3, \ldots, x_p)$ is a vector of covariates; Equation 1 relates the response to the predictor $X'$. Since our response categories have a natural ordering, we use the proportional odds model (Equation 2), where $a$ is the vector of intercept parameters and $\gamma' = (\gamma_1, \gamma_2, \ldots, \gamma_p)$ is the slope parameter vector without an intercept term. If $a_j < a_{j+1}$ holds, this model fits a common-slope cumulative model based on the cumulative probabilities of the response categories (Equations 3 and 4), and the OLR model follows as Equation 5.
This model is known as the proportional odds model because the odds ratio of the event $(Y \le j)$ is independent of the category index $j$.
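For orientation, the standard textbook form of the proportional odds model can be written with the symbols defined above ($a_j$, $\gamma$, $X$, $k$ categories). This is a sketch under the assumption that the paper's Equations 2 to 5 follow the usual cumulative-logit parameterization with a plus sign on the slope term; the original equations may differ in sign convention or numbering.

```latex
% Cumulative logit with a common slope vector (assumed form)
\operatorname{logit} P(Y \le j \mid X)
  = \log \frac{P(Y \le j \mid X)}{1 - P(Y \le j \mid X)}
  = a_j + \gamma' X, \qquad j = 1, \ldots, k-1,

% Implied cumulative probabilities
P(Y \le j \mid X) = \frac{e^{a_j + \gamma' X}}{1 + e^{a_j + \gamma' X}},

% Odds ratio between two covariate values: free of the category index j
\frac{P(Y \le j \mid X_1) / P(Y > j \mid X_1)}
     {P(Y \le j \mid X_2) / P(Y > j \mid X_2)}
  = e^{\gamma' (X_1 - X_2)}.
```

The last line makes the "proportional odds" property explicit: the intercepts $a_j$ cancel, so the same odds ratio applies at every cut point $j$.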

Maximum Likelihood Function
When more than one observation on $Y$ occurs at a fixed value $x_t$, it is sufficient to record the number of observations $n_t$ and the number $n_{tj}$ of outcomes in category $j$, for $j = 1, \ldots, k$.
Thus we let $Y_t$, $t = 1, \ldots, n$, be independent multinomial response variables, with cumulative counts $S_{t1} = n_{t1}$, $S_{t2} = n_{t1} + n_{t2}$, ..., $S_{tk} = n_{t1} + n_{t2} + \cdots + n_{tk}$. Since we are dealing with cumulative probabilities, the likelihood can be written, in terms of the parameters of the cumulative transformations, as the product of $k-1$ quantities. The joint probability mass function of $(Y_1, \ldots, Y_n)$ is proportional to the product of multinomial functions.
For a given sample size $n$, the likelihood of the observations $(y_t, x_t)$, $t = 1, 2, \ldots, n$, is written up to a factor that does not depend on the unknown parameters $(a', \gamma')$.
The validity of this model shows that MLE
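The likelihood described in this section can be illustrated with a short sketch. It assumes the cumulative-logit form logit P(Y ≤ j | x) = a_j + γ'x and works with grouped counts n_tj; the function names (`cumulative_probs`, `category_probs`, `log_likelihood`) are hypothetical and are not taken from the paper. The multinomial coefficient is dropped because it does not depend on (a, γ).

```python
import math


def cumulative_probs(a, gamma, x):
    """P(Y <= j | x) for j = 1..k-1, assuming logit P(Y <= j | x) = a_j + gamma'x."""
    lin = sum(g * xi for g, xi in zip(gamma, x))
    return [1.0 / (1.0 + math.exp(-(aj + lin))) for aj in a]


def category_probs(a, gamma, x):
    """P(Y = j | x) for j = 1..k, obtained by differencing the cumulative probabilities."""
    cum = cumulative_probs(a, gamma, x) + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


def log_likelihood(a, gamma, xs, counts):
    """Multinomial log-likelihood kernel; counts[t][j] is n_tj at covariate value xs[t]."""
    total = 0.0
    for x, n_t in zip(xs, counts):
        probs = category_probs(a, gamma, x)
        total += sum(n * math.log(p) for n, p in zip(n_t, probs) if n > 0)
    return total


# Illustrative (made-up) inputs: three ordered categories, one covariate.
a = [-1.0, 1.0]          # increasing intercepts, a_1 < a_2
gamma = [0.5]
print(category_probs(a, gamma, [0.0]))
print(log_likelihood(a, gamma, [[0.0], [1.0]], [[3, 5, 2], [4, 4, 1]]))
```

Maximizing `log_likelihood` over (a, γ), subject to increasing intercepts, would give the maximum likelihood estimates; the optimizer itself is left out of this sketch.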

Sample Size Estimation
One of the main objectives of this study is the estimation of sample size, which is achieved by obtaining a sample that is just large enough to give confidence that an inference can be made with the required precision. Sample size is directly related to the cost and time involved in a survey or data collection.
Let us test the null hypothesis $H_0: \gamma = 0$ versus $H_a: \gamma = \tilde{\gamma}$ at level $\alpha$ with power $\ge 1-\beta$. When the distribution of the estimator of $\gamma$ is treated as normal with mean $\gamma$ and variance $\sigma^2$, the critical region is defined in terms of $Z_\alpha$, the $100(1-\alpha)$th percentile of the standard normal distribution. The sample size $n$ is found so that the test has the specified power $(1-\beta)$ at the alternative $H_a: \gamma = \tilde{\gamma}$, and it is chosen to attain this power in both cases, $\tilde{\gamma} > 0$ and $\tilde{\gamma} < 0$.
For a model of the form in Equations (2), (5) and (6) with one predictor, i.e., $\operatorname{logit}(\pi) = a + \gamma X$, Hsieh (1989) uses an approximate formula to obtain the sample size needed for testing $H_0: \gamma = 0$. Here we need to guess the probability of success $\pi$ at the mean of $x$. The size of the effect is the odds ratio $\theta$ comparing $\pi$ to the probability of success one standard deviation above the mean of $x$. Let $k = \log(\theta)$. An approximate sample size is given by Equation 9: $n = [Z_\alpha + Z_\beta e^{-k^2/4}]^2 (1 + 2\pi\delta) / (\pi k^2)$, where $\delta$ is the correction factor defined by Hsieh (1989).
In the case of the proportional odds model (POM), the sample size is estimated for general response probabilities, with more than two categories, which can be either small or large. The resulting approximation involves $e^{a_j}$, which is small when $a_j + \gamma'X \le 0$ (or $e^{-a_j}$, which is small when $a_j + \gamma'X \ge 0$).
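Hsieh's approximation in Equation 9 can be sketched in a few lines. The function name `hsieh_sample_size` and the input values below are hypothetical and are not the paper's data. The correction factor δ is written in the form in which Hsieh's formula is usually presented, δ = [1 + (1 + k²) e^{5k²/4}] / [1 + e^{-k²/4}], and the test is taken as one-sided at level α; both are assumptions about what the paper intends.

```python
import math
from statistics import NormalDist


def hsieh_sample_size(pi_bar, odds_ratio, alpha=0.05, beta=0.10):
    """Approximate n for testing H0: gamma = 0 in logit(pi) = a + gamma*x.

    pi_bar     -- guessed probability of success at the mean of x
    odds_ratio -- theta, odds at one SD above the mean of x relative to odds at the mean
    """
    k = math.log(odds_ratio)
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value (assumed)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # Correction factor delta (assumed form, see lead-in)
    delta = (1 + (1 + k**2) * math.exp(5 * k**2 / 4)) / (1 + math.exp(-k**2 / 4))
    n = ((z_alpha + z_beta * math.exp(-k**2 / 4)) ** 2
         * (1 + 2 * pi_bar * delta) / (pi_bar * k**2))
    return math.ceil(n)


# Hypothetical inputs: success probability 0.08 at the mean of x,
# rising to 0.12 one standard deviation above the mean.
theta = (0.12 / 0.88) / (0.08 / 0.92)
print(hsieh_sample_size(0.08, theta))
```

Because $k^2$ appears in the denominator, the required sample size grows quickly as the odds ratio approaches 1, which is why weak effects lead to very large values of $n$.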
We now derive the result for a response variable with three categories with ordered probabilities (Equation 11). Equation 11 yields a method of obtaining the $\sigma_a$ used in (3.0), namely $\operatorname{Var}(\gamma')$. Hence, we can generalize it to tests of hypotheses involving multiple parameters.

RESULTS
We illustrate using data on diabetic patients from the University College Hospital, Ibadan. The data cover 10678 reported cases of patients with diabetes. The sample size is estimated using the method proposed by Hsieh (1989) in Equation 9. Assume $\pi = 0.817013$ for testing the hypothesis $H_0: \gamma_1 = 0$ against the alternative $H_a: \gamma_1 \ne 0$, from Table 1a. If we now consider the effects of smoking and drinking alcohol on induced diabetes, i.e., two predictors, the output in Table 1b shows that both coefficients have a negative effect. However, the log-likelihood ratio test for model selection supports the full model with two predictors, with 2.54 > 0.2803, the value of $\chi^2$ with 1 df.

Table 1a. Logit Diabetes versus Smoking
Since the pseudo $R^2$ is 0.0002, there is hardly any multiple correlation between the predictors and the response variable. The odds ratios in Tables 1a and 1b show that the odds of having diabetes are approximately 6% lower for a smoker than for a non-smoker, given that all other variables remain constant. The odds of having diabetes for an individual addicted to alcohol are just 0.6% lower than for those who do not drink alcohol. Although these results may look odd, the p-values for smokers (0.113) and for individuals addicted to alcohol (0.872) are not significant, meaning that neither factor contributes appreciably to diabetes in these data. This supports the result indicated by the pseudo $R^2$.
For two predictors we compute $n_2$ from $n_1$, where $n_1$ is the $n$ obtained when we have one predictor. Hence $n_2$ ≅ 56,946. Therefore, we require almost 57,000 observations for testing $H_0: \gamma_1 = 0$.
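The formula linking $n_2$ to $n_1$ did not survive in the text. One common adjustment, described by Hsieh (1989) and Agresti (2007), divides the one-predictor sample size by $(1 - R^2)$, where $R$ is the multiple correlation between the predictor being tested and the other predictors in the model. The sketch below assumes that this is the step intended here; the function name and the inputs are hypothetical.

```python
import math


def adjust_for_multiple_predictors(n1, r_squared):
    """Inflate the one-predictor sample size n1 by 1 / (1 - R^2).

    r_squared -- squared multiple correlation between the tested predictor
                 and the remaining predictors (assumed interpretation).
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return math.ceil(n1 / (1.0 - r_squared))


# Hypothetical inputs, not the paper's values.
print(adjust_for_multiple_predictors(1000, 0.0))   # no correlation: n is unchanged
print(adjust_for_multiple_predictors(1000, 0.5))   # strong correlation: n doubles
```

Under this adjustment a value of $R^2$ close to zero leaves the sample size essentially unchanged, which is consistent with the observation that the one-predictor and two-predictor sample sizes are approximately the same.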
Using the above information, we obtained the following results from our simulation of sample sizes for Equations 8 and 12, respectively. The Monte Carlo method was applied for selected values of α = 0.05, β = 0.1 and $e^{\alpha_1}$ = 0.05, with γ > 0 and k = 2, when the explanatory variable has the standard normal distribution. The results in Table 1 show that the approximation (3.0) is suitable when the response probabilities are small, but it always underestimates the sample size.
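A Monte Carlo check of a candidate sample size can be sketched as follows. This is not the authors' simulation code: it assumes a binary response (a simplification of the ordinal model), a one-sided Wald test of $H_0: \gamma = 0$, and a plain Newton-Raphson fit. The standard normal covariate and α = 0.05 follow the settings quoted above. All function names and the parameter values in the usage lines are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_simple_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Newton-Raphson MLE for logit(pi) = a + g*x.

    Returns (a, g, se_g), or None if the fit does not converge.
    """
    a, g = 0.0, 0.0
    for _ in range(max_iter):
        s0 = s1 = i00 = i01 = i11 = 0.0   # score vector and information matrix
        for x, y in zip(xs, ys):
            eta = max(-30.0, min(30.0, a + g * x))   # guard against overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            s0 += y - p
            s1 += (y - p) * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        if det <= 1e-12:                  # degenerate sample (e.g. all responses equal)
            return None
        da = (i11 * s0 - i01 * s1) / det
        dg = (i00 * s1 - i01 * s0) / det
        a, g = a + da, g + dg
        if abs(da) + abs(dg) < tol:
            return a, g, math.sqrt(i00 / det)
    return None


def simulated_power(n, a, g, alpha=0.05, reps=200, seed=1):
    """Share of simulated samples of size n in which the Wald test rejects gamma = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided test for gamma > 0
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standard normal covariate
        ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(a + g * x))) else 0
              for x in xs]
        fit = fit_simple_logistic(xs, ys)
        if fit is not None and fit[1] / fit[2] > z_crit:
            rejections += 1                # non-converged fits count as non-rejections
    return rejections / reps


# Hypothetical settings: intercept 0, slope 0.8, candidate n = 200.
print(simulated_power(200, 0.0, 0.8))
```

Comparing the simulated power with the target $1-\beta$ at the $n$ returned by a closed-form approximation indicates whether that approximation under- or overestimates the required sample size. With small response probabilities, samples with very few events become common, so more replications and a more careful fitting routine would be needed than in this sketch.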

DISCUSSION
According to the results of this study, the estimates of the parameters and sample sizes were obtained from both the real-life diabetes data and the simulation study (Tables 1 and 2). With the real-life data, the sample size obtained with one predictor is approximately the same as that obtained with two predictors. The approximation with the correction term (3.4) performs better than the approximation (3.0) when the response probabilities are small, but it greatly overestimates the sample size when the response probabilities are large. The sample sizes from the simulation are shown graphically in Fig. 1-3. Since the sample sizes depend on the two parameters γ and $\alpha_1$ simultaneously, we fixed one parameter and varied the other. When the two parameters were changed simultaneously, the estimated sample sizes fluctuated too much.
Table 2. Estimates of sample sizes for both equations (3.0) and (3.4): (k = 2, $e^{\alpha_1}$ = 0.05), (k = 2, $e^{\alpha_1}$ = 0.25), (k = 2, $e^{\alpha_1}$ = 0.5)

CONCLUSION
This study has developed a methodological framework to estimate the parameters of logistic regression and to obtain sample sizes at different levels of α and β. We have also proposed sample size calculation methods for logistic regression to test statistical hypotheses, and we have considered testing multiple parameters. We gave a simple closed-form formula for approximate sample sizes when the probabilities of the response categories are small. The results showed that the approximation with a correction term (Equation 12) performs better than the approximation in Equation 8 when the response probabilities are small, but it greatly overestimates the sample size when the response probabilities are large.