Interface between the Ratio β with Area Under the ROC Curve and Kullback-Leibler Divergence Under the Combination of Half Normal and Rayleigh Distributions

Email: rvvcrr@gmail.com

Abstract: Classifying objects or individuals is a common problem of interest. The Receiver Operating Characteristic (ROC) curve is one tool that helps classify objects or individuals into one of two known groups or populations. The present work proposes a hybrid version of the ROC model. The test scores of the two populations, normal and abnormal, usually follow some particular distribution; in this study, the test scores of the normal population are assumed to follow a Half Normal distribution and those of the abnormal population a Rayleigh distribution. The characteristics of the proposed ROC model, along with measures such as the Area Under the Curve (AUC) and the Kullback-Leibler Divergence (KLD), are derived and demonstrated using a real data set and simulated data sets.


Introduction
The Receiver Operating Characteristic (ROC) curve is a classification tool that is widely used to evaluate the accuracy of a test. The ROC Curve is a graphical plot of the true positive rate against the false positive rate (Green and Swets, 1966). This tool helps in classifying individuals/subjects into one of two groups, normal and abnormal, by using a threshold. Usually, the test score (S) obtained from the set of individuals is continuous and follows a certain distribution. In the ROC literature, many models have been proposed based upon bi-distributional assumptions, such as bi-normal (Egan, 1975), bi-lognormal (Dorfman and Alf, 1968; 1969), bi-gamma (Hussain, 2012) and many more. Recently, a new type of ROC Curve was developed based upon a mixture of two distributions, namely the Half Normal and Exponential distributions, and is referred to as the Hybrid ROC Curve (Balaswamy et al., 2015).
In the present work, it is assumed that the test scores of the normal and abnormal populations follow Half Normal (HN) and Rayleigh (RL) distributions respectively. The motivation for considering the Rayleigh rather than the Exponential distribution is its mathematical tractability and its close relationship to other distributions. Moreover, the concept of ROC evolved from the analysis of radar signals (Signal Detection Theory), and an important application of the Rayleigh distribution is the analysis and assessment of signals received by and from receivers. Hence, the Rayleigh distribution is an apt choice for capturing the scatter in the abnormal population, and it helps in identifying the status of the objects/individuals. For further information on the applicability and ease of the Rayleigh over the Exponential distribution, refer to Meintanis and Iliopoulos (2003).
In the subsequent sections, expressions are derived for the intrinsic measures Sensitivity (S_n) and Specificity (S_p), the accuracy measure AUC and the divergence measure KLD. The proposed methodology is supported using a real data set (APACHE II) and simulation studies with various combinations of scale parameters for different sample sizes.

Methodology
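Let X and Y denote the test scores from the normal (H) and abnormal (D) populations, which follow Half Normal and Rayleigh distributions respectively. The cumulative distribution function and probability density function of the Half Normal distribution are:

$$F(x) = 2\,\Phi\!\left(\frac{x}{\sigma_H}\right) - 1, \quad x \ge 0 \tag{1}$$

$$f(x) = \frac{\sqrt{2}}{\sigma_H\sqrt{\pi}}\,\exp\!\left(-\frac{x^2}{2\sigma_H^2}\right), \quad x \ge 0 \tag{2}$$

where σ_H is the scale parameter and Φ(⋅) is the standard normal cumulative distribution function. The cumulative distribution function and probability density function of the Rayleigh distribution are:

$$G(y) = 1 - \exp\!\left(-\frac{y^2}{2\sigma_D^2}\right), \quad y \ge 0 \tag{3}$$

$$g(y) = \frac{y}{\sigma_D^2}\,\exp\!\left(-\frac{y^2}{2\sigma_D^2}\right), \quad y \ge 0 \tag{4}$$

where σ_D is the scale parameter. In classification, the ROC Curve is a graphical plot that describes the performance of a binary classifier as its discrimination threshold is varied. The curve is generated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR).
The FPR, derived from the HN distribution, is defined as:

$$\mathrm{FPR} = P(X > t) = 1 - F(t) = 2\left[1 - \Phi\!\left(\frac{t}{\sigma_H}\right)\right] \tag{5}$$

On further simplification, the threshold value t can be obtained by the formula:

$$t = \sigma_H\,\Phi^{-1}\!\left(1 - \frac{\mathrm{FPR}}{2}\right) \tag{6}$$

where Φ^{-1}(⋅) is the inverse cumulative standard normal distribution function. Similarly, the TPR, derived from the Rayleigh distribution, is defined as:

$$\mathrm{TPR} = P(Y > t) = \exp\!\left(-\frac{t^2}{2\sigma_D^2}\right) = \exp\!\left(-\frac{\beta^2}{2}\left[\Phi^{-1}\!\left(1 - \frac{\mathrm{FPR}}{2}\right)\right]^2\right) \tag{7}$$

Here β = σ_H/σ_D, and the above equation is the expression for the ROC Curve based on the HN and Rayleigh distributions. The ROC Curve is generated through the coordinates (1 − S_p, S_n); here the false positive rate (1 − S_p) is derived from the HN distribution and the true positive rate (S_n) from the Rayleigh distribution, hence the proposed ROC model (7) is referred to as the Hybrid ROC (HROC) Curve.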
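The hybrid ROC relation described above, in which the FPR comes from the Half Normal survival function and the TPR from the Rayleigh survival function, can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names `hroc_threshold` and `hroc_tpr` are hypothetical and are not taken from the original work.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, used for its inverse CDF


def hroc_threshold(fpr: float, sigma_h: float) -> float:
    """Threshold t with P(X > t) = fpr for Half Normal X with scale sigma_h.

    Valid for fpr in (0, 1]; t = sigma_h * Phi^{-1}(1 - fpr / 2).
    """
    return sigma_h * STD_NORMAL.inv_cdf(1.0 - fpr / 2.0)


def hroc_tpr(fpr: float, beta: float) -> float:
    """TPR on the hybrid ROC curve at a given FPR, where beta = sigma_H / sigma_D."""
    p = 1.0 - fpr / 2.0
    if p >= 1.0:  # FPR is (numerically) zero, so the threshold is at infinity
        return 0.0
    z = STD_NORMAL.inv_cdf(p)                # equals t / sigma_H
    return math.exp(-0.5 * (beta * z) ** 2)  # Rayleigh survival function at t


if __name__ == "__main__":
    # The three beta values are the ones quoted for Fig. 1.
    for beta in (0.2370, 1.0122, 1.6833):
        points = [(fpr, round(hroc_tpr(fpr, beta), 4)) for fpr in (0.1, 0.5, 0.9)]
        print(beta, points)
```

Because the curve depends on the scale parameters only through β, a single argument is enough to trace the whole curve; at any fixed FPR, a smaller β gives a larger TPR.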
The proposed ROC Curve depends entirely on the ratio β; as β varies, the shape of the ROC Curve varies accordingly. Three typical forms of the ROC Curve, drawn at different values of β, are shown in Fig. 1. The first is an example of a better case (dashed line) with β = 0.2370, the second is a moderate case (dotted line) with β = 1.0122, and the third is a worst case (dash-dotted line) with β = 1.6833. Fig. 1 illustrates that the proposed ROC Curve is governed by β, and from various experiments it is observed that β ranges from 0.1 to 1.8472, i.e., β ∈ [0.1, 1.8472]. In other words, as β attains a lower value, the ROC Curve shifts towards the top left corner; as β increases, the curve moves towards the chance line.
The expression for the accuracy measure AUC can be obtained by integrating the ROC expression (7) over [0, 1] as:

$$\mathrm{AUC} = \int_0^1 \exp\!\left(-\frac{\beta^2}{2}\left[\Phi^{-1}\!\left(1 - \frac{x}{2}\right)\right]^2\right) dx$$

where x denotes the FPR. On further simplification, the closed form for AUC is as follows (for proof see Appendix):

$$\mathrm{AUC} = \frac{1}{\sqrt{1 + \beta^2}}$$

In conventional ROC methodology, a test or procedure is considered useful only if its AUC lies above the chance line, i.e., AUC ∈ [0.5, 1]. The larger the AUC, the better the discriminating power of the test considered.
In the next subsection, the scale parameters of both populations are estimated using the method of maximum likelihood estimation.
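As a numerical check on the AUC, the hybrid ROC curve can be integrated over the FPR and compared with the value 1/√(1 + β²). That value follows from integrating the Rayleigh survival function against the Half Normal density; it is stated here as this sketch's own derivation. The midpoint rule, the grid size `n` and the function names are illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def hroc_tpr(fpr: float, beta: float) -> float:
    """TPR = exp(-(beta^2 / 2) * [Phi^{-1}(1 - fpr / 2)]^2), beta = sigma_H / sigma_D."""
    p = 1.0 - fpr / 2.0
    if p >= 1.0:
        return 0.0
    z = STD_NORMAL.inv_cdf(p)
    return math.exp(-0.5 * (beta * z) ** 2)


def auc_numeric(beta: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of the ROC curve over [0, 1]."""
    h = 1.0 / n
    return h * sum(hroc_tpr((i + 0.5) * h, beta) for i in range(n))


def auc_closed_form(beta: float) -> float:
    """P(Y > X) for Half Normal X and Rayleigh Y, written in terms of beta."""
    return 1.0 / math.sqrt(1.0 + beta ** 2)


if __name__ == "__main__":
    for beta in (0.2370, 1.0122, 1.6833):
        print(beta, round(auc_numeric(beta), 4), round(auc_closed_form(beta), 4))
```

For β = 1 the closed form gives about 0.707, which agrees with the AUC near 70% reported for the equal-scale experiment.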

Estimation of Parameters
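Here, X and Y are independent random variables from the HN and RL distributions with scale parameters σ_H and σ_D respectively. The likelihood and log-likelihood functions of the two distributions are given below and are used to obtain the estimates of the parameters σ_H and σ_D.
The likelihood and log-likelihood functions of the HN distribution, for a sample x_1, …, x_n, are:

$$L(\sigma_H) = \prod_{i=1}^{n} \frac{\sqrt{2}}{\sigma_H\sqrt{\pi}}\,\exp\!\left(-\frac{x_i^2}{2\sigma_H^2}\right)$$

$$\log L(\sigma_H) = \frac{n}{2}\log\frac{2}{\pi} - n\log\sigma_H - \frac{1}{2\sigma_H^2}\sum_{i=1}^{n} x_i^2$$

Setting the derivative with respect to σ_H to zero gives the estimate

$$\hat{\sigma}_H = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$$

Similarly, the likelihood and log-likelihood functions of the Rayleigh distribution, for a sample y_1, …, y_m, are:

$$L(\sigma_D) = \prod_{j=1}^{m} \frac{y_j}{\sigma_D^2}\,\exp\!\left(-\frac{y_j^2}{2\sigma_D^2}\right)$$

$$\log L(\sigma_D) = \sum_{j=1}^{m}\log y_j - 2m\log\sigma_D - \frac{1}{2\sigma_D^2}\sum_{j=1}^{m} y_j^2$$

which gives the estimate

$$\hat{\sigma}_D = \sqrt{\frac{1}{2m}\sum_{j=1}^{m} y_j^2}$$

In the next subsection, the well-known proximity measure, the Kullback-Leibler Divergence (KLD), is used to measure the distance between both populations in the context of classification.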
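The maximum likelihood estimates of the two scale parameters have simple closed forms: the root mean square of the scores for the Half Normal scale, and the root of half the mean square for the Rayleigh scale. The sketch below applies them to a small hypothetical set of scores; the data values and function names are made up purely for illustration and do not come from the study.

```python
import math


def mle_half_normal_scale(x):
    """MLE of the Half Normal scale: sqrt(sum(x_i^2) / n)."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mle_rayleigh_scale(y):
    """MLE of the Rayleigh scale: sqrt(sum(y_j^2) / (2 m))."""
    return math.sqrt(sum(v * v for v in y) / (2 * len(y)))


if __name__ == "__main__":
    # Hypothetical test scores, for illustration only.
    normal_scores = [0.1, 0.4, 0.2, 0.3]
    abnormal_scores = [1.0, 0.5, 1.5, 0.8]

    sigma_h_hat = mle_half_normal_scale(normal_scores)
    sigma_d_hat = mle_rayleigh_scale(abnormal_scores)
    beta_hat = sigma_h_hat / sigma_d_hat  # plug-in estimate of the ratio beta

    print(round(sigma_h_hat, 4), round(sigma_d_hat, 4), round(beta_hat, 4))
```

The ratio of the two estimates is the plug-in estimate of β that drives the shape of the fitted curve.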

The Proximity Measure KLD in Classification
In information theory, the Kullback-Leibler Divergence (KLD) is used to measure the proximity between two density functions and is usually defined on the likelihood ratio (Cover and Thomas, 1991). In the present work, KLD measures the distance between the two density functions used to construct the ROC Curve, since the misclassification rate depends entirely on the overlapping area of both populations, and this area varies as the distance between the populations varies. The divergence measure KLD is defined on two probability density functions f and g as (Kullback and Leibler, 1951):

$$\mathrm{KLD}[f\,\|\,g] = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx$$

This KLD measure has become popular in classification theory to explain the extent of accuracy in terms of closeness between two densities and the asymptotic properties of the ROC Curve. Recently, KLD has been used with the Bi-normal ROC Curve, the Bi-exponential ROC model and the Bi-gamma ROC model to explain their symmetric and asymmetric properties (Hughes and Bhattacharya, 2013). Further, Balaswamy et al. (2014) used KLD in classification to study and interpret the Single Truncated ROC (STROC) Curve and to identify the closeness between the distributions of the normal and abnormal populations. In the same paper, they also provided the functional relationship between the slope of the STROC Curve and KLD, and then explained the asymmetric properties of the STROC Curve.
Fig. 1 illustrated the importance of β for the shape of the curve. The slope of the ROC Curve at threshold t can also be defined using the ratio of the two probability density functions:

$$\mathrm{slope}(t) = \frac{g(t)}{f(t)}$$

where f(t) and g(t) are the densities of the normal and abnormal populations. As the parameters of f(t) and g(t) vary, the shape of the ROC Curve varies accordingly. Using Equations 2 and 4, the slope of the proposed ROC Curve is:

$$\frac{g(t)}{f(t)} = \frac{\sqrt{\pi}\,\sigma_H\,t}{\sqrt{2}\,\sigma_D^2}\,\exp\!\left[-\frac{t^2}{2}\left(\frac{1}{\sigma_D^2} - \frac{1}{\sigma_H^2}\right)\right]$$

Further, the Kullback-Leibler Divergence (KLD) can be defined in the context of ROC Curve analysis through the logarithm of the slope (Balaswamy et al., 2014):

$$\log\frac{g(t)}{f(t)} = \log\!\left(\frac{\sqrt{\pi}\,\sigma_H\,t}{\sqrt{2}\,\sigma_D^2}\right) - \frac{t^2}{2}\left(\frac{1}{\sigma_D^2} - \frac{1}{\sigma_H^2}\right) \tag{10}$$

Therefore, taking the expectation of both sides of Equation 10 with respect to g, we have:

$$\mathrm{KLD}[g\,\|\,f] = E_g\!\left[\log\frac{g(t)}{f(t)}\right] = \int_0^\infty g(t)\,\log\frac{g(t)}{f(t)}\,dt$$

Further simplification expresses KLD[g||f] in terms of the scale parameters. In a similar way, KLD[f||g] can be defined as:

$$\mathrm{KLD}[f\,\|\,g] = E_f\!\left[\log\frac{f(t)}{g(t)}\right] = \int_0^\infty f(t)\,\log\frac{f(t)}{g(t)}\,dt$$

Therefore, to derive KLD[f||g], the inverse slope of the proposed ROC Curve is:

$$\frac{f(t)}{g(t)} = \frac{\sqrt{2}\,\sigma_D^2}{\sqrt{\pi}\,\sigma_H\,t}\,\exp\!\left[\frac{t^2}{2}\left(\frac{1}{\sigma_D^2} - \frac{1}{\sigma_H^2}\right)\right]$$

In the next section, simulation studies are conducted to illustrate the proposed methodology, followed by the results and discussion.
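The two divergences can also be approximated numerically as the expectation of the log density ratio under each population. The sketch below assumes that f is the Half Normal density of the normal population and g is the Rayleigh density of the abnormal population, and that KLD[p||q] denotes the integral of p·log(p/q). The midpoint rule, the truncation of the integral at ten times the larger scale and all function names are assumptions made for illustration. Working with log densities avoids division by values that underflow to zero in the tails.

```python
import math


def log_hn_pdf(t: float, sigma_h: float) -> float:
    """Log density of the Half Normal distribution with scale sigma_h (t > 0)."""
    return 0.5 * math.log(2.0 / math.pi) - math.log(sigma_h) - t * t / (2.0 * sigma_h ** 2)


def log_rl_pdf(t: float, sigma_d: float) -> float:
    """Log density of the Rayleigh distribution with scale sigma_d (t > 0)."""
    return math.log(t) - 2.0 * math.log(sigma_d) - t * t / (2.0 * sigma_d ** 2)


def kld(log_p, log_q, upper: float, n: int = 50_000) -> float:
    """Midpoint-rule approximation of the integral of p * log(p / q) over (0, upper)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        lp = log_p(t)
        total += math.exp(lp) * (lp - log_q(t))
    return total * h


def hroc_klds(sigma_h: float, sigma_d: float):
    """Return (KLD[g||f], KLD[f||g]) for f = Half Normal and g = Rayleigh."""
    upper = 10.0 * max(sigma_h, sigma_d)  # tail mass beyond this point is negligible
    log_f = lambda t: log_hn_pdf(t, sigma_h)
    log_g = lambda t: log_rl_pdf(t, sigma_d)
    return kld(log_g, log_f, upper), kld(log_f, log_g, upper)


if __name__ == "__main__":
    kld_gf, kld_fg = hroc_klds(1.0, 1.0)
    print(round(kld_gf, 4), round(kld_fg, 4))
```

Both quantities are reported in nits because natural logarithms are used.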

Results and Discussion
Simulation studies and a real data set (APACHE II) are used to demonstrate the proposed HROC Curve and its behavior.

Simulation Studies
The proposed methodology is demonstrated using simulation studies with various combinations of the scale parameters of both populations for different sample sizes. The computations are based on sample estimates obtained from each simulated data set using maximum likelihood estimation. The simulation studies are conducted in four experiments. In the first experiment, three combinations of scale parameters are considered by varying the scale of the normal population, σ_H = {0.2, 0.3, 0.4}, with fixed variability in the abnormal population, σ_D = 0.8. The second experiment considers four combinations of scale parameters, fixing the variability in the normal population at σ_H = 0.5 and varying the scale of the abnormal population, σ_D = {1.2, 1.5, 2, 2.5}. The third experiment considers higher variability in the normal population, σ_H = 2, than in the abnormal population, σ_D = {1.5, 1.2, 1.15}. Finally, equal variability is considered in both populations, with σ_H = 1 and σ_D = 1, in experiment 4. Table 1 reports the sample estimates of the scale parameters and the ratio β̂, along with the accuracy measure AUC and the proximity measures KLD[g||f] and KLD[f||g].
As the ratio β̂ increases, the accuracy measure AUC decreases along with the corresponding KLD values (experiment 1). That is, as its scale parameter increases, the density of the normal population becomes more dispersed, which enlarges the overlapping area of both populations. Therefore, the distance between both populations becomes smaller, leading to reduced KLD and accuracy (AUC) values.
Further, as the β̂ values decrease, the AUC and KLD tend to increase, indicating better accuracy (experiment 2), because the change in the density of the abnormal population influences the overlapping area of both populations. That is, as the scale of the abnormal population increases, its density curve shifts to the right, with its peak moving away from the normal population, and this creates a larger distance between both populations.
The third experiment is conducted with AUC near 50% to illustrate the worst-case scenario of the proposed methodology. From experiment 3, it is reasonable to conclude that β̂ can take values from a minimum of 0.1 up to 1.8472 for the proposed ROC Curve. If β̂ lies beyond this interval, the normal and abnormal populations are inverted, which corresponds to the worst case of binary classification. In experiment 4, equal scale parameters are considered to study the behavior of the proposed ROC Curve; this experiment can be treated as a moderate case of classification, with AUC near 70% along with a higher distance between both populations.
From Fig. 2a, it can be seen that the accuracy of a test (AUC) increases when the ratio β decreases, since the overlapping area shrinks as β decreases. Similarly, the distance between both populations is measured using KLD, which explains the proximity between both populations. From Fig. 2b, it can be seen that KLD[g||f] is higher than KLD[f||g], as explained in the methodology. As the ratio β decreases, the KLD value increases, which indicates better separation between the two populations (Fig. 2b).
Further, ROC Curves are plotted to show the behavior of the proposed ROC Curve for the considered combinations of scale parameters. Figure 3 depicts the different forms of the proposed ROC Curve. From Fig. 3a, it can be seen that increasing the scale parameter of the normal population moves the ROC Curve towards the chance line, which means that the larger overlapping area between both populations is due to the increased scale in the normal population. From Fig. 3b, it can be seen that the ROC Curve moves towards the top left corner of the plot as the overlapping area reduces, because of the increased scale in the abnormal population. Further, Fig. 3c depicts the case of overlap in which similar variability is observed in both the normal and abnormal populations.
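One run of the first experiment can be sketched as follows. Half Normal scores are drawn as absolute values of zero-mean Gaussian draws, and Rayleigh scores by inverting the Rayleigh distribution function. The sample size of 500, the seed and the function names are assumptions made for illustration; the AUC is computed from the plug-in value 1/√(1 + β̂²), which is this sketch's own closed form for P(Y > X) under the two distributions.

```python
import math
import random


def simulate_scores(sigma_h: float, sigma_d: float, n: int, rng: random.Random):
    """Draw n Half Normal scores (normal group) and n Rayleigh scores (abnormal group)."""
    x = [abs(rng.gauss(0.0, sigma_h)) for _ in range(n)]
    # Inverse-CDF sampling: G(y) = 1 - exp(-y^2 / (2 sigma_d^2)).
    y = [sigma_d * math.sqrt(-2.0 * math.log(1.0 - rng.random())) for _ in range(n)]
    return x, y


def summarize(x, y):
    """Return (sigma_H hat, sigma_D hat, beta hat, plug-in AUC) from the two samples."""
    sigma_h_hat = math.sqrt(sum(v * v for v in x) / len(x))
    sigma_d_hat = math.sqrt(sum(v * v for v in y) / (2 * len(y)))
    beta_hat = sigma_h_hat / sigma_d_hat
    auc_hat = 1.0 / math.sqrt(1.0 + beta_hat ** 2)
    return sigma_h_hat, sigma_d_hat, beta_hat, auc_hat


if __name__ == "__main__":
    rng = random.Random(2015)  # arbitrary seed, fixed for reproducibility
    for sigma_h in (0.2, 0.3, 0.4):  # experiment 1: sigma_D fixed at 0.8
        x, y = simulate_scores(sigma_h, 0.8, n=500, rng=rng)
        print(sigma_h, [round(v, 4) for v in summarize(x, y)])
```

Repeating the loop over the other parameter combinations and sample sizes would give the remaining rows of a table in the style of Table 1.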

Real Data Set
The real data set concerns the ICU scoring system Acute Physiology and Chronic Health Evaluation II (APACHE II). Patients admitted to the Intensive Care Unit (ICU) have a wide range of underlying pathologies and physiological abnormalities. Scoring systems have been developed to allow comparisons of outcome between these patients. The most commonly used scoring system is APACHE II, which assumes that there is a strong and consistent underlying relationship between acute physiological derangement and the risk of death during acute illness. The APACHE II score is derived from 11 physiological variables, the Glasgow Coma Score (GCS), and the patient's age and chronic health status. The data set consists of a total of 111 respondents, of whom 66 are alive and 45 dead.
From this data set it is observed that the APACHE II scores of patients who died follow a Rayleigh distribution (KS-Statistic = 0.0681; p-value = 0.9758 at the 0.05 level of significance), whereas the scores of patients who are alive follow a Half Normal distribution (KS-Statistic = 0.1499; p-value = 0.0927 at the 0.05 level of significance). Therefore, this scoring variable is used to predict mortality using the proposed methodology. The results for the prognosis of disease are reported in Table 2. From Table 2, it is observed that the accuracy of the test is 68.66%, with the ratio β = 1.05. This means that the APACHE II score is able to identify the patients with prognosis of disease with 68.66% correct classification. In other words, the APACHE II score can distinguish survivors from non-survivors with an accuracy of 68.66%, and the distance between the normal and abnormal populations is 0.3957 nits with respect to the abnormal population and 0.2316 nits with respect to the normal population. This means that the distance between the alive and dead populations reflects the extent of the accuracy of APACHE II (AUC = 0.6866). Further, the ROC Curve obtained for the biomarker APACHE II lies uniformly above the chance line (Fig. 4), with an accuracy of about 68.66%.

Conclusion
The present work proposes a new ROC model based on a mixture of two distributions, namely the HN and Rayleigh distributions, and uses the proximity measure KLD to measure the distance between both populations. The ROC model developed in this study depends entirely on the ratio β, and the characteristics of the curve are discussed through the behavior of β. From the results obtained, the ratio β lies between 0.1 and 1.8472. As β approaches 0.1, the ROC Curve moves towards the top left corner of the plot with better accuracy, and as β approaches 1.8472, the ROC Curve moves nearer to the chance line; i.e., the higher the value of β, the lower the accuracy, and vice versa (Fig. 2a and b). Further, the KLD computed to measure the distance is found to be largest at the smallest values of β and smallest at the larger values of β. An interesting observation is that KLD and AUC convey similar information with respect to β, since KLD explains the distance between both populations and AUC explains the overlapping area of both populations.
Further, the real data set for the detection of mortality using APACHE II in the ICU shows an accuracy of 68.66%, with the ratio β = 1.05. This means that the APACHE II score is able to identify the patients with prognosis of disease with 68.66% correct classification. In other words, the APACHE II score can distinguish survivors from non-survivors with an accuracy of 68.66%.