Statistical Tool for Testing Agreement Level on Continuous Datasets

Corresponding Author: Basavarajaiah Mariyappa Doddagangavadi, Karnataka Veterinary, Animal and Fisheries Sciences University (B), India. Email: sayadri@gmail.com

Abstract: Various analytical studies have explored new methods for testing agreement levels in the medical and life sciences, which can be simulated by Cohen's kappa (κ) in practical applications. The past medical literature suggests that research gaps exist in the statistical methods for measuring agreement between two or more observers with Cohen's κ; these methods have salient properties and analytical characteristics on qualitative data for testing research hypotheses. The κ statistic is computed from a few parameters estimated from the observed data at one point of time (t), which restricts the experimenter from measuring and comparing the extent of agreement at varied time intervals t'. To address this drawback, the present article focuses on testing agreement levels on real-valued data using iterative methods such as the bootstrap and the Thompson method (a measurement based on central tendency), which extrapolate predicted values and standard errors (SE) of agreement on a continuous data scale. The formulated model can measure and compare the parameters of interest and can estimate agreement between two or more observers using a ranking scale (converted into a random scale of measurement) in the same population at varied time intervals t'. The findings depend heavily on the exact binomial and Poisson distributions with the same dichotomous classification of the disease conditions.
Results of the bootstrap technique are stronger at the greatest epoch than κ and provide consistent predictions of different observer agreement levels without any scale bias. The model also demonstrates how to examine various kinds of distributions at the population level. Importantly, the derived model explores fiducial limits of the parameters on the basis of agreement drawn at different time intervals t1, t2, ..., tn (using real-life datasets from an anti-leprosy vaccine trial conducted in south India). We found that the model results show coaxial changes between various factors, viz. (i) the distribution of the sample estimators is non-Gaussian, (ii) the variance is underestimated and (iii) the confidence limits are asymmetric for normally distributed data sets.


Introduction
In practice, determining the prevalence of diseases such as leprosy and tuberculosis, and disease investigations generally, often depends on the results of population screening or new diagnostic tests (Ayoub and Elgammal, 2018; Bakeman and Quera, 2011). None of these methods or procedures can be considered perfect (Byrt et al., 1993); the classification of such results associates a quantum of response variables classified into two categories, namely positive or negative, normal or abnormal, etc. (Brennan and Prediger, 1981; Carpentier et al., 2017; Cohen, 1960). Medical research is commonly conducted at varied time intervals (Cohen, 1968; Feinstein and Cicchetti, 1990). Since the accuracy of results is undoubtedly important, both for testing the resulting findings and for improving their accuracy, standardization of the methodology between any pair of observers is usually adopted (Field and Welsh, 2007). This can be evaluated by computing the usual measure of agreement, namely concordance or crude agreement (po) (Fleiss and Cohen, 1973; Feinstein and Cicchetti, 1990; Hoehler, 2000). The well-known measure of agreement kappa (κ) takes care of chance agreement; when the rating scale is multi-categorical (as opposed to binary 'positive' or 'negative'), the weighted κ simulates accurate results by assigning a weight to each parameter (each cell of the n × n Table 1) (Bakeman and Quera, 2011; Byrt et al., 1993), who discussed the different measures of inter-observer agreement and the desirable properties of κ at a defined time interval 't'.
In appraising the use of κ, a long list of literature is available concerning the observed paradoxes in its interpretation, with a few real practical illustrations and useful recommendations to overcome them (Carpentier et al., 2017; Cohen, 1960; 1968; Feinstein and Cicchetti, 1990; Field and Welsh, 2007; Fleiss and Cohen, 1973). However, estimation of κ and its interpretation through a more generalized approach is still to be attempted. Given this research gap, the present study examines the changes between various levels on continuous data sets (a random process) and the interpretation of κ at varied time intervals 'ti'. For example, the same or different observers may assess disease conditions on a continuous scale at the population level (leprosy screening in selected sites at varied times 't' and at various levels: rural, urban and peri-urban districts); the raters ascribe rates using a ranking scale appended with different geographical locations and varied time intervals, and the assigned ranks are then extrapolated (rates converted into a random scale, i.e., different attribute scales converted into numerical form, with time treated as a random variable). The extent of improvement of agreement between any pair of observers is treated as a random variable, with all observations randomly distributed as xij ~ N(μ, σ²), which ensures a check on the randomness of the raters. Agreement between two or more observers depends solely on the distribution and the classification of attributes at the population level (Sim and Wright, 2005; Kraemer and Bloch, 1988; Kang et al., 2013; McHugh, 2012), which can provide fixed agreements between the observers.
Since this tool is practiced in various settings, the measure is here estimated by a newer bootstrap-modified kappa (κ) approach for continuous data sets, applying the same measure of agreement to classifications of unobserved groups and extrapolating predicted rater outcomes together with slopes and root mean squared error (RMSE) values. Landis and Koch (1977) assumed Gaussianity for the classification of parameters in leprosy population screening and derived an expression for the variance (σ²) and the related hypothesis tests. This uncertainty is extrapolated with the exact distribution of the classifications; as such, hypothesis tests and confidence intervals can lead to incorrect results. For example, confidence intervals derived by these methods assume the agreement distribution is symmetric; when the classification of observer agreement (population screening) is headed by the traditional κ and the Gaussian assumption substantially differs from the raters' scale, the traditional κ underestimates the likelihood of the parameters, and the variance (σ²) propagates through incorrect confidence limits (CI) for testing the proportion of agreement made by the different raters. In this paradigm, the present study formulates a new stochastic model, a modified κ, for estimating agreement levels in continuous data series.

Model Formulation
The model uses real-life data sets from an anti-leprosy vaccine trial to demonstrate the modified kappa (κ) via newer bootstrap techniques at the greatest epoch with different iteration counts. We further assess the robustness of the bootstrap using the Thompson iteration method to examine the distribution of the sample estimators and their likelihoods. The following setup was used in formulating the model. The study concentrates on two or more observers with varying classifications; in the most general situation of more than two observers, raters' ranks are converted to a continuous scale (data transformation). Suppose two doctors diagnose leprosy as present (positive) or absent (negative), and let a, b, c and d denote the counts of '+ +', '+ −', '− +' and '− −' pairs out of n subjects. The proportions of subjects classified as positive by the two observers are p1 = (a + b)/n and p2 = (a + c)/n, the crude agreement is po = (a + d)/n and the expression for the chance agreement between the two observers is pe = p1p2 + (1 − p1)(1 − p2), giving κ = (po − pe)/(1 − pe).
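The crude agreement, chance agreement and κ for a 2 × 2 classification table can be sketched as follows; the counts used in the example call are illustrative, not the trial data.

```python
def crude_agreement(a, b, c, d):
    """Crude (observed) agreement p_o: proportion of concordant pairs."""
    n = a + b + c + d
    return (a + d) / n

def chance_agreement(a, b, c, d):
    """Expected chance agreement p_e from the marginal proportions."""
    n = a + b + c + d
    p1 = (a + b) / n          # proportion rated positive by observer 1
    p2 = (a + c) / n          # proportion rated positive by observer 2
    return p1 * p2 + (1 - p1) * (1 - p2)

def cohen_kappa(a, b, c, d):
    """Cohen's kappa: chance-corrected agreement (p_o - p_e) / (1 - p_e)."""
    po = crude_agreement(a, b, c, d)
    pe = chance_agreement(a, b, c, d)
    return (po - pe) / (1 - pe)

# Illustrative table: a = 40 '+ +', b = 5 '+ -', c = 10 '- +', d = 45 '- -'
po = crude_agreement(40, 5, 10, 45)   # 0.85
k = cohen_kappa(40, 5, 10, 45)        # approximately 0.70
```

Here p_o = 0.85 while p_e = 0.50, so roughly 70% of the agreement beyond chance is retained by κ.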

Cube Root of Product Measures (CRPm)
The CRPm is used to identify the average agreement among the three values; the Fleiss kappa and Krippendorff's alpha agreement are specifically used to escalate the agreement to the greatest accuracy, as demonstrated in the results presented.
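Reading the CRPm literally as the cube root of the product of the three coefficients (Cohen's κ, Fleiss's κ and Krippendorff's α, assumed here to be the "three values" named above), it is simply their geometric mean; a minimal sketch under that assumption:

```python
def crpm(cohen_k, fleiss_k, kripp_alpha):
    """Cube Root of Product Measures: the geometric mean of three
    agreement coefficients (assumes all three are positive)."""
    return (cohen_k * fleiss_k * kripp_alpha) ** (1.0 / 3.0)

# e.g. three hypothetical coefficients near 0.7 average out close to 0.7
summary = crpm(0.70, 0.68, 0.72)
```

The geometric mean penalizes a single very low coefficient more than an arithmetic mean would, which suits a summary that should not overstate agreement.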

The extent of agreement was determined by the quadratic-form property, expressed in the following equation: Agreement measure = (Dm − 1), applicable to both qualitative and quantitative data sets. Practically, we compare and test hypothetical statements on qualitative epilepsy data considering group I and group II subjects. Group I consists of scholars trained by physicians to diagnose epilepsy based on the signs and symptoms of 25 subjects, working in association with a mentor who constitutes group II. Five diagnoses were made by a subjective approach; the implication under test is whether any one of the five possible diagnoses applies to each patient, with each stage of epilepsy signifying different attributes. The continuous scale (transformed data sets) was analysed using R statistical software, and the following findings were generated for testing the agreement values.

Minimum and Maximum Values of po
The minimum and maximum values of po are used to express the Prevalence Index (PI) and Bias Index (BI) in terms of pomin and pomax.
The Prevalence Index (PI) is the difference between the estimated proportions of subjects classified as positive and negative for the whole population, PI = p1 + p2 − 1; it equals 1 only when all screened individuals are rated positive by both observers (p1 = 1 and p2 = 1) and equals 0 when p1 + p2 = 1. It can be seen that |PI| = pomin. The discrepancy between the observers, if any, in assessing the frequency of occurrence of a given condition in a study group is denoted as bias. The Bias Index is BI = |p1 − p2|; its minimum value is 0 when p1 = p2 and its maximum value is 1 when p1 = 1 and p2 = 0 or p1 = 0 and p2 = 1. BI can be expressed as BI = 1 − pomax.
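These indices and the attainable bounds of po follow directly from the 2 × 2 marginals; a small sketch with illustrative counts (not the trial data):

```python
def agreement_indices(a, b, c, d):
    """Prevalence index (PI), bias index (BI) and the bounds on the crude
    agreement p_o implied by the marginals of a 2x2 agreement table."""
    n = a + b + c + d
    p1 = (a + b) / n            # proportion rated positive by observer 1
    p2 = (a + c) / n            # proportion rated positive by observer 2
    PI = p1 + p2 - 1            # equals (a - d)/n; 0 when p1 + p2 = 1
    BI = abs(p1 - p2)           # equals |b - c|/n; 0 when p1 = p2
    po_min = abs(PI)            # smallest crude agreement the margins allow
    po_max = 1 - BI             # largest crude agreement the margins allow
    return PI, BI, po_min, po_max
```

For the illustrative table a = 40, b = 5, c = 10, d = 45, the crude agreement 0.85 indeed lies between po_min = 0.05 and po_max = 0.95.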

Bootstrapping Technique
The nonparametric bootstrap is applied as a statistical tool for estimating the distribution of the attribute data, helping us to draw valid inferences at both the sample and the population level. Its use for estimating agreement between two or more observers is yet to be proposed. The methodological insight is nonparametric bootstrap resampling for the estimation of κ and the other parameters of the data at varying degrees of agreement between the observers. Assume an observed sample of n1 pairs of disease classifications, viz. a1 of '+ +', b1 of '+ −', c1 of '− +' and d1 of '− −', drawn by the two observers while screening the population for leprosy cases: (i) draw a random sample of n1 pairs with replacement; (ii) compute κ on each resample; (iii) repeat to build the bootstrap distribution of κ. If ASLBoot < 0.05, we conclude that the agreement between the two observers at the second resurvey differs significantly from that at the first resurvey. We consider only three methods, viz. the naïve method with unadjusted variance, percentile intervals and bootstrap BCa confidence intervals.
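The resampling step can be sketched as follows, assuming binary ratings coded 0/1 and a percentile interval; the pair counts in the usage example are illustrative, not the trial data:

```python
import random

def cohen_kappa_from_pairs(pairs):
    """Cohen's kappa from a list of (rating1, rating2) binary pairs."""
    n = len(pairs)
    a = sum(1 for r1, r2 in pairs if r1 == 1 and r2 == 1)
    b = sum(1 for r1, r2 in pairs if r1 == 1 and r2 == 0)
    c = sum(1 for r1, r2 in pairs if r1 == 0 and r2 == 1)
    d = n - a - b - c
    po = (a + d) / n
    p1, p2 = (a + b) / n, (a + c) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    return (po - pe) / (1 - pe)

def bootstrap_kappa(pairs, reps=2000, alpha=0.05, seed=42):
    """Nonparametric bootstrap: resample pairs with replacement and
    return the percentile (alpha/2, 1 - alpha/2) interval for kappa."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        cohen_kappa_from_pairs([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps)
    )
    lo = stats[int((alpha / 2) * reps)]
    hi = stats[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Illustrative sample: 40 '+ +', 5 '+ -', 10 '- +', 45 '- -'
pairs = [(1, 1)] * 40 + [(1, 0)] * 5 + [(0, 1)] * 10 + [(0, 0)] * 45
k_hat = cohen_kappa_from_pairs(pairs)
ci_lo, ci_hi = bootstrap_kappa(pairs)
```

Because the percentile interval is read off the empirical bootstrap distribution, it is asymmetric whenever that distribution is skewed, which is the behaviour the naïve variance-based interval cannot reproduce.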

Evaluation of Agreement through Simulations
We evaluated the statistical accuracy of the unobserved sample estimators from the continuous data series of different raters; the results of the bootstrap procedure are presented below. From the first resurvey it is evident that the distribution of the sample estimators of agreement between the two observers is non-normal. Similar findings are noticed for the data at the second resurvey as well as for the combined sample. Table 3 presents the results of adopting the naïve, percentile-interval and BCa methods to obtain point and interval estimates of agreement. All three methods provide the same average estimates. The naïve method for estimating a 95% confidence interval always yields symmetric results, unlike the bootstrap methods, which provide asymmetric intervals for all sample estimators of the agreement. The interval length is lowest for the naïve method; however, its lower and upper confidence limits are both lower than the corresponding limits from the bootstrap percentile and BCa methods. To evaluate the performance of the bootstrap for estimating the agreement parameters, we undertook the same exercise on the data from the second resurvey; these results are also presented in Table 3. Here again, all three methods provide the same point estimates for each agreement parameter, and the naïve confidence intervals are symmetric vis-à-vis the bootstrap percentile and BCa methods. The findings are otherwise similar to those for the first resurvey. The sample estimators of agreement, except for kmax, are consistently higher at the second resurvey, while the variance of each estimator is consistently lower. The improvement in κ can be as high as 10 percentage points. The sampling distribution of kmax relative to κ suggests that the former is more stable, though equally sensitive.
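The BCa intervals compared above adjust the bootstrap percentile endpoints for bias and skewness; a stdlib-only sketch is shown below, using a plain mean of hypothetical agreement values as the statistic rather than the trial estimator:

```python
import random
from statistics import NormalDist

_N = NormalDist()

def bca_interval(data, stat, reps=2000, alpha=0.05, seed=7):
    """Bias-corrected and accelerated (BCa) bootstrap interval for a
    statistic `stat` evaluated on a list of observations."""
    rng = random.Random(seed)
    n = len(data)
    theta = stat(data)
    boots = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(reps))
    # Bias correction z0: how far the bootstrap distribution sits from theta.
    prop = sum(1 for t in boots if t < theta) / reps
    prop = min(max(prop, 1.0 / reps), 1.0 - 1.0 / reps)  # keep inv_cdf finite
    z0 = _N.inv_cdf(prop)
    # Acceleration a from the jackknife (skewness of leave-one-out values).
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jmean = sum(jack) / n
    den = 6.0 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = sum((jmean - j) ** 3 for j in jack) / den if den else 0.0
    def endpoint(z):
        zt = z0 + (z0 + z) / (1.0 - a * (z0 + z))
        idx = int(_N.cdf(zt) * reps)
        return boots[min(max(idx, 0), reps - 1)]
    return endpoint(_N.inv_cdf(alpha / 2)), endpoint(_N.inv_cdf(1 - alpha / 2))
```

When z0 ≈ 0 and a ≈ 0 the BCa endpoints collapse to the plain percentile interval, which is why the two methods agree for symmetric, unbiased estimators and diverge for skewed ones.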
Further, diagnoses that remain '+ +' or '− −' on both occasions also confirm the consistency of the ratings by the two observers.
In the 25 subjects with a full set of 5000 replications, the model yielded good predicted average agreement (0.91 ± 0.06) with a positive slope movement (slope = 2.24 times the average value) and negligible root mean square error (RMSE = 0.03). The model prediction is at a near-perfect level and most robust (R² = 98.0%; SE = 0.06) when compared with Cohen's κ. The maximum predicted agreement falls on rater subject 22 (agreement level 0.99; RMSE 0.032). Accuracy was maintained for each subject (0.80, 0.88, 0.90, 0.98; n = 25), with 25 simulations falling on each data point; the higher the observer accuracy, the better the agreement level. The ratio of agreement levels for each subject at different time periods 'ti' is consistent and maintains the greatest accuracy and precision, with the variance of the mean agreement equal to σ²/n, where n is the number of subjects and σ² is the variance of the agreements. The newer simulated agreement was determined by the bootstrap-Thompson iteration method; the simulated figures were extracted on the basis of central-tendency values with the average observed rate of the 25 subjects. The resulting findings were stronger at the greatest epoch (the modal value attained a rater scale > 9) than the traditional κ measurement (Fig. 1), while all predicted subjects achieve kappa agreement at this level or above (Fig. 2). With an increasing number of subjects and raters, the simulated agreement moves in the positive direction with stronger values of κ; simultaneously, the predicted agreement at the greatest epoch was found to be statistically significantly different between the observers and between agreement levels at varied time intervals 't' (10,000 iteration replications; R² = 0.99). From the research findings we notice that the agreement level shifts lower as the number of data points becomes smaller. When observer accuracy is substantially low, the subject agreement does not produce accurate results, because the iteration points are not generated at low accuracy. Many factors affect the value of kappa, including observer accuracy and the number of subjects, as well as how evenly the observer distributes the subjects. There is no single value of modified kappa (produced by the bootstrap technique) that can be regarded as universally acceptable, and the propagation of the simulated figures is highly associated with observer accuracy, precision and the number of subjects. With a small number of subjects (k < 5), especially in binary classifications, modified kappa values need to be interpreted with extra caution. Eventually, in the case of binary classification, predicted variability has the strongest impact on κ values, leading to heterogeneity and the strongest impact on observer agreement.
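The σ²/n precision described above can be computed directly; a small sketch with hypothetical per-subject agreement values:

```python
import math

def mean_agreement_and_se(agreements):
    """Mean agreement and its standard error sqrt(sigma^2 / n), where
    sigma^2 is the sample variance of the per-subject agreement values."""
    n = len(agreements)
    mean = sum(agreements) / n
    var = sum((x - mean) ** 2 for x in agreements) / (n - 1)
    return mean, math.sqrt(var / n)

# Hypothetical agreement levels for three subjects
m, se = mean_agreement_and_se([0.8, 0.9, 1.0])
```

Because the standard error shrinks as 1/sqrt(n), doubling the number of subjects tightens the precision of the mean agreement by roughly 30%.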
On the other hand, when there are more observers (> 25 subjects), the increment in the expected kappa value becomes flat; hence one may simply determine the percentage of agreement. Moreover, the increments in the performance metrics apart from sensitivity also reach an asymptote beyond 25 subjects (Fig. 3 to 5). The observer-agreement data points were plotted on Box-Cox and Q-Q plots from the modified kappa bootstrap technique to check whether the points originate from a population with a common distribution. The results show that the observer-agreement points are normally distributed, which optimizes the bootstrap technique with the Thompson method of integration. The optimal lambda value ranged from 1 to 2, meaning each individual agreement can take a rater weight of 2.0 on the ordinal scale for estimation of the density plot; the model also produced a Q-Q plot for the Weibull distribution to determine the exact confidence limit of κ. Figure 6 shows that the confidence limit of the weighted modified κ was 6.88, with a scale of 0.81 at that agreement level.
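The Box-Cox lambda reported above can be reproduced in spirit with a simple grid search over the profile log-likelihood; this is a stdlib-only sketch, not the routine used in the original analysis, and the data in the usage example are hypothetical:

```python
import math

def boxcox_loglik(x, lam):
    """Profile log-likelihood of the Box-Cox transform at a given lambda
    (observations must be strictly positive)."""
    n = len(x)
    if abs(lam) < 1e-12:
        y = [math.log(v) for v in x]            # lambda = 0 is the log case
    else:
        y = [(v ** lam - 1.0) / lam for v in x]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in x)

def boxcox_lambda(x):
    """Grid-search the lambda in [-2, 2] maximizing the log-likelihood."""
    grid = [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))

# Hypothetical per-subject agreement values
lam = boxcox_lambda([0.72, 0.81, 0.88, 0.90, 0.95, 0.99, 0.85, 0.79, 0.91, 0.87])
```

A lambda near 1 indicates the data are already close to normal without transformation, which is consistent with the normality conclusion drawn from the Q-Q plot.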

Discussion
The first point of interest emerging from the study is that the estimates of agreement between two observers follow a non-Gaussian distribution. Consequently, the sample estimators, though unbiased, provide a lower variance and narrower confidence limits, and the 95% confidence limits on each parameter are symmetric. This emphasizes the need to estimate the agreement measures and their standard errors in a way that allows for their appropriate distribution. Adjustments to these estimates, as well as hypothesis tests, also perform poorly because they are based on the assumption that observations from population screening for disease conditions follow an asymptotic normal distribution. The second point of interest is the demonstration that, through the bootstrap, we can obtain accurate estimates of κ and its standard error; this approach is better than currently used methods for non-normal data. It also allows testing the hypothesis of whether the κ values at the first and second resurveys are significantly different. We observe that κ at the second resurvey is significantly greater than at the first, indicating that κ should not be treated as static over time when the same pair of observers is repeatedly employed to classify the disease condition while screening a given population; this helps ensure estimation of the near-true incidence of the disease. On the other hand, if we use kmax as a single index in terms of chance agreement, prevalence and bias indices, we are on the safer side, because this index is more stable while equally sensitive vis-à-vis κ. We therefore propose kmax as the measure of agreement for comparison and interpretation. In studies of inter-rater reliability of categorical data sets, κ is used to verify the presence of the themes that are presented.
The κ coefficient is a statistical measure of the inter-rater reliability of agreement used to assess qualitative documents and determine agreement between two raters. An important assumption underlying the use of the kappa coefficient is that the errors associated with clinicians' ratings are independent (Brennan and Prediger, 1981; Hoehler, 2000; Sim and Wright, 2000; Cohen, 1960; Richards et al., 2003). This requires the patients or subjects to be independent and the ratings to be independent, so that each observer generates a rating without knowledge, and thus without influence, of the other observer's rating. The fact that the ratings are related in the sense of pertaining to the same intervention does not, however, contravene the assumption of independence (Sim and Wright, 2005). The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often these ratings lie on a nominal or an ordinal scale, and for such data the kappa coefficient is an appropriate measure of reliability. Important factors that can influence the magnitude of kappa (prevalence, bias and non-independent ratings) are key indicators in assessing the kappa statistic. In the present research, the coefficient can be used for scales with two or more categories to assess agreement levels for both qualitative and quantitative traits; a similar study was reported by Rigby (2000), who assessed intra-observer and inter-observer agreement of the radiographic classification of scoliosis in relation to the King classification system, and who emphasized the value of the multiple regression method and the importance of power and of measuring effects rather than testing significance. For more than two raters, the mathematics is such that the raters are not considered unique.
For instance, if there are three raters, there is no assumption that the three raters who rate the first subject are the same as the three who rate the second. Although we call this the more-than-two-raters case, it can also be used with two raters when the raters' identities vary. The κ statistic was first proposed by Cohen (1960). Its generalization with weights reflecting the relative seriousness of each possible disagreement due to attributable factors is due to Cohen (1968), and the analysis-of-variance approach for k = 2 raters and m ≥ 2 outcomes is due to Landis and Koch (1977). In the formulas, let m be the number of raters and k the number of rating outcomes. Carpentier et al. (2017) demonstrated the free-response kappa in a computed form using the total numbers of discordant (b and c) and concordant positive (d) observations made in all patients, as 2d/(b + c + 2d). In 84 full-body magnetic resonance imaging procedures in children evaluated by two independent raters, the free-response kappa statistic was 0.820; aggregation of results within regions of interest resulted in overestimation of agreement beyond chance. The free-response kappa provides an estimate of agreement beyond chance in situations where only positive findings are reported by raters. Kang et al. (2013) noted that when observations are independent, confidence intervals can be computed using several methods; for clustered data, a common situation in radiology, the approach we opted for in the present study is bootstrap based, testing various subjects on the basis of different attributes.
We sampled subjects (with a cluster approach) and used all observations from any selected patient. Yang and Zhou (2014) noted that κ is widely used to assess agreement between two procedures in independent matched-pair data. When matched-pair data are collected in clusters, they propose a nonparametric variance estimator for the kappa statistic that requires neither a cluster correlation structure nor distributional assumptions; extensive Monte Carlo simulation demonstrated that the proposed κ provides consistent estimation and that the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters. Our study complements this: we derived our model from Thompson iteration optimization methods applied to various subjects to escalate the agreement levels of both qualitative and quantitative data sets.

Conclusion
The new statistical tool is very useful for assessing the different association measures for testing agreement between two or more observers on a continuous scale at varied time intervals. The newer tool increases the precision of the raters' confidence level in a single real number without any substantial loss of information. It is capable of reproducing accurate predicted agreement levels and slopes of continuous data series, and there is no complexity in inter-observer agreement testing across multiple degrees at different time intervals.