Prediction Intervals for Future Observations Based on Samples of Random Sizes

Mohammad Z. Raqab
Department of Mathematics, The University of Jordan, Amman 11942, Jordan
Email: mraqab@ju.edu.jo

Abstract: On the basis of the failure times of an informative sample (X-sample) of random size M of iid continuous random variables, we consider the problem of predicting the failure times of another independent future sample (Y-sample) of random size N of iid variables from the same distribution. In this paper, we derive various exact prediction intervals for future failure times from the Y-sample based on the failure times from the X-sample. Specifically, prediction intervals for individual failure times, as well as outer and inner prediction intervals, are derived based on X-order statistics. Prediction intervals for the increments of order statistics are also investigated. Exact expressions for the coverage probabilities of these intervals are derived and computed numerically. A practical example on a biometric data set is used to illustrate the results developed here.


Introduction
Prediction of future order statistics is of natural interest in many practical problems. Considerable work has been done on the one-sample and two-sample prediction problems, and both parametric and nonparametric inferential methods have been developed in this regard. Interested readers may refer to Gulati and Padgett (2003) for details on related developments. An excellent review of developments on prediction problems through the late 1990s can be found in Kaminsky and Nelson (1998). Using the Bayesian approach, Raqab and Madi (2002) considered the prediction of the total time on test using doubly censored Rayleigh data. Basak et al. (2006) considered maximum likelihood predictors for different lifetime distributions when the data are progressively censored.
Prediction intervals are used extensively in reliability theory, survival studies and industrial applications, for example to predict the number of defective items to be produced during a future production process. Distribution-free confidence and prediction intervals have been discussed rather extensively in the context of order statistics. While Wilks (1962) and Krewski (1976) discussed the construction of outer and inner confidence intervals for quantile intervals based on order statistics, concise reviews on this topic may be found in the books by David and Nagaraja (2003) and Arnold et al. (2008) and in Ahmadi and Balakrishnan (2011). Barakat et al. (2011) obtained prediction intervals for future exponential lifetimes based on random generalized order statistics. El-Adll (2011) discussed the problem of predicting future lifetimes based on the three-parameter Weibull distribution. Barakat et al. (2014) obtained prediction intervals of future observations for a sample of random size from a continuous distribution. For similar prediction problems, one may also refer to Balakrishnan et al. (2005), Sultan and Ellah (2006), Asgharzadeh and Valiollahi (2010), Abdel-Hamid and AL-Hussaini (2014) and El-Adll and Aly (2016). Prediction intervals based on two-sample prediction problems are discussed, for example, in Fligner and Wolfe (1976; 1979) for order statistics. Recently, Basiri et al. (2016) developed various nonparametric prediction intervals of order statistics from a future sample based on observed order statistics. Barakat et al. (2016) constructed prediction intervals for future two-parameter exponential lifetimes based on a random number of generalized order statistics under a general set-up.
Let X 1:n ≤ X 2:n ≤ ... ≤ X n:n be the order statistics of a random sample of fixed size n from an absolutely continuous cumulative distribution function (cdf) F(x) with probability density function (pdf) f(x). In reliability theory, X s:n represents the life length of an (n-s+1)-out-of-n system made up of n identical components with independent life lengths. When s = n, the system is better known as the parallel system. For a detailed discussion in this respect, see Barlow and Proschan (1981). In biological, agricultural and some quality control problems, we often come across situations where the sample size is a random variable (rv) and it is almost impossible to have a fixed sample size, because either some observations get lost for various reasons or the size of the target population and its representative sample cannot be determined well. Let us consider the following practical example. Suppose that a fish biologist would like to study the weight data of a specific rare fish living in the Red Sea over a one-year period. The fish weight will vary depending on age, sex, season and recent feeding activity. Clearly, the number of fish to be caught on a daily or yearly basis is a rv. The weight outcomes of this study are the ordered data X 1:m ≤ X 2:m ≤ ... ≤ X m:m, where X i:m is the i-th ordered weight and m is the observed value of a rv M. Although the distribution of M is unknown, we can estimate the empirical distribution of M by considering its frequency table over the one-year period. Based on the observed ordered weights over the one-year period and under similar conditions, we aim at providing nonparametric prediction intervals for the weights of fish to be caught in the next year. It is often reasonable to assume that the data sample and future sample are independent and that the random number of fish to be caught is independent of the fish weights.
If one introduces the random sample size as an extension of a model (mainly for statistical inference), one can usually assume that it is independent of the underlying variables. Other practical applications could be the study of sizes or intensity of brightness for comets and shooting stars penetrating the atmosphere of earth in certain areas.
In this paper, we discuss Prediction Intervals (PIs) for future order statistics, Y 1:N , Y 2:N ,..., Y N:N , from a future independent sample (Y-sample) of random size N from a continuous cdf F(x), based on an informative sample (X-sample) X 1:M , X 2:M ,...,X M:M of random size M from the same distribution. Throughout this paper, we assume that the rv's N and M are independent of Y 1:n , Y 2:n ,..., Y n:n and of the items of the informative sample X 1:m , X 2:m ,...,X m:m , for any n and m, respectively. This paper is organized as follows. In Section 2, we present some useful preliminaries. In Section 3, we derive PIs for a future order statistic. In Sections 4 and 5, outer and inner PIs for future order statistic intervals, respectively, are derived. Numerical computations for picking the appropriate order statistics for the establishment of these PIs are performed and presented in their respective sections. In Section 6, we show how the increments of order statistics of the observed X-sample can be used to construct upper and lower prediction limits for increments of order statistics of the future Y-sample. Finally, in Section 7, a practical example on a biometric data set is used to illustrate all the results developed here.

Preliminaries
Let 1 ≤ r < s ≤ m. Then the interval (X r:m , X s:m ) is termed the coverage and I r,s:m = X s:m - X r:m the increment. The marginal pdf of the r-th order statistic U r:n from the uniform U(0, 1) distribution [see David and Nagaraja (2003)] is

f r:n (u) = [n!/((r-1)!(n-r)!)] u^(r-1) (1-u)^(n-r), 0 < u < 1.

The joint pdf of U r:n and U s:n (1 ≤ r < s ≤ n) is similarly given by

f r,s:n (u, v) = [n!/((r-1)!(s-r-1)!(n-s)!)] u^(r-1) (v-u)^(s-r-1) (1-v)^(n-s), 0 < u < v < 1.

The following well-known lemma [see, for example, David and Nagaraja (2003), p. 18] will be quite helpful in simplifying some arguments in the proofs of the results established in the subsequent sections.

Lemma 1
With U i = F(X i ) and V j = F(Y j ), we have

(F(X 1:n ), F(X 2:n ),..., F(X n:n )) =d (U 1:n , U 2:n ,..., U n:n ) and (F(Y 1:n ), F(Y 2:n ),..., F(Y n:n )) =d (V 1:n , V 2:n ,..., V n:n ),

where U i:n and V j:n are the i-th and j-th order statistics arising from samples of independent and identically distributed (iid) uniform rv's U i 's and V j 's, and =d denotes equality in distribution.
The following identity will be quite useful in establishing various results in the next sections: for integers 1 ≤ k ≤ n and 0 ≤ x ≤ 1,

∫ from 0 to x of [n!/((k-1)!(n-k)!)] t^(k-1) (1-t)^(n-k) dt = Σ_{j=k}^{n} C(n, j) x^j (1-x)^(n-j),    (4)

that is, the incomplete beta function can be written as a finite sum of binomial probabilities.
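As a quick numerical sanity check, the two sides of the identity (4) can be compared directly: the right-hand side is a binomial tail probability, while the left-hand side can be approximated by midpoint-rule integration of the Beta(k, n-k+1) density. The following sketch (function names are ours, not from the paper) illustrates the agreement.

```python
from math import comb

def binom_sum(k, n, x):
    # Right-hand side of (4): sum of binomial probabilities C(n,j) x^j (1-x)^(n-j), j = k..n
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(k, n + 1))

def inc_beta(k, n, x, steps=100000):
    # Left-hand side of (4): midpoint-rule integral of the Beta(k, n-k+1) density over (0, x)
    c = k * comb(n, k)  # equals n!/((k-1)!(n-k)!)
    h = x / steps
    return sum(c * ((i + 0.5) * h) ** (k - 1) * (1 - (i + 0.5) * h) ** (n - k) * h
               for i in range(steps))

print(binom_sum(4, 10, 0.3), inc_beta(4, 10, 0.3))  # the two sides agree numerically
```

The identity is what allows the coverage probabilities in the following sections to be reduced from integrals to finite sums.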

PIs for a Future Order Statistic
Let X 1:M , X 2:M ,...,X M:M be the order statistics from the X-sample with random size M of iid rv's from F(x). For given 0 < α < 1, suppose we are interested in obtaining a 100(1-α)% PI for the k-th order statistic Y k:N from the Y-sample with random size N, of the form (X r:M , X s:M ), 1 ≤ r < s ≤ M, such that P(X r:M ≤ Y k:N ≤ X s:M ) = 1-α. We refer to the interval (X r:M , X s:M ) as a 100(1-α)% PI for Y k:N . In this section, we derive such two-sided PIs for Y k:N with coverage probabilities that are free of the parent distribution F.

Theorem 1
Let {X i , i ≥ 1} and {Y i , i ≥ 1} be two independent sequences of iid rv's from the same cdf F(x), having random sample sizes M and N, respectively. Then (X r:M , X s:M ) is a PI, based on the X-sample, for the future order statistic Y k:N from the Y-sample, with the corresponding prediction coefficient, free of F, given by

π 1 (r, s; k) = Σ_{m ≥ s} Σ_{n ≥ k} P M (m) P N (n) Σ_{j=r}^{s-1} C(j+k-1, j) C(m-j+n-k, m-j) / C(m+n, m),

where P M (m) and P N (n) stand for P(M = m) and P(N = n), respectively.

Proof:
We have, for fixed m ≥ s and n ≥ k, by Lemma 1,

P(X r:m ≤ Y k:n ≤ X s:m ) = P(U r:m ≤ V k:n ≤ U s:m ).

By the continuity of F, ties occur with probability zero, and the event {U r:m ≤ V k:n ≤ U s:m } occurs if and only if at least r and fewer than s of the X-observations fall below Y k:n . The number J of X-observations below Y k:n follows the negative hypergeometric law

P(J = j) = C(j+k-1, j) C(m-j+n-k, m-j) / C(m+n, m), j = 0, 1,..., m,

so that

P(X r:m ≤ Y k:n ≤ X s:m ) = Σ_{j=r}^{s-1} C(j+k-1, j) C(m-j+n-k, m-j) / C(m+n, m).

Now, on using the facts that M ≥ s and N ≥ k and applying the conditioning arguments on the rv's M and N (Raghunandanan and Patil, 1972; Buhrman, 1973), the result in Theorem 1 follows readily. ∇

Remark 1:
From Theorem 1, (X r:M , X s:M ) can be picked as a 100(1-α)% PI for Y k:N with a specified coefficient level 1-α (0 < α < 1) such that π 1 (r, s; k) ≥ 1-α. The prediction coefficient π 1 (r, s; k), which is free of the parent distribution F, can thus be computed simply from the probability distributions of the rv's M and N. Letting r = 0 and setting X 0:M = -∞, we obtain a one-sided PI of the form (-∞, X s:M ) immediately from the above formula. Similarly, if s is an integer such that P(r ≤ M < s) = 1 and we set X s:M = ∞, we get (X r:M , ∞) as a one-sided PI for Y k:N , with the prediction coefficient reduced accordingly. Under the settings and conditions of Theorem 1 with M and N degenerate, that is, P(M = m) = P(N = n) = 1 with m ≥ s and n ≥ k, π 1 (r, s; k) reduces to

π 1 (r, s; k) = Σ_{j=r}^{s-1} C(j+k-1, j) C(m-j+n-k, m-j) / C(m+n, m).

Under the settings and conditions of Theorem 1, if M and N have binomial distributions with parameters v and p [M, N ∼ B(v, p)], then

P M (m) = C(v, m) p^m (1-p)^(v-m), m = 0, 1,..., v,

and similarly for P N (n); substituting these probabilities into Theorem 1 yields the corresponding prediction coefficient. Analogous expressions follow for other distributions of the sample sizes.
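The prediction coefficient can be computed directly. The sketch below implements the fixed-size coverage (the classical two-sample result quoted in Remark 1) and mixes it over the size distributions as in Theorem 1; the function names and the dict-based pmf interface are ours, for illustration only.

```python
from math import comb

def p_j_below(j, k, m, n):
    # P(exactly j of the m X-values fall below Y_{k:n}) -- negative hypergeometric law
    return comb(j + k - 1, j) * comb(m - j + n - k, m - j) / comb(m + n, m)

def pi1_fixed(r, s, k, m, n):
    # Degenerate-size coverage P(X_{r:m} <= Y_{k:n} <= X_{s:m}) of Remark 1
    return sum(p_j_below(j, k, m, n) for j in range(r, s))

def pi1(r, s, k, pmf_M, pmf_N):
    # Prediction coefficient of Theorem 1; pmf_M, pmf_N map sizes to probabilities.
    # Terms with m < s or n < k are skipped, since X_{s:M} (resp. Y_{k:N}) requires M >= s (N >= k).
    return sum(pm * pn * pi1_fixed(r, s, k, m, n)
               for m, pm in pmf_M.items() if m >= s
               for n, pn in pmf_N.items() if n >= k)

# Example: both sizes binomial B(20, 0.7), predicting Y_{6:N}
pmf = {x: comb(20, x) * 0.7**x * 0.3**(20 - x) for x in range(21)}
print(pi1(2, 10, 6, pmf, pmf))
```

Since the coefficient is free of F, no feature of the parent distribution enters the computation, only the pmfs of M and N.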

In the following lemma, we provide a 100(1-α)% PI that contains at least t observations of the Y-sample (N ≥ t).

Lemma 2:
Let J t,N denote the event that at least t out of the N future observations fall in the two-sided interval (X r:M , X s:M ), where N ≥ t. Under the conditions and setting of Theorem 1, the coverage probability of this event is free of F. Given M = m and N = n > t, the conditional coverage probability can be evaluated from the joint distributions of the uniform order statistics in Lemma 1; using the conditioning argument on M and N, the result follows immediately.

Corollary 2:
If the experimental conditions have not changed, we may assume P(M = x) = P(N = x) for all x and obtain the corresponding PI, where m is the observed value of the rv M. Under the condition that both samples have equal random sizes (say N), the coverage probability in Theorem 1 simplifies accordingly. It is worth mentioning that one cannot apply Theorem 1 for arbitrary future order statistics, except for the lower extremes; otherwise, the value of k in π 1 (r, s; k) would depend on N, which does not make sense. However, if the goal is to estimate a specific parent quantile ξ p of order p (0 < p < 1), one may refer to Al-Mutairi and Raqab (2017). For predicting the sample median when the sample size N is random, we first obtain an integer k such that P(N ≤ k) ≥ 1/2 and P(N ≥ k) ≥ 1/2 and then predict the corresponding value of Y k:N . For example, if N ∼ B(20, 0.3), then k = 6 and Y 6:N has to be predicted. Let us consider the case with P(N = n 0 ) = 1, where n 0 is either the observed value of N, a hypothetical value, or even a predicted value of N. The values of π 1 (r, s; k) are presented in Table 1 for various choices of r, s and k. If the desired level of the PI is specified to be 1-α, we may choose r and s so that π 1 (r, s; k) ≥ 1-α. Evidently, the distribution of the sample size plays an important role in evaluating the prediction coefficients. It can easily be checked that, for fixed r, π 1 (r, s; k) increases rapidly as s increases for small values of s. For large values of s, π 1 (r, s; k) changes very slowly as s increases. That is, the computed values of these coefficients are not sensitive to large values of s-r.
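The median rule above can be sketched in a few lines; with N ∼ B(20, 0.3) it reproduces k = 6. The helper names are ours, for illustration.

```python
from math import comb

def binom_pmf(v, p, x):
    # P(N = x) for N ~ B(v, p)
    return comb(v, x) * p**x * (1 - p)**(v - x)

def median_order(pmf, support):
    # Smallest k with P(N <= k) >= 1/2 and P(N >= k) >= 1/2
    for k in support:
        if (sum(pmf(j) for j in support if j <= k) >= 0.5 and
                sum(pmf(j) for j in support if j >= k) >= 0.5):
            return k

k = median_order(lambda j: binom_pmf(20, 0.3, j), range(21))
print(k)  # -> 6, so Y_{6:N} is to be predicted
```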

Outer PIs for an Order Interval
Suppose (Y k:N , Y l:N ) is a future interval of order statistics from the Y-sample and that we are interested in obtaining a 100(1-α)% PI for it of the form (X r:M , X s:M ) such that P(X r:M < Y k:N < Y l:N < X s:M ) ≥ 1-α. Then (X r:M , X s:M ) is termed the outer PI for the interval (Y k:N , Y l:N ). In this section, we describe how such PIs can be constructed.

Theorem 2:
Under the settings and assumptions of Theorem 1, (X r:M , X s:M ) is an outer PI, based on the X-sample, for the future interval of order statistics (Y k:N , Y l:N ) from the Y-sample, with the corresponding prediction coefficient π 2 (r, s; k, l), free of F.

Proof: From Lemma 1, for fixed m and n with 1 ≤ r < s ≤ m and 1 ≤ k < l ≤ n,

P(X r:m < Y k:n < Y l:n < X s:m ) = P(U r:m < V k:n < V l:n < U s:m ).

Since (U r:m , U s:m ), with m and n fixed, are jointly distributed as the r-th and s-th order statistics from the standard uniform distribution, this probability can be written as a double integral of the joint pdf of (V k:n , V l:n ) against that of (U r:m , U s:m ). The inner integral is an incomplete beta function and so, using (4), it can be expressed as a finite sum of binomial probabilities. By the change of variable z = (1-w 2 )/(1-v k ), the remaining integral is also transformed into an incomplete beta function which, upon using the identity (4) again, yields a finite sum of binomial probabilities. Substituting the resulting expression and applying the conditioning arguments on M and N, the result readily follows. ∇

Corollary 3:
Under the same settings, and the condition that both samples are taken with equal random sizes (say, N), the corresponding prediction coefficient π 2 (r, s; k, l) simplifies accordingly. The values of π 2 (r, s; k, l) are presented in Table 2 for various choices of r, s, k and l. If the desired level of the outer PI is specified to be 1-α 0 , we may choose appropriate values of r and s so that π 2 (r, s; k, l) ≥ 1-α 0 . As naturally expected, the outer PIs of the order intervals must be wider in order to achieve the specified level 0.90. Therefore, given a specified level 1-α 0 , it is more likely that an outer PI for (Y k:N , Y l:N ) based on (X r:M , X s:M ) is attained when s-r gets large. Clearly, the randomness of the sample size affects the behavior of π 2 (r, s; k, l) when compared with the fixed sample size. It can be seen from Table 2 that π 2 (r, s; k, l) increases as s moves away from r, but it is not sensitive to large values of s.
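Since π 2 involves nested sums, a quick Monte Carlo check of the outer coverage for fixed sizes is often handy; because the coefficient is distribution-free (Lemma 1), uniform samples suffice. This is our illustrative sketch, not the paper's computation.

```python
import random

def outer_coverage_mc(r, s, k, l, m, n, reps=20000, seed=1):
    # Estimates P(X_{r:m} < Y_{k:n} < Y_{l:n} < X_{s:m}) by simulation from U(0, 1);
    # Y_{k:n} < Y_{l:n} holds automatically for k < l, so only the outer bounds are checked.
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = sorted(rng.random() for _ in range(m))
        y = sorted(rng.random() for _ in range(n))
        if x[r - 1] < y[k - 1] and y[l - 1] < x[s - 1]:
            hits += 1
    return hits / reps

print(outer_coverage_mc(1, 10, 3, 5, 10, 10))
```

Such estimates are useful for spot-checking entries of Table 2 in the degenerate-size case.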

Inner PIs for an Order Interval
Suppose we are interested in obtaining a 100(1-α)% PI for the future interval of order statistics (Y k:N , Y l:N ) from the Y-sample, of the form (X r:M , X s:M ), such that P(Y k:N < X r:M < X s:M < Y l:N ) ≥ 1-α. Then (X r:M , X s:M ) is termed the inner PI for the interval (Y k:N , Y l:N ). In this section, we describe how such inner PIs can be constructed.

Theorem 3:
Under the setting and assumptions of Theorem 1, (X r:M , X s:M ) is an inner PI, based on the X-sample, for the future interval of order statistics (Y k:N , Y l:N ) from the Y-sample, with the corresponding prediction coefficient π 3 (r, s; k, l), free of F.

Proof: Using arguments similar to those in the proof of Theorem 2, we have, for fixed m and n,

P(Y k:n < X r:m < X s:m < Y l:n ) = P(V k:n < U r:m < U s:m < V l:n ).

Upon using the conditioning argument and a change of variables, the resulting integral can be transformed into an incomplete beta integral, which can be written as a finite sum of binomial probabilities using (4). The required result in Theorem 3 then follows by the conditioning arguments on M and N. ∇

Remark 4:
When M and N are degenerate rv's such that P(M = m) = P(N = n) = 1 with m ≥ s and n ≥ l, π 3 (r, s; k, l) reduces to a finite sum of binomial-type probabilities. Table 3 presents the values of π 3 (r, s; k, l) for various choices of r, s, k and l. If the desired level of the inner prediction interval is specified to be 1-α 0 , we may choose appropriate values of r and s so that π 3 (r, s; k, l) ≥ 1-α 0 . It is worth mentioning here that the shortest inner PIs attaining the specified level 0.90 are sought. Table 3 shows that, for fixed r, π 3 (r, s; k, l) reaches its maximum quickly for small values of s-r. The values of π 3 (r, s; k, l) are not sensitive to small values of s-r and change significantly when s-r gets large.

Table 3: Values of π 3 (r, s; k, l) for some choices of r, s, k and l; B(20, p)
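As with the outer case, the inner coverage for fixed sizes can be spot-checked by simulation from the uniform distribution, since the coefficient is free of F. The function below is our illustrative sketch.

```python
import random

def inner_coverage_mc(r, s, k, l, m, n, reps=20000, seed=1):
    # Estimates P(Y_{k:n} < X_{r:m} < X_{s:m} < Y_{l:n}) by simulation from U(0, 1);
    # X_{r:m} < X_{s:m} holds automatically for r < s.
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = sorted(rng.random() for _ in range(m))
        y = sorted(rng.random() for _ in range(n))
        if y[k - 1] < x[r - 1] and x[s - 1] < y[l - 1]:
            hits += 1
    return hits / reps

print(inner_coverage_mc(4, 6, 1, 10, 10, 10))
```

Note the direction of the effect: pulling (r, s) inward raises the inner coverage, the opposite of the outer case.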

PIs for Order Statistics Increments
The increments of order statistics I r,s:M = X s:M - X r:M , M ≥ s, based on the X-sample can be used, in similar settings, to construct PIs for the increments of order statistics I* k,l:N = Y l:N - Y k:N of the future sample; when k and l are consecutive, these are the spacings. The increments of order statistics are usually used to measure the variation in the case of location-scale distributions; see, for example, David and Nagaraja (2003, p. 160). In this section, we describe how the increments I r,s:M can be used to construct upper and lower prediction limits for the increments of order statistics I* k,l:N . The key observation is the following: if the event {X r:M < Y k:N < Y l:N < X s:M } occurs, then necessarily I* k,l:N < I r,s:M ; similarly, if {Y k:N < X u:M < X v:M < Y l:N } occurs, then I* k,l:N > I u,v:M . By the conditioning argument, we therefore have

P(I* k,l:N ≤ I r,s:M ) ≥ π 2 (r, s; k, l) and P(I* k,l:N ≥ I u,v:M ) ≥ π 3 (u, v; k, l),

where π 2 (r, s; k, l) and π 3 (u, v; k, l) are as defined in Theorems 2 and 3, respectively. Therefore, I r,s:M serves as an upper prediction limit and I u,v:M as a lower prediction limit for I* k,l:N , with prediction coefficients at least π 2 (r, s; k, l) and π 3 (u, v; k, l), respectively.
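The implication behind the upper prediction limit (the outer event forces I* k,l:N < I r,s:M ) can be checked on simulated draws: on every replication where the outer event occurs, the increment inequality also holds, so the estimated probability of the latter can never fall below that of the former. Names are illustrative.

```python
import random

def increment_vs_outer(r, s, k, l, m, n, reps=20000, seed=7):
    # Counts, on the same uniform draws, the outer event
    # {X_{r:m} < Y_{k:n} < Y_{l:n} < X_{s:m}} and the increment event
    # {Y_{l:n} - Y_{k:n} <= X_{s:m} - X_{r:m}}; the first implies the second.
    rng = random.Random(seed)
    outer = incr = 0
    for _ in range(reps):
        x = sorted(rng.random() for _ in range(m))
        y = sorted(rng.random() for _ in range(n))
        if x[r - 1] < y[k - 1] and y[l - 1] < x[s - 1]:
            outer += 1
        if y[l - 1] - y[k - 1] <= x[s - 1] - x[r - 1]:
            incr += 1
    return outer / reps, incr / reps
```

Because the outer event implies the increment event draw by draw, the second estimate dominates the first with certainty, mirroring P(I* k,l:N ≤ I r,s:M ) ≥ π 2 (r, s; k, l).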

Remark 5:
The increments of order statistics I r,s:M represent the widths of the outer and inner PIs. Since the expected width of a PI can be considered as an optimality criterion when comparing different intervals, the evaluation of E(I r,s:M ) is of natural interest. In this regard, the sharp bounds for E(I r,s:m ) established by Raqab (2003) would be useful.

Illustrative Data Analysis
In this section, we apply the procedures developed here to a biometric data set analyzed by Prentice (1973) and Lawless (2003). The data represent the survival times of lung cancer patients, measured from the start of chemotherapy treatment for each patient.
The methods developed in the previous sections will be applied to analyze the survival time data, assuming a random sample size N and that 20 patients begin the therapeutic program. From past experience, it is known that 20% of patients left the therapeutic program for one reason or another. Consequently, a binomial random sample size is a reasonable assumption, that is, N ∼ B(20, 0.70). Based on the observed ordered data, PIs for future order statistics and order intervals from another independent sample of random size under similar conditions were obtained with prediction coefficients of at least 1-α 0 = 0.90. These intervals are presented in Tables 4 and 5, respectively. To see how the PIs of future order statistics compare for random and fixed sample sizes, we have also presented in Table 4 the prediction coefficients based on the fixed sample size n = 20. From Table 4, one can check that the PIs based on the informative sample with random size are shorter than those for the fixed sample size, with probability coefficients of at least the specified level 1-α = 0.90. For example, let us consider the PI of the 6-th order statistic from the Y-sample for a random sample size as well as for a fixed sample size. The shortest PI of Y 6:N is (X 2:M , X 10:M ) = (0.29, 2.0), with coverage probability 0.90, which coincides with the coefficient level 0.90 when the sample size is random. For the fixed sample size, however, the shortest PI of Y 6:20 is (X 2:20 , X 11:20 ) = (0.29, 12.86), with coefficient level 0.902. Similar observations can be made for the PI of Y 8:N . Generally, one can conclude that the randomness of N, along with its distribution, is an essential factor in allowing the PIs to become shorter when compared with the fixed sample size. The outer and inner PIs of a future order interval, for both the random sample size N and the fixed sample size N = 20, are displayed in Tables 5 and 6, respectively.
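The selection of the shortest PI in Table 4 can be mimicked by searching, for a given k, the narrowest pair (r, s) whose prediction coefficient meets the 0.90 level, with M, N ∼ B(20, 0.7) as in the example. The coefficient routine mirrors Theorem 1 and the helper names are ours; the search is a sketch, not the paper's exact tabulation.

```python
from math import comb

def pi1_fixed(r, s, k, m, n):
    # Fixed-size coverage P(X_{r:m} <= Y_{k:n} <= X_{s:m})
    return sum(comb(j + k - 1, j) * comb(m - j + n - k, m - j) / comb(m + n, m)
               for j in range(r, s))

def pi1(r, s, k, pmf):
    # Random-size prediction coefficient with M and N sharing the pmf {size: prob}
    return sum(pm * pn * pi1_fixed(r, s, k, m, n)
               for m, pm in pmf.items() if m >= s
               for n, pn in pmf.items() if n >= k)

def shortest_pi(k, pmf, level=0.90, max_s=20):
    # Narrowest (r, s), by s - r, with prediction coefficient >= level (None if unattainable)
    best = None
    for r in range(1, max_s):
        for s in range(r + 1, max_s + 1):
            if pi1(r, s, k, pmf) >= level and (best is None or s - r < best[1] - best[0]):
                best = (r, s)
    return best

pmf = {x: comb(20, x) * 0.7**x * 0.3**(20 - x) for x in range(21)}
print(shortest_pi(6, pmf))
```

The same search with the degenerate pmf P(N = 20) = 1 reproduces the fixed-size comparison discussed above.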
From Tables 5 and 6, it is evident that the outer PIs for random size N are shorter than those for fixed sample size N = 20, with the available coefficient level being at least the specified desired level 1-α = 0.90. On the contrary, these tables show that the inner PIs for random N are wider than the inner PIs for fixed N = 20. Further, we can also obtain upper prediction limits for the increments of order statistics from the future sample with coefficient at least 1-α 0 = 0.90. For example, the upper prediction limit for Y 12:N - Y 5:N is X 9:M - X 1:M = 7.17, with prediction coefficient 0.902, while the upper prediction limit for Y 15:N - Y 5:N is X 14:M - X 2:M = 32.71, with prediction coefficient 0.903.