Creating Improved Models of Business Processes Through Statistical Analysis of Customer Orders with Multiple Types of Complexity

To gain a better understanding of the internal work processes in service-oriented supply chains, it is very important to design models that are able to realistically describe the components of the supply chain. To meet this goal, it is necessary to find suitable statistical distributions of the processing times for the orders passing the chain. In this article we examine sample data sets with more than 2,000 individual work times from four steps in the work processes of a time-based aeronautical supply chain and derive the best possible distributions fitting the sample data sets. To increase the realism of the model, both the data sets and the resulting statistical distributions were subdivided into several categories of order complexities, a task made more challenging by the limited amount of data available for the rarer high-complexity orders.


INTRODUCTION
"Time is money and money is time" is one of the most commonly used phrases in the business world, but for airline suppliers it goes to the heart of the matter. Most airline suppliers face a fundamental problem: the time it takes to procure and deliver a replacement part to an airline is longer than the time the airline is prepared to wait for it [1]. To clarify the problems faced by this industry, we investigated the following realistic scenario.
Scenario approach: A customer places an order with 36 hours left until the deadline. Since the order is for a large number of replacement parts, processing it at each step of the supply chain is a lengthy task and it takes 11 hours until the order has moved through the Customer Service department. After this, the order passes to the Stock department, where a recent rush of high-priority orders (with a deadline of 25 hours or less) causes it to wait for 30 min. After this the order is assigned to an employee, but since it is late in the afternoon and he had a lot of trouble completing the high-priority orders, he is not quite as concentrated and has to break off working on it after 180 min at the end of his shift. Thus, the order waits for 14 hours until the next morning, when he completes the order after 30 more min. Now the order has only 7 hours left until the deadline and is thus tagged as a high-priority order. It now moves speedily through the departments without any lengthy waiting periods and it takes 35 more minutes to pass through the Commission department and 25 min until it has been processed by the Final Control department, completing the supply chain with 6 hours left until the deadline.

Preliminaries:
If we want to model the above scenario, we need to determine statistically the time individual employees require to complete their part of the processing. As we have seen, the time required for processing orders within individual departments can vary considerably, and if we want to model the supply chains within a company realistically, we need a reliable statistical prediction. However, a variety of factors (employee fatigue, breaks and any of the numerous other interruptions of daily work life) conspire to make the distributions of processing times different from the more common probability distribution functions. Thus, the processing times for each department and each order complexity must be analyzed individually to find the distribution that fits best. Finding suitable statistical distributions for a variety of data sets is a frequent task in other research fields and industries; examples of this earlier work can be found in [2,3]. In the face of all this earlier work, our analysis is important for two reasons. First, we took on the challenge of modifying and applying these approaches to the uniquely distributed processing times of process chains in the aeronautical supply industry. Second, we tried to take the peculiarities of such supply chains into account by dividing orders into different categories of complexity.
Before modeling the distribution of the times $LT_{Fin}$ needed to complete an order, we summarize some stochastic notation. A triple $(\Omega, \mathcal{B}(\Omega), P)$ consisting of a topological space $\Omega$, its Borel $\sigma$-algebra $\mathcal{B}(\Omega)$ and a probability measure $P$ is called a probability space. Random variables (r.v.) are measurable functions $X: \Omega \to \mathbb{R}$. Integration of a r.v. with respect to the underlying probability space is denoted by an expectation $E$, i.e., $E(X) := \int_\Omega X \, dP$. Conditional probability or expectation given an event $A \in \mathcal{B}(\Omega)$ is denoted by $P(\cdot \mid A)$ or $E(\cdot \mid A)$, respectively. For a given r.v. $X$ on a real probability space ($\Omega \subset \mathbb{R}^d$) we write $f_X$ for its density distribution function with respect to the Lebesgue measure (d.d.f.) and $F_X$ for its cumulative distribution function (c.d.f.).
Statistical model: Typically, the exponential distribution EXP($\lambda$) is used for simulating random waiting times in a mathematical context. It has a simple c.d.f. $F(t) = 1 - \exp(-\lambda t)$ for $t \ge 0$. Moreover, it can be characterized by the following properties:
(E1) EXP($\lambda$) is $\mathbb{R}_+$-valued.
(E2) EXP($\lambda$) is parameterized simply by its expectation $1/\lambda$.
(E3) EXP($\lambda$) is time-invariant (memoryless): for an EXP($\lambda$)-distributed r.v. $X$, the conditional distribution of $X - t$ given the already passed time period $t$ ($t > 0$ arbitrary) is again EXP($\lambda$)-distributed, since it is easily computed that $P(X - t > s \mid X > t) = \exp(-\lambda s) = P(X > s)$.
(E4) The probability that the waiting time is greater than an arbitrary $t$ decreases exponentially in $t$: let $X$ be EXP($\lambda$)-distributed, then for arbitrary $t > 0$ the probability is $P(X > t) = 1 - F_X(t) = \exp(-\lambda t)$.
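Property (E3) can be checked numerically. The following minimal sketch evaluates the conditional survival probability of an exponential distribution for several elapsed times; the rate 0.5 and the function names are illustrative assumptions, not values from the study.

```python
import math

def exp_survival(t, lam):
    """P(X > t) for an exponential r.v. with rate lam."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(X - t > s | X > t) = P(X > s + t) / P(X > t)."""
    return exp_survival(s + t, lam) / exp_survival(t, lam)

lam = 0.5  # hypothetical rate; the expectation is 1 / lam = 2 time units
for elapsed in (0.5, 3.0, 10.0):
    # The conditional value does not depend on the elapsed time.
    print(elapsed, conditional_survival(1.0, elapsed, lam), exp_survival(1.0, lam))
```

Whatever the elapsed time, the remaining waiting time has the same distribution as at the start, which is exactly the behavior that is not expected of order processing times.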

Remark:
The time intervals between two events of a Poisson process with intensity $\lambda$ are EXP($\lambda$)-distributed. The Poisson process is a good model for the radioactive emissions a Geiger counter receives over time. However, for the distribution of the time to finish a certain order $LT_{Fin}$, which we examine in this study, we have to assume slightly different properties, analogous to (E1) to (E4):
(A1) The distribution should not only be $\mathbb{R}_+$-valued. Since we typically observe a minimal time $t_{\min} > 0$ to finish a certain task, the distribution should only assume values greater than or equal to $t_{\min}$.
(A2) We cannot expect to find a distribution parameterized only by its mean. We will also need to know at least its standard deviation.
(A3) We expect that the distribution is not stationary in time. Finishing an order means that the progress of work can be observed. Hence, after waiting some time $t > t_{\min}$ we expect a shorter remaining time for easy complexity orders, or a longer remaining time for difficult complexity orders, as complications during the work process tend to happen more often with the latter. Therefore, for arbitrary $s > 0$ the inequality $P(X > s + t \mid X > t) < P(X > s)$ should hold for short duration orders and $P(X > s + t \mid X > t) > P(X > s)$ for long duration orders.
(A4) The probability $P(X > t)$ need not decrease at least exponentially for $t \to \infty$. For simulating times of short duration orders we again expect an exponential decrease; for long duration orders, however, we also allow distributions decreasing more slowly for $t \to \infty$.
(A5) The modal value $t_{mod}$ of most of the empirical distributions is not on the boundary $t_{\min}$ or $t_{\max}$. Therefore, we will typically fit a d.d.f. $f$ with a maximum not on the boundary, i.e., $\arg\sup\{f(t);\, t \ge 0\} > t_{\min}$.
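The two inequalities in (A3) can be illustrated with a Weibull distribution, chosen here only because its survival function has a closed form. This is a sketch under assumptions: $t_{\min} = 0$, a scale of 1, and the shape values 2.0 and 0.5 are all hypothetical, and the example does not reproduce the classification used in the study.

```python
import math

def weibull_survival(t, shape, scale):
    """P(X > t) for a Weibull r.v. with the given shape and scale."""
    return math.exp(-((t / scale) ** shape))

def remaining_survival(s, t, shape, scale):
    """P(X > s + t | X > t): chance of needing more than s extra time."""
    return weibull_survival(s + t, shape, scale) / weibull_survival(t, shape, scale)

s, t = 1.0, 1.0
for shape in (2.0, 0.5):  # hypothetical shapes: above and below 1
    conditional = remaining_survival(s, t, shape, 1.0)
    unconditional = weibull_survival(s, shape, 1.0)
    # shape > 1: conditional < unconditional (remaining time shrinks)
    # shape < 1: conditional > unconditional (remaining time grows)
    print(shape, conditional, unconditional)
```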

Identifying the class of distributions:
Keeping positivity (A1) and the shape of the d.d.f. (A5) in mind, we have the following set of distributions for further consideration (in alphabetical order), with their mean $\mu$ and standard deviation $\sigma$. It is also important that all distributions are parameterized by the same number of parameters (here two). Otherwise, distributions with more parameters would fit the data more easily than those with fewer, and the results would not be comparable:

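Log-logistic distribution LLO(a, b):
$\mu = \frac{b\pi/a}{\sin(\pi/a)}$ only for $a > 1$, $\sigma^2 = b^2 \left( \frac{2\pi/a}{\sin(2\pi/a)} - \frac{(\pi/a)^2}{\sin^2(\pi/a)} \right)$ only for $a > 2$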

Lognormal distribution LNO($\mu$, $\sigma$):
The exponential distribution EXP($\lambda$) equals GAM(1, $\lambda$) and WEI(1, $\lambda$). Because of this, and since it has only one parameter, we do not list it separately. For more details about these distributions we refer the reader to the statistical handbook [4].
It is easily seen that depending on the properties for the remaining time (A3) and the longtime behavior (A4) we may classify the above listed distributions in the way described in Table 1.
BET(a, b, $t_{\max}$) is an extreme case in the class of distributions for easy complexity orders, as the probability for times $t > t_{\max}$ equals 0. Notice also that LLO(a, b) has no finite second moment for $a \le 2$. To be able to use the method of moment fitting to find suitable parameters of this distribution, we have to assume that $a > 2$.
Fit class of distributions to observed data: To fit the parameters of a certain distribution to some given data we use the method of moment fitting:
Step 1: Calculate the estimates of the mean and the standard deviation (both estimates are unbiased and have minimal mean square error).
Step 2: Find parameters of the distribution such that the mean and standard deviation of the distribution equal the estimated moments. This is done by applying standard mathematical software solvers to the squared difference between the actual mean and standard deviation and the respective target values.
We used this method instead of maximum likelihood estimators (ML-estimators), as we have to fit a rather large class of distributions and it is not true for all distributions that the ML-estimator of their parameters is unbiased and has minimal mean square error. The estimators of the mean and standard deviation, however, always are. Since the first and second moments exist for all of the above distributions, and for LLO(a, b) with $a > 2$, we will not encounter any problems.
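The two steps of moment fitting can be sketched as follows. This is an illustration under assumptions, not the solver setup used in the study: for the gamma and lognormal distributions the moment equations are solved in closed form, for the Weibull distribution a simple bisection on the coefficient of variation replaces the general-purpose solver, and the sample `times`, the function names and the optional shift `t_min` are hypothetical.

```python
import math
import statistics

def fit_gamma(mean, sd, t_min=0.0):
    """Return (shape, scale) of a gamma distribution shifted by t_min."""
    m = mean - t_min
    return (m / sd) ** 2, sd ** 2 / m

def fit_lognormal(mean, sd, t_min=0.0):
    """Return (mu, sigma) of a lognormal distribution shifted by t_min."""
    m = mean - t_min
    sigma2 = math.log(1.0 + (sd / m) ** 2)
    return math.log(m) - sigma2 / 2.0, math.sqrt(sigma2)

def weibull_moments(shape, scale):
    """Mean and standard deviation of a Weibull distribution."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return scale * g1, scale * math.sqrt(g2 - g1 ** 2)

def fit_weibull(mean, sd, t_min=0.0, lo=0.1, hi=50.0):
    """Return (shape, scale); bisection on the coefficient of variation,
    which decreases as the shape grows."""
    m = mean - t_min
    target_cv = sd / m
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        mu, s = weibull_moments(mid, 1.0)
        if s / mu > target_cv:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    return shape, m / math.gamma(1.0 + 1.0 / shape)

# Step 1: estimate mean and standard deviation (hypothetical sample, minutes).
times = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0, 11.0, 30.0]
mean, sd = statistics.mean(times), statistics.stdev(times)
# Step 2: match the moments of each candidate distribution.
print(fit_gamma(mean, sd), fit_lognormal(mean, sd), fit_weibull(mean, sd))
```

Because every candidate is matched to the same two sample moments, the fitted distributions differ only in shape, which is what the subsequent goodness-of-fit comparison evaluates.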
To test the hypothesis that the observed data is actually distributed as we just calculated, and to find the distribution which fits the data best, we use the Kolmogorov-Smirnov test (K-S test). We preferred the K-S test to the $\chi^2$-test for two reasons:
* First, the K-S test is independent of the assumed distribution.
* Second, the K-S statistic is more sensitive to singular deviations from the assumed distribution, whereas the $\chi^2$-statistic is only sensitive to greater single deviations or deviations within many intervals.
According to (A1) we have to find the minimum $t_{\min} \ge 0$ of the distribution (and for BET also the maximum $t_{\max}$). We started with the observed minimum of the data and then iteratively examined smaller values of $t_{\min}$. We finally took the $t_{\min}$ with the smallest K-S statistic (analogously for $t_{\max}$).
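The K-S statistic and the search for $t_{\min}$ can be sketched together. The example below is a minimal illustration under assumptions: it uses a shifted lognormal distribution because its c.d.f. is available through `math.erf`, it scans an equally spaced grid from the observed minimum down to 0, and the grid size, the function names and the data are hypothetical rather than taken from the study.

```python
import math
import statistics

def fit_lognormal(mean, sd, t_min):
    """Moment fit of a lognormal distribution shifted by t_min."""
    m = mean - t_min
    sigma2 = math.log(1.0 + (sd / m) ** 2)
    return math.log(m) - sigma2 / 2.0, math.sqrt(sigma2)

def lognormal_cdf(t, mu, sigma, t_min):
    """C.d.f. of the shifted lognormal distribution."""
    if t <= t_min:
        return 0.0
    z = (math.log(t - t_min) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def ks_statistic(data, cdf):
    """Largest distance between the empirical c.d.f. and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare just after and just before the jump at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def scan_t_min(data, steps=20):
    """Return (K-S statistic, t_min) for the best t_min on the grid."""
    mean, sd = statistics.mean(data), statistics.stdev(data)
    observed_min = min(data)
    results = []
    for k in range(steps + 1):
        t_min = observed_min * (1.0 - k / steps)  # observed minimum down to 0
        mu, sigma = fit_lognormal(mean, sd, t_min)
        d = ks_statistic(data, lambda t: lognormal_cdf(t, mu, sigma, t_min))
        results.append((d, t_min))
    return min(results)

times = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0, 11.0, 30.0]  # hypothetical, minutes
print(scan_t_min(times))
```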

Section 1:
The summary of the results for the work times for Section 1 (as well as the summaries for the other sections) can be found in Fig. 1.
We see that the mean time $LT_{Fin}$ increases with the complexity of the order, as do the empirical $t_{\min}$ and $t_{\max}$ (except for medium complexity orders).
For simple and medium complexity orders we may easily decide which distribution to choose: in both cases the gamma distribution is one of the best fits. For difficult complexity orders the decision is harder, since the best fitting distributions belong to different classes.
To decide which distribution fits best we use an additional tool: The Q-Q plot where the quantiles of the empirical distribution on the ordinate are compared to the quantiles of the fitted distributions (Fig. 2).
In the Q-Q plot, a curve that ascends more steeply than the empirical distribution on an interval indicates that the fitted distribution puts less probability on that interval, and vice versa. An ascent that remains steeper for large t also means that the respective distribution is more heavy-tailed than the empirical distribution. For difficult complexity orders we see that all distributions have lower probabilities for the interval [6, 10] and compensate for that on the interval [10, 14]. For times > 14 all distributions are parallel to the empirical distribution, but on different levels. The gamma distribution fits better for small times, whereas the log-logistic distribution fits better for large times. Since the data for small times is more reliable (10 points on the interval [0, 7] compared to 3 on the interval [14, 26]), we rely on the better fit for small times and therefore prefer the gamma distribution. For more reliable results we would need more data (Fig. 1).
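The points of such a Q-Q plot can be computed as in the following sketch, which pairs each sorted observation (ordinate) with the corresponding quantile of a fitted distribution (abscissa). The lognormal parameters, the plotting positions $(i + 0.5)/n$ and the data are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

def lognormal_quantile(p, mu, sigma, t_min=0.0):
    """Quantile function of a lognormal distribution shifted by t_min."""
    return t_min + math.exp(mu + sigma * NormalDist().inv_cdf(p))

def qq_points(data, quantile):
    """Pairs (fitted quantile, empirical quantile) for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    return [(quantile((i + 0.5) / n), x) for i, x in enumerate(xs)]

times = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0, 11.0, 30.0]  # hypothetical, minutes
mu, sigma = 2.7, 0.4  # hypothetical fitted parameters
for fitted, observed in qq_points(times, lambda p: lognormal_quantile(p, mu, sigma)):
    print(round(fitted, 2), observed)
```

Points close to the diagonal indicate agreement between fitted and empirical quantiles; systematic departures over an interval correspond to the probability shifts discussed above.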

Section 2:
Again, we see that the mean time $LT_{Fin}$ increases with the complexity of the order. For medium complexity orders we can easily decide which distribution to choose: the single best fit is the lognormal distribution. Therefore, it is classified as long.
For simple and difficult complexity orders we have the problem of deciding which distribution fits best: for difficult complexity orders the best fitting distributions belong to different classes, and for simple complexity orders no distribution fits properly according to the K-S statistics. To decide which distribution fits best we again consult the Q-Q plot. For simple complexity orders (Fig. 3) we easily see, in accordance with the K-S statistics, that the lognormal distribution is the best fit of the data. However, the probability for small t (up to 30 min) is larger and the probability for medium t (from 30 min to 80 min) is smaller than modeled by any of the distributions (Fig. 5). Since we have a large number of observations, resampling from the empirical distribution would be an option.
Resampling means that instead of simulating the unknown distribution with a best-fit estimate, the distribution is simulated directly by the empirical distribution. This is done by drawing numbers from the set of data independently and with equal probability.
Advantages of resampling compared to sampling from a fitted distribution:
1. The simulated distribution equals exactly the observed distribution.
2. It works for arbitrary data and is easy to handle.
Disadvantages of resampling compared to sampling from a fitted distribution:
1. Only events that have actually occurred can be sampled, i.e., rare events (here large $LT_{Fin}$) will not be simulated.
2. This only works for large numbers of observed data. Otherwise there would be too much variance due to random perturbations.
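Resampling as described above amounts to drawing with replacement from the recorded times. A minimal sketch, with hypothetical data and a fixed seed chosen only for reproducibility:

```python
import random

def resample(data, size, rng):
    """Draw size values from data independently and with equal probability."""
    return rng.choices(data, k=size)

observed = [5.0, 7.5, 8.0, 10.0, 12.0, 15.0, 21.0, 40.0]  # hypothetical, minutes
rng = random.Random(42)
simulated = resample(observed, 1000, rng)

# No simulated time can exceed the largest recorded one,
# which is the first disadvantage listed above.
print(max(simulated) <= max(observed))
```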
For difficult complexity orders the Q-Q plot does not give any additional information (Fig. 4). We would need more data to classify the distribution correctly.
Classifying the distribution as difficult and simulating $LT_{Fin}$ with a fitted lognormal distribution is the most plausible way to model $LT_{Fin}$ with the accessible data (Fig. 1).

Section 3:
Again, we see that the mean time $LT_{Fin}$ increases with the complexity of the order. For medium and difficult complexity orders we can easily decide which distribution to choose: the single best fit is the gamma or the lognormal distribution, respectively. Accordingly, and given the parameter a = 2.396 for the gamma distribution, they are classified as short or long, respectively. For simple complexity orders no distribution fits properly according to the K-S statistics. As also seen for other orders, the best fit is the gamma distribution. However, the empirical distribution has (at least) three local maxima (trimodal) and should be quite reliable with 567 observations. The value of the K-S statistic is therefore too large, and we again suggest sampling directly from the empirical distribution (Fig. 6).
However, since the local maxima are at 5 min, 10 min, 15 min and 20 min there might be a flaw in recording data. Some recorded times seem to be rounded to the next multiple of 5 min. For a more detailed analysis of the distribution in this case the experiment should be repeated.
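A simple way to look for this kind of rounding is to measure the share of recorded times that fall exactly on a multiple of 5 min. The sketch below uses hypothetical data; the comparison value of roughly one fifth assumes that unrounded times recorded to the whole minute would be spread evenly over the last digit, which is an assumption and not a result of the study.

```python
def share_of_multiples(times, step=5.0, tol=1e-9):
    """Fraction of recorded times lying on a multiple of step."""
    hits = sum(1 for t in times if abs(t / step - round(t / step)) < tol)
    return hits / len(times)

recorded = [5.0, 10.0, 12.0, 15.0, 20.0, 7.5]  # hypothetical, minutes
share = share_of_multiples(recorded)
# A share far above about 0.2 hints at rounding to 5-minute marks.
print(share)
```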

Section 4:
Again, the summaries for Section 4 can be found in Fig. 1. This time, we see that the mean time $LT_{Fin}$ increases with the complexity of the order, as do the empirical times $t_{\min}$ and $t_{\max}$. For difficult complexity orders we can easily decide which distribution to choose: the best fit is the lognormal distribution. Therefore, it is classified as long.
For simple and medium complexity orders no distribution fits properly according to the K-S statistics. This time, the best fit is either the gamma or the lognormal distribution, with only slight differences in the value of the K-S statistic. Let us again have a look at the Q-Q plots (Figs. 7 and 8).
The distribution of the simple complexity orders is again bimodal and thus no distribution fits properly. Again we would suggest sampling directly from the empirical distribution, if we were sure that the data was recorded correctly. Since the local maxima are at 5 min, 10 min, 15 min and 20 min, there is definitely a flaw in the recorded data. Nearly all recorded times seem to be rounded to the next multiple of 5 min (Fig. 9).
For more detailed analysis of the distribution of times, the experiment should be repeated. The best fit for medium complexity orders is either the lognormal or the log-logistic distribution having for all times nearly the same increase as the empirical distribution in the Q-Q plot (Fig. 8). For continuity reasons we suggest using the lognormal distribution for sampling.

RESULTS AND DISCUSSION
We found that, depending on the difficulty of the order process in a service-oriented supply chain, different distributions qualified as the best fit for the respective data sets. For short or medium duration orders (recognizable as a rule of thumb by an empirical mean of less than 10) gamma and Weibull distributions were the best fits. For long duration orders the log-logistic or lognormal distributions fitted best. According to this result, we suggest simulating $LT_{Fin}$ depending on the complexity of the order.
* $LT_{Fin}$ of short duration orders should be simulated with fitted gamma or Weibull distributions. Keep in mind that, as seen in Fig. 1, a good estimate for $t_{\min}$ is crucial for gamma distributions. It is therefore very important to know the minimal time from data analysis or, even better, from theoretical considerations. Since the best fits of the gamma distribution have suitable minima $t_{\min}$ according to the empirical distribution, the gamma distribution should be preferred.
* $LT_{Fin}$ of long duration orders should be simulated with fitted log-logistic or lognormal distributions. In all cases the best fit was observed at $t_{\min} = 0$. In most cases the lognormal distribution was the best fit, so we suggest always using the lognormal distribution for sampling data.
* Some of the empirical distributions were bimodal, and one was even trimodal. Since we can rely on many observations, this should not be the result of statistical perturbations. In all these cases no distribution fitted properly. For actual simulations we therefore sample from the empirical distribution. However, since the local maxima of the empirical distribution were always at multiples of 5 min, there might also be a flaw in recording the data.
We suggest the simulation of times $LT_{Fin}$ according to Table 2. For future work it might be interesting to find properly fitting bimodal distributions and extend the classification of distributions of time accordingly. In this context compare also the work on hazard models (technical background) by Bain and Engelhardt [5] and on survival times (biological and medical background) by Lee [6] and Kalbfleisch and Prentice [7].
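The three simulation recommendations above (shifted gamma for short duration orders, lognormal with $t_{\min} = 0$ for long duration orders, empirical sampling for multimodal data) could be sketched as follows. All parameter values, the data and the function names are hypothetical placeholders; the actual assignment per section and complexity is the one given in Table 2.

```python
import random

def sample_short(rng, shape, scale, t_min):
    """Short duration order: gamma distribution shifted by t_min."""
    return t_min + rng.gammavariate(shape, scale)

def sample_long(rng, mu, sigma):
    """Long duration order: lognormal distribution with t_min = 0."""
    return rng.lognormvariate(mu, sigma)

def sample_empirical(rng, data):
    """Multimodal case: draw directly from the recorded times."""
    return rng.choice(data)

rng = random.Random(7)  # fixed seed only for reproducibility
recorded = [5.0, 10.0, 10.0, 15.0, 20.0]  # hypothetical, minutes

short_times = [sample_short(rng, 2.0, 3.0, 4.0) for _ in range(20000)]
long_times = [sample_long(rng, 3.0, 0.5) for _ in range(20000)]
empirical_times = [sample_empirical(rng, recorded) for _ in range(20000)]

# The sample mean of the shifted gamma should be near t_min + shape * scale.
print(sum(short_times) / len(short_times))
```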