Duration Modeling in Undergraduate Econometrics Curriculum Via Excel

The main goal of an econometrics course is to prepare students for more advanced levels of study and to enable them to cope with the empirical problems they are likely to face. In recent years, there has been a tendency to purge merely theoretical topics from econometrics textbooks by including only time-honored methods. Survival analysis, however, has so far failed to attract the attention of textbook writers. I conjecture that the general perception of the subject as an advanced topic, together with software barriers, has led to this attitude. In this paper, I work out an Excel file to outline the fundamental concepts of duration analysis at an introductory level. The file may also be used by practitioners in appropriate cases with minimal learning cost. In this way, I hope to dispel the myth of difficulty associated with this subject matter.


Introduction
Undergraduate education aspires to equip students with the tools most needed to tackle the real life situations they are likely to face upon assuming positions in the post-graduate world.
Arguably, one of the most application-oriented courses in economics and business is econometrics. Realizing the need to provide students with the techniques they will make use of at work, econometrics textbooks try to tailor themselves to this challenge. They mainly try to avoid being an encyclopedia of econometrics, which might include any topic under the sun ever discussed in academic circles, and instead concentrate on a select set of issues chosen by the criterion indicated above.
Ironically, one such topic, duration analysis, has so far failed to find its way into undergraduate econometrics curricula. My survey of several undergraduate textbooks supports this observation [i]. As Hosmer and Lemeshow (1999) rightly complain, even though there are many theoretical and empirical journal articles on the subject, any course dealing with duration analysis is considered beyond the reach of undergraduate students, and even for graduate students it is postponed to advanced levels of study. I contend, however, that an introduction to the topic should (or perhaps "could") be included in the undergraduate curriculum, and that it is quite possible to impart the fundamental ideas of duration analysis even with a minimal understanding of statistics and econometrics. In this paper, I build a simple Excel file to introduce duration analysis in an easy-to-understand environment. Furthermore, as the concise literature review below attests, duration analysis is applicable in business and economics as well as in many other social sciences, so practitioners too may benefit, at minimal learning cost, from the discussion and the technique presented here whenever a case is suitable for such an analysis.
The paper proceeds with a review of the literature, with special reference to economics and business applications of survival analysis. The following section presents duration analysis in a rather non-technical way, with as few formulas as possible. The major contribution of the paper is found in the section Life Tables for Survival Analysis, where an artificial data set is used to generate non-parametric estimates of survival functions for different subject groups; the calculations and the interpretation of the results are also presented there. Finally, the paper concludes with a plea for an effort to introduce duration analysis early on in the higher education curriculum in economics and business.

Literature Review
Duration analysis is the study of the length of time it takes economic agents to leave a specific state; that is, it is the study of "time to event." It has found applications in many fields, but a common name for it has yet to emerge. The naming convention largely reflects the context of application. Possible names include 'event history analysis' (sociology), 'failure time analysis' or 'reliability analysis' (engineering), and 'duration analysis,' 'survival analysis' or 'transition analysis' (economics) (see Allison, 2001, on this). Encouraged (!) by the state of confusion in the naming sphere, I will use these terms interchangeably in this study.
Although survival analysis in population studies goes back to before the 1700s (Hald, 1990), the econometric analysis of the duration of events (processes in transition) began in the 1970s (Lancaster, 1997; Van den Berg, 2001). Examples of duration analysis in business and economics cover a wide variety of topics, but by far the largest application has been in labor economics. The book by Lancaster (1997), the most comprehensive econometrics book on the issue, makes use of labor economics examples throughout. Kiefer (1988) and Yamaguchi (1991) are other sources for a thorough analysis of duration methods. I will cite a few papers in various branches of economics and business to convey the possible scope of this tool. Naturally, this is not meant to be an exhaustive list. Examples concerning the duration of employment-related decisions, such as unemployment benefits, strike durations, on-the-job training programs and labor market participation, are Nickel (1979), Lancaster (1979), Heckman and Borjas (1980), Kennan (1985), Mortensen (1986), Bonnal, Fougere and Serandon (1997), Butler, Anderson and Burkhauser (1989), Meyer (1990), Kim (1990), Devine and Kiefer (1991), Sahin (1999), Bollens and Nicaise (1994), Cockx and Bardoulat (2000), Cockx and Dejemeppe (2002) and Sahin and Genc (2003).
Population and migration economics has used duration analysis extensively to study such topics as the duration of marriages, the length of migration spells and the time until the birth of a child, as well as issues surrounding family breakdowns. One can refer to Lillard (1993), Lindstrom (1996), Walker (1990) and Ferguson, Horwood and Shannon (1984) as examples. Vilcassim and Jain (1991), Antonides (1988) and Boizot, Robin and Visser (1997) represent various applications in the marketing and consumer economics fields. Gönül and Srinivasan (1993) apply a competing risks model with unobserved heterogeneity to study households' repeat-purchase and switching patterns among brands. Fok et al. (2002) study the impact of promotion on the duration between purchases. Pakes and Schankerman (1984), who study the duration of a patent, can be mentioned as an example in the industrial organization field. What motivates someone to make a major investment decision after a lull period is studied in Anti Nilsen and Schiantarelli (1998). The determinants of retention in graduate level economics programs are studied in Ridder and van Ours (2000).
To do justice to non-economics fields where perhaps these techniques have found larger audiences, the reader is referred to Yamaguchi (1991) for Sociology, Collett (1994) and Hosmer and Lemeshow (1999) for Biometrics and Medical Sciences and Cox and Oakes (1984) for Statistics applications of the duration analysis.

A Brief Introduction to Survival Analysis
Survival analysis deals with the length of time it takes for something to happen: for example, how long it takes for someone to leave the state (spell) of unemployment or to leave a first job, or how long it takes for a machine to fail. A more interesting issue is the impact of a certain treatment on the length of time it takes for an event to happen. To illustrate the point, we can compare the survival (retention) times of employees in jobs where childcare (say, a day care center) is provided with those of employees in jobs where it is not.
The basic difference between traditional microeconomic analysis and duration analysis is that in the former the process is assumed to be complete, while in the latter it is not necessarily completed by the time data are collected. This point calls into question the validity of inferences based on the assumption of completeness, because in real life economic agents do not necessarily make their decisions at the same points in time at which data are collected. Moreover, some of the agents may have dropped out of the observation process for various reasons, a case which is called censoring. Ignoring censoring could lead to unreliable results (Green, 1993, Chapter 22). For censored agents, all we know is that they had survived until the last time we collected data about them. This is called right censoring. For more on this issue, the reader is referred to the aforementioned references [ii].
A related concept in duration analysis is the hazard (rate) function, h. Simply put, the hazard rate is the probability of an event happening immediately after time t, given that it has not taken place until time t. For example, it is the probability of someone finding a job today (exiting the current state of unemployment today) given that he has not been able to find one so far. Statistically, the hazard function of an event can be written as [iii]

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \tag{1}$$

where T denotes the duration of the spell. The only difference between the hazard function and a probability density function, pdf [iv], is the conditionality imposed by the hazard function: the pdf measures the probability of someone finding a job today irrespective of how long the person might have been in the job market.
On a related note, notice that if we interpret the hazard function as the risk of an event occurring at any time once time t is reached, the inverse of the hazard function, 1/h(t), can be interpreted as the expected length of time until the event occurs assuming h(t) remains constant (Allison, 2001).
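The reciprocal interpretation can be checked numerically. The following Python sketch is an illustration only and is not part of the Excel file: it assumes a constant monthly hazard of 0.2 (a hypothetical value), under which the survival function is S(t) = exp(-h t), and integrates S(t) over time to approximate the expected duration, which should come out close to 1/h = 5 months.

```python
import math


def expected_duration(h, horizon=200.0, step=0.01):
    """Approximate E[T] as the integral of S(t) = exp(-h * t) (trapezoid rule).

    Assumes a constant hazard h; horizon and step are illustrative choices.
    """
    n = int(round(horizon / step))
    total = 0.0
    for i in range(n):
        left = math.exp(-h * i * step)
        right = math.exp(-h * (i + 1) * step)
        total += 0.5 * (left + right) * step
    return total


h = 0.2  # hypothetical constant monthly hazard
mean_duration = expected_duration(h)
print(round(mean_duration, 3), 1.0 / h)
```

If the hazard changes over time, 1/h(t) is only a local approximation, which is why the constancy assumption matters.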
The statistical representation of the survival function is

$$S(t) = P(T \ge t) = 1 - F(t) \tag{2}$$

where T is the duration and F stands for the cumulative distribution function, cdf. One can easily relate the two equations to one another via

$$h(t) = \frac{f(t)}{S(t)} \tag{3}$$

where f is the pdf. It is clear that there is a well-defined road map among f, h, F and S. Thus, an analysis carried out on some of them can be transferred to the others, which means that discussing the hazard of a phenomenon will readily yield its characteristics in terms of its survival and vice versa.
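The road map among f, F, S and h can be illustrated with a short Python sketch. The Weibull distribution and the parameter values below are assumptions chosen only for illustration (the paper does not commit to any distribution); the point is that the hazard obtained as f(t) / S(t), with S(t) = 1 - F(t), agrees with the known closed form of the Weibull hazard.

```python
import math


def weibull_pdf(t, shape, scale):
    """Density f(t) of a Weibull distribution."""
    z = (t / scale) ** shape
    return (shape / scale) * (t / scale) ** (shape - 1.0) * math.exp(-z)


def weibull_cdf(t, shape, scale):
    """Cumulative distribution function F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def survival(t, shape, scale):
    """Survival function S(t) = 1 - F(t)."""
    return 1.0 - weibull_cdf(t, shape, scale)


def hazard(t, shape, scale):
    """Hazard function h(t) = f(t) / S(t)."""
    return weibull_pdf(t, shape, scale) / survival(t, shape, scale)


shape, scale = 1.5, 6.0  # hypothetical parameter values
for t in (1.0, 5.0, 10.0):
    closed_form = (shape / scale) * (t / scale) ** (shape - 1.0)
    print(t, round(survival(t, shape, scale), 4),
          round(hazard(t, shape, scale), 4), round(closed_form, 4))
```

With shape equal to 1 the Weibull reduces to the exponential case, and the hazard is the constant 1/scale.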
There are parametric as well as nonparametric methods of survival analysis [v]. Nonparametric analysis is handled through so-called life tables. It is a ratio analysis whose goal is to describe the proportion of agents surviving over various time intervals. Life tables are mostly drawn for two groups (or potentially more), one of which has gone through some treatment (the treatment group) while the other (the control group) has not. The basic idea is to see the impact of the treatment on the survivorship of the subjects under study.
Interpreted this way, life tables for different categories of economic agents yield results comparable to those of a parametric analysis without actually running one.
If the impact of several independent variables, called covariates in the duration literature, needs to be detected, the parametric counterpart of life tables can be employed. But that requires the researcher to make quite a few assumptions regarding the shape of the survival distribution. Although this method may be more akin to the popular regression methods, one should not forget that with censored data the results of parametric methods can depend heavily on the researcher's choice of distribution. That is why life tables for various groups can be used for inference purposes with minimal loss of qualitative generality.
In what follows, a simple method to obtain life tables is discussed.

Life Tables for Survival Analysis
As mentioned before, the main reason duration analysis is left out of the undergraduate curriculum is the supposed difficulty associated with the method. The cost of learning a software package might also play a role in this outcome. But I believe that this important tool could easily be incorporated into the undergraduate curriculum without much cost as long as students are capable of using Excel, which can almost be taken for granted.
In this section, I generate a hypothetical data set and then discuss the basics of survival analysis via life tables. Suppose we follow up 13 people in the job market for 15 months. In reality, this is a very small data set, and all the statistical troubles with small samples should come to mind when interpreting findings from it. However, it serves the purpose of illustrating our points. We are interested in the length of time they stay in the job market. All individuals are high school graduates. However, some of them took college level pre-business classes (Treatment Group) before entering the job market while others did not (Control Group). The treatment group earned a pre-business certification as a result. We would like to see the survival of both groups individually and in a comparative sense. The comparison of the survival of the high school graduates in the job market will give us the partial impact of the pre-business classes (certificate) on finding a job. The data set is shown in Figure 1 [vi].

Figure 1 about here
In the data set, ID is a personal identifier such as Person 1 or Person 2. It refers to the "individual," that is, the subject who is followed through a given state to see whether he or she has exited it. In our case, the individual may or may not leave the current state of being unemployed. If he or she has left the state of unemployment, we code 1 under the column Unemployed. If the individual has not left the state by the time our data collection period comes to an end, which in this example is after 15 months, he or she is coded 0 under the same column. Thus, Unemployed is a binary status indicator variable. A person who has left the job market is said to have "exited" the state.
Another dummy variable is PreBus, which takes on a value of 1 for the subjects who took the college level pre-business classes (Treatment Group), and 0 for those who did not (Control Group). Duration in the data set refers to the length of time in months a person stayed in the job market.
There are 8 people in our data set who fall in the control group and 5 in the treatment group. Of these 13 people, all but one, ID 13, have exited the state of unemployment. Thus, individual 13 is censored in our example [vii]. The minimum stay in the job market is 1 month while the maximum surpasses 15 months. On average, people stay in the job market 6.69 months with a standard deviation of 5.38 months [viii]. The median duration is 5 months. In practice, it is worthwhile to look at the descriptive statistics of each group, but this is left to the reader since it does not constitute the main subject of this paper.
To generate life tables we divide the observation period into 10 equal subperiods of 1.5 months. The choice of a range, r, is arbitrary, and basically depends on the interests of the researcher. We re-present the data pertaining to the two sub-groups in Figure 2 for the control group and in Figure 3 for the treatment group.

Figures 2-4 about here
Concentrating on the control group first, we carry out the calculations as in Figure 4. R0 and R1 in the figure refer to the beginning and the end of the subperiod, respectively. Thus, R1 = R0 + r. Notice that the beginning of the very first subperiod is zero while its end is dictated by the researcher's choice of range, in our case r = 1.5 months. The rest follow suit. The column labeled Entered gives the total number of people still in the given state (process) at the start of the subperiod, that is, those who are still unemployed. Initially, everybody is in the unemployment state, but in the subsequent subperiods this number is reduced as people exit the state, as shown in the column titled Dropped. If a person is censored, a 1 is coded in the Censored column, otherwise a 0. Only the last observation is censored in this group because this person had not left the job market by the time our observation (data collection) period came to an end.
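The bookkeeping behind the R0, R1, Entered, Dropped and Censored columns can be sketched in Python as follows. This is only an illustration under stated assumptions: the durations and exit flags are hypothetical (they are not the data of Figure 1), the function name is made up, and a duration d is assigned to the subperiod with R0 < d <= R1, a convention the text does not spell out.

```python
import math


def life_table_counts(durations, exited, r=1.5, n_periods=10):
    """Build the count columns of a life table.

    durations: months spent in the state; exited: 1 if the person left the
    state, 0 if censored. Assumes a duration d belongs to (R0, R1].
    """
    rows = [{"R0": j * r, "R1": (j + 1) * r,
             "Entered": 0, "Dropped": 0, "Censored": 0}
            for j in range(n_periods)]
    for d, e in zip(durations, exited):
        # Index of the subperiod containing d, clipped to the table range.
        j = min(max(math.ceil(d / r) - 1, 0), n_periods - 1)
        if e == 1:
            rows[j]["Dropped"] += 1
        else:
            rows[j]["Censored"] += 1
    at_risk = len(durations)
    for row in rows:
        row["Entered"] = at_risk  # people still in the state at R0
        at_risk -= row["Dropped"] + row["Censored"]
    return rows


# Hypothetical example: six people, the last one censored at 15 months.
durations = [1, 2, 4, 5, 9, 15]
exited = [1, 1, 1, 1, 1, 0]
rows = life_table_counts(durations, exited)
for row in rows:
    print(row)
```

Each subsequent Entered value is the previous one less those who dropped or were censored, which mirrors the way the column shrinks in the worksheet.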
Proportion:q represents the ratio of those who could NOT survive the subperiod. This is called the Failure Rate [ix]. It is calculated as the ratio of those who exited during the subperiod (those who could NOT survive it) to all those who entered it less the censored ones, that is, q = Dropped / (Entered - Censored). Another proportion, Proportion:p, gives the ratio of those who survived the subperiod. It is calculated as 1-q and is called the Survival Rate. However, what we mean by survival, as in equation 2, is given by the column titled P, which is called the Cumulative Survival Function.
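As a minimal Python sketch of the two proportions just described, the function below computes q as the number dropped divided by the number entered less the censored ones, and p as 1 - q. The function name and the input counts are hypothetical, and returning q = 0 when nobody is left at risk is an assumption made only to avoid a division by zero.

```python
def failure_and_survival_rates(entered, dropped, censored):
    """Return (q, p) for one subperiod of a life table."""
    at_risk = entered - censored
    q = dropped / at_risk if at_risk > 0 else 0.0  # failure rate
    p = 1.0 - q                                    # survival rate
    return q, p


# Hypothetical subperiod: 8 people enter, 2 exit, nobody is censored.
q, p = failure_and_survival_rates(entered=8, dropped=2, censored=0)
print(q, p)
```

In a worksheet the same calculation is a single cell formula dividing the Dropped cell by the difference of the Entered and Censored cells.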
Observe that the first value in this column is 1. This is simply a convenient assumption to get the process going. It means that initially all individuals have a survival of 100%. The cumulative survival function in each subperiod is calculated as

$$P_j = P_{j-1} \times p_{j-1}, \qquad P_1 = 1 \tag{4}$$

where j represents the current subperiod. For the sake of completeness, we present in the figure what is called the Cumulative Failure Rate, 1-P. Its interpretation corresponds to that of the Cumulative Survival Rate. The numerical values of these calculations are presented in Figure 5.
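The cumulative survival column can be sketched in a few lines of Python. The sketch assumes that the first entry of the column is fixed at 1 and that each later entry equals the previous cumulative survival multiplied by the previous subperiod's survival rate p; the function name and the p values are hypothetical.

```python
def cumulative_survival(p_rates):
    """Cumulative survival P at the start of each subperiod.

    Assumes P = 1 for the first subperiod and, thereafter, the previous
    P multiplied by the previous subperiod's survival rate.
    """
    P = [1.0]
    for p in p_rates[:-1]:
        P.append(P[-1] * p)
    return P


p_rates = [0.75, 1.0, 0.5, 1.0]          # hypothetical survival rates
P = cumulative_survival(p_rates)
cum_failure = [1.0 - x for x in P]       # Cumulative Failure Rate, 1 - P
print(P)
print(cum_failure)
```

Because each value depends only on the row above it, the column can be filled down in a spreadsheet once the first cell is set to 1.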

Figures 5 and 6 about here
Figure 6 depicts a survival chart for the control group, which shows the cumulative survival rate (column labeled P) against the duration (column labeled R1). The choice between R0 and R1 is not standard, but most of the literature prefers R1 to R0. Alternatively, the midpoint between R0 and R1 could be used. The main trick at this stage is to interpret the graph, which is nothing but correctly reading the numbers off it. As is clear from the graph, at least 60 per cent of the subjects in the control group could still survive up to the 9th period. Also, about 25 per cent of the individuals stayed in the job market for the whole observation period of 15 months.
If all we are interested in is this group, we can stop here. But usually we would also like to measure the impact of an independent variable on survivorship. In our case, it happens to be the college level pre-business courses. We replicated all of our calculations for the treatment group in Figures 7 through 9. Since the process is identical, we will not repeat the discussion of the calculations here.

Figures 7-9 about here
A close inspection of Figure 9 reveals that about 60 per cent of the individuals exited the current state by the 4th period. In comparison to the control group, this is a much faster exit rate. That is, those who took the pre-business classes while at high school to earn a pre-business certificate tend to find jobs a lot faster than those who did not. Additionally, the percentage of those who stayed in the job market for the whole period is lower in the treatment group than in the control group [x]. Hence, we conclude that a pre-business certification earned by taking college level business classes while at high school makes finding a job after graduation easier than it is for those who did not obtain the certificate.
Before concluding, I believe it would be worthwhile to mention that the hazard function and pdf can be easily incorporated into the Excel file generated above. In the interest of space, however, the Excel snapshots are not included here. For the hazard function and the pdf, all one has to do is to enter the following formula into Excel, respectively: where all the variables are as defined before. All of the components of these computations are readily available in the Excel file already generated.
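One common way to obtain these two quantities from a life table is sketched below in Python. The estimators used, f_j = P_j q_j / r for the density and h_j = 2 q_j / (r (1 + p_j)) for the hazard, are the usual actuarial life-table estimators and are an assumption here; they are not necessarily the formulas entered in the Excel file. The function name and the input values are hypothetical.

```python
def density_and_hazard(P, q, r=1.5):
    """Actuarial life-table estimates per subperiod of width r.

    P: cumulative survival at the start of each subperiod.
    q: failure rate of each subperiod (so p = 1 - q).
    Assumed estimators: f = P * q / r and h = 2 * q / (r * (1 + p)).
    """
    f = [P_j * q_j / r for P_j, q_j in zip(P, q)]
    h = [2.0 * q_j / (r * (2.0 - q_j)) for q_j in q]  # 1 + p = 2 - q
    return f, h


P = [1.0, 0.75, 0.75]   # hypothetical cumulative survival values
q = [0.25, 0.0, 0.5]    # hypothetical failure rates
f, h = density_and_hazard(P, q)
for j, (f_j, h_j) in enumerate(zip(f, h), start=1):
    print(j, round(f_j, 4), round(h_j, 4))
```

Both estimates use only the r, q, p and P values of the life table, so they can be added to a worksheet as two extra columns.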

Conclusions
Duration analysis (also known as survival analysis besides quite a few other names) has found sympathetic audiences in engineering and medical research as well as several social sciences for a long time. Its history in economics and business is short by comparison, but arguably rich. In almost all sub-branches of economics and business we can find some application of survival analysis. Nevertheless, this toolbox has been avoided by the undergraduate econometrics textbooks since it is considered an advanced topic beyond the possible grasp of students at this level. Given the perspective it can offer to students in formulating real life problems, I believe an introduction to this topic, however basic, ought to be made at the early stages of education to alert the students to these kinds of problems. Furthermore, practitioners, who may not have plenty of time to invest and venture in new methods of econometric analysis, too, should find the technique interesting for suitable cases, and the way it is presented here easy to comprehend.
With mainly these two interest groups in mind, I worked out a simple Excel file which uses an artificially generated data set to introduce the reader to duration analysis. Without being bogged down in theoretical jargon full of formulas, but still keeping the integrity of the scientific treatment of the topic, I show the calculations needed to generate life tables for two different groups, and provide an interpretation of the findings with the help of graphs. This provides a relatively complete presentation of the life table analysis procedure in a manner that is easy to understand and apply [xi]. Needless to say, there are software packages that carry out these calculations with a point and click of the mouse, but my presentation opens up the black box for the student, who may proceed from here to more sophisticated applications of survival analysis.

Endnotes
[i] See the Appendix for the information regarding the textbooks surveyed.
[ii] There is also a difference between duration analysis in economics and purely statistical duration analysis, which lies in the fact that economic decisions have to be optimal choices (Lancaster, 1997).
[iii] The treatment of the subject is rather terse here, but interested readers are referred to the books mentioned in the paper above.
[iv] The pdf can be written as $f(t) = \lim_{\Delta t \to 0} P(t \le T < t + \Delta t) / \Delta t$.
[v] There are also semi-parametric methods such as the Cox Proportional Hazards Model, which requires less effort in the specification of the distribution.
[vi] Notice the cell coordinates in the Excel file.
[vii] There is nothing special about ID 13 or its being the last person in the data set. Any other person might have been censored, too, because, say, we lost track of him or her sometime during the data collection period. Nothing would be affected by this as far as the calculations are concerned.
[viii] A word of caution is in order here regarding the mean value of duration with partially censored data. The true mean is at least as large as the one reported above, since the censored observations would have increased the average had their exits been observed during the follow-up time. In that case, the median might be a more sensible statistic to consider.
[ix] Unfortunately, naming is not standard in duration analysis studies; as mentioned before, it reflects the context most of the time and may even be confusing at times. It takes some getting used to, but it is worthwhile to invest in it.
[x] It is also common to superimpose one graph on the other for better visibility.
[xi] A more involved worksheet with further statistics can be requested from the author.
[xii] This list does not include the books mentioned in the appendix.