Estimation of the COCOMO Model Parameters Using Genetic Algorithms for NASA Software Projects

Defining the estimated cost, duration and maintenance effort of a project early in the development life cycle is a valuable goal for software projects. Many model structures have evolved in the literature. These model structures consider modeling software effort as a function of the developed lines of code (DLOC). Building such a function helps project managers to accurately allocate the available resources for the project. In this study, we present two new model structures to estimate the effort required for the development of software projects using Genetic Algorithms (GAs). A modified version of the famous COCOMO model is provided to explore the effect of the adopted software development methodology on effort computation. The performance of the developed models was tested on the NASA software project dataset [1]. The developed models were able to provide good estimation capabilities.


INTRODUCTION
In recent years, the development of large-scale software projects has gained growing interest [2,3]. Being able to define the software size, the development duration and the required facilities has become an increasingly challenging task. The reason is that software architecture, requirements, tools and techniques have become more complex.
Project managers need to identify the cost estimate so that they can evaluate project progress and achieve better resource utilization. It was found that the main cost driver is the effort [4]. The primary element which affects the effort estimation is the developed lines of code (DLOC). The DLOC includes all program instructions and formal statements.
One of the famous model structures used to estimate the software effort is the COnstructive COst Model (COCOMO). COCOMO was developed by Boehm [4,5] . This model was built based on 63 software projects. The model helps in defining the mathematical relationship between the software development time, the effort in man-months and the maintenance effort [6] .
Soft-computing techniques were explored to build efficient effort estimation model structures. In [7], the authors provided a survey on cost estimation models using artificial neural networks. Fuzzy logic and neural networks were used for software engineering project management [8]. A fuzzy COCOMO model was developed [9].
Recently, many questions about the applicability of evolutionary computation techniques to building estimation models have been raised [10]. The objective of this study is to build an evolutionary model for estimating software effort using genetic algorithms. GAs will be used to estimate the parameters of a COCOMO-type effort estimation model. A genetic algorithm is an adaptive search algorithm based on the Darwinian notion of natural selection. GAs search the space of all possible solutions using a population of individuals, which are considered potential solutions of the problem under study. These solutions are evaluated based on their fitness. The solutions that best fit the objective criterion survive into the upcoming generations and produce "offspring" which are variations of their parents [24].
Stochastic algorithms: There exist many engineering and computer science problems for which no adequate, robust and global algorithms exist. Most of these problems are optimization problems [11] . There are two classes of algorithms often used to deal with such complex problems.
They are the deterministic and the stochastic algorithms. The deterministic algorithms usually provide approximate solutions and not optimal ones. A priori knowledge about the starting search location affects the search process: a poor starting point can direct the search toward a locally optimal solution. This represents a challenge for deterministic search. For hard optimization problems, it is often recommended to use probabilistic algorithms. These algorithms do not guarantee globally optimal solutions, but they have the advantage of randomly generating solutions with a higher level of accuracy.
Genetic algorithms: Genetic Algorithms (GAs) are among those stochastic search algorithms. They are adaptive search procedures which were introduced by John Holland [12] and extensively studied by Goldberg [13], De Jong [14,15] and others [16]. GAs have been successfully used in a wide variety of difficult numerical optimization problems. They have been successfully used to solve system identification, signal processing and path planning problems [17][18][19][20].
Evolutionary process: The evolutionary process of GAs starts with the computation of the fitness of each individual in the initial population. While the stopping criterion is not yet reached, we do the following:
* Select individuals for reproduction using some selection mechanism (e.g. tournament, rank, etc.).
* Create offspring using crossover and mutation operators. The probabilities of crossover and mutation are selected based on the application.
* Compute the new generation.
This process ends either when the optimal solution is found or when the maximum number of generations is reached.
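The loop described above can be sketched as follows. This is a minimal illustration in Python, not the GAOT configuration used in the study: the function name `run_ga`, the binary tournament selection, arithmetic crossover, uniform-reset mutation, elitism and all default parameter values are assumptions chosen for brevity, and the only stopping criterion is the maximum number of generations.

```python
import random

def run_ga(fitness, bounds, pop_size=30, generations=100,
           p_crossover=0.8, p_mutation=0.1, seed=0):
    """Minimise `fitness` over real-valued vectors that stay inside `bounds`."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def tournament(scored, k=2):
        # Draw k individuals at random and keep the one with the lowest value.
        return min(rng.sample(scored, k), key=lambda pair: pair[0])[1]

    def crossover(p1, p2):
        # Arithmetic crossover: a random convex combination of the two parents.
        w = rng.random()
        return [w * x + (1 - w) * y for x, y in zip(p1, p2)]

    def mutate(ind):
        # Each gene is redrawn from its range with probability p_mutation.
        return [rng.uniform(lo, hi) if rng.random() < p_mutation else g
                for g, (lo, hi) in zip(ind, bounds)]

    def best_of(population):
        scored = [(fitness(ind), ind) for ind in population]
        return scored, min(scored, key=lambda pair: pair[0])

    population = [random_individual() for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        scored, (best_fit, best) = best_of(population)
        history.append(best_fit)
        next_gen = [best]  # elitism: the best individual survives unchanged
        while len(next_gen) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            child = crossover(p1, p2) if rng.random() < p_crossover else list(p1)
            next_gen.append(mutate(child))
        population = next_gen

    _, (best_fit, best) = best_of(population)
    history.append(best_fit)
    return best, best_fit, history


# Usage on a toy problem (the sphere function), purely for illustration.
sphere = lambda v: sum(x * x for x in v)
best, best_fit, history = run_ga(sphere, bounds=[(-5.0, 5.0)] * 3)
print(best_fit)
```

Because the best individual is copied into the next generation, the recorded best fitness can never get worse from one generation to the next, which corresponds to a "best so far" convergence curve.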

Representation:
In all Evolutionary Algorithm (EA) techniques, it is required to transfer the problem from its real domain to the domain of the EA. GAs offer different kinds of representations. Holland introduced the binary string representation [12]. Michalewicz showed that for real-valued numerical optimization problems, floating-point representations are more efficient and can lead to faster convergence to the optimal solution domain [21]. This representation scheme is closer to the real problem domain and can achieve higher performance and accuracy.
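The difference between the two representations can be illustrated with a short sketch. The helper `decode_binary`, the 8-bit gene length and the range [0, 10] are hypothetical choices made only for this example. A binary-coded gene has to be decoded into the parameter range and can only take a finite number of values, whereas a floating-point chromosome stores the parameter values directly.

```python
def decode_binary(bits, lo, hi):
    """Map a bit string such as '1011' linearly onto the interval [lo, hi]."""
    as_int = int(bits, 2)
    max_int = 2 ** len(bits) - 1
    return lo + (hi - lo) * as_int / max_int

# Binary-coded gene: 8 bits can represent only 256 distinct values in [0, 10].
gene_bits = "10000000"
decoded = decode_binary(gene_bits, 0.0, 10.0)
resolution = (10.0 - 0.0) / (2 ** len(gene_bits) - 1)  # spacing between values

# Floating-point chromosome: the genes are the parameter values themselves.
# These two numbers are placeholders, not estimates from the study.
float_chromosome = [4.9, 0.73]

print(decoded, resolution, float_chromosome)
```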
Genetic algorithms versus conventional search algorithms: One of the major advantages of GAs compared to conventional search algorithms is that they operate on a population of solutions rather than a single point. This makes GA results more robust, and the solutions they provide tend to be more global in nature. GAs are less likely to be trapped by local optima than Newton or gradient descent methods [22,23]. GAs require no derivative information about the fitness criterion [13,14], which is why they are suitable for both continuous and discrete optimization problems. In addition, GAs are less sensitive to the presence of noise and uncertainty in measurements [24,25]. There are some features which make genetic algorithms different from conventional search algorithms. Goldberg [13] stated that:
* Genetic Algorithms implement the search using a coding of the solutions, not the solutions themselves.
* Genetic Algorithms are based on a population of candidate solutions, not just a single solution.
* Genetic Algorithms evaluate individuals based on their fitness function, not the derivative of the function.
* Genetic Algorithms use probabilistic operators (i.e. crossover and mutation), not deterministic ones.

Problem formulation:
To see how these ideas are applied to function optimization, suppose without loss of generality that we want to minimize a function of n parameters $f(a_1, a_2, \ldots, a_n)$, where $f$ is a positive function and each parameter satisfies $a_i \in D_i$, with $D_i$ identified as the search space for that parameter. Candidate solutions are defined as n-dimensional vectors of parameters of the form $(a_1, a_2, \ldots, a_n)$, which can be viewed as "chromosomes", with the individual parameters as "genes". For each such vector of parameter values, its associated function value serves as its fitness, with lower values preferred for minimization problems.
The GA search process is based on using a population of individuals, each of which is evaluated based on its fitness value. Fitter individuals (those with lower function values in a minimization problem) are selected to produce offspring which inherit many but not all of the features of their parents. This is achieved using genetic operators like mutation and crossover [13,14].

Fitness function:
The evaluation criterion to measure the performance of the developed GA-based models is selected to be the Variance-Accounted-For (VAF). The VAF is calculated as:

$$\mathrm{VAF} = \left[1 - \frac{\mathrm{var}(y - \hat{y})}{\mathrm{var}(y)}\right] \times 100\%$$

where $y$ is the measured effort and $\hat{y}$ is the estimated effort.

Experimental results: Experiments have been conducted on a data set presented by Bailey and Basili [1] so that we can develop an effort estimation model. The dataset consists of two input variables, the developed lines of code (DLOC) and the methodology (ME), together with the measured effort. DLOC is given in kilo lines of code (KLOC) and the effort is in man-months. The dataset is given in Table 1. The data for the first 13 projects were used to estimate the model parameters and the other 5 projects were used for testing their performance.
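As an illustration of the evaluation criterion, the sketch below computes the VAF under the assumption that it follows the standard definition, VAF = [1 - var(y - ŷ) / var(y)] × 100, with y the measured and ŷ the estimated effort. The function name `vaf` and the sample values are hypothetical and are not taken from Table 1.

```python
from statistics import pvariance

def vaf(measured, estimated):
    """Variance-Accounted-For in percent (assumed standard definition)."""
    if len(measured) != len(estimated):
        raise ValueError("measured and estimated must have the same length")
    var_y = pvariance(measured)
    if var_y == 0:
        raise ValueError("measured values must not all be identical")
    residuals = [y - y_hat for y, y_hat in zip(measured, estimated)]
    # Using population variance for both terms; the 1/n factors cancel in
    # the ratio, so sample variance would give the same result.
    return (1.0 - pvariance(residuals) / var_y) * 100.0

# Made-up effort values in man-months, for illustration only.
measured = [10.0, 20.0, 40.0, 80.0]
estimated = [12.0, 18.0, 43.0, 76.0]
print(round(vaf(measured, estimated), 2))
```

A perfect estimate gives a VAF of 100, and the value decreases as the variance of the residuals grows relative to the variance of the measured effort.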

Effort model based on DLOC:
The COnstructive COst Model (COCOMO) was provided by Boehm [4,5]. This model structure is classified based on the type of projects to be handled. They include the organic, semidetached and embedded projects. This model structure comes in the following form:

$$\text{Effort} = a\,(\text{DLOC})^{b}$$

Normally the model parameters a and b are fixed for these models based on the software project type [4,5]. Our goal is to use GAs to provide a new estimate of the COCOMO model parameters. This will allow us to compute the development effort for the NASA software projects. A single set of estimated parameters is intended to generalize the effort computation across all projects. We used GAs to develop the following model.

$$\text{Effort} = 4.9067\,(\text{DLOC})^{0.7311} \qquad (2)$$

In Table 2, we show the actual measured effort over the given 18 projects and the effort estimated based on the GA model. The tuning parameters for the GA evolutionary process used to estimate the COCOMO model parameters, which include the population size, crossover and mutation types and selection mechanism, are given in Table 3. We used the GAOT Matlab Toolbox to produce our results [26]. The computed VAF criterion was 96.3138. Figures 1-3 show the measured and estimated GA effort, the convergence process for GAs (i.e. the best so far curve of the VAF) and the convergence of the GA model parameters after each generation.
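To make the GA-estimated model concrete, the sketch below evaluates Effort = 4.9067 (DLOC)^0.7311 for a few project sizes. The function name `ga_cocomo_effort` and the example sizes are hypothetical; the sizes are not entries of Table 1.

```python
def ga_cocomo_effort(dloc_kloc, a=4.9067, b=0.7311):
    """Effort in man-months for a size given in KLOC, using the GA-estimated model."""
    if dloc_kloc <= 0:
        raise ValueError("DLOC must be positive")
    return a * dloc_kloc ** b

# Illustrative sizes only; they are not taken from the NASA dataset.
for size in (5.0, 10.0, 50.0):
    print(f"{size:6.1f} KLOC -> {ga_cocomo_effort(size):7.2f} man-months")
```

Because the estimated exponent is below 1, the predicted effort grows more slowly than the size: doubling DLOC multiplies the estimate by 2^0.7311, roughly 1.66.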

Proposed effort models based on DLOC and ME:
To consider the effect of methodology (ME) as an element contributing to the computation of the software development effort, we propose two new model structures. We will call them model 1 and model 2.
They are variations of the famous COCOMO model. We now explore the modeling process of the proposed models and describe the mathematical equations for the two models. We proposed these models based on some theoretical aspects of the development of linear model structures. Adding the effect of ME is expected to improve the prediction quality of the model, as in model 1. It was also found that adding a bias term, as in regression models, helps to stabilize the model and reduce the effect of noise in measurements.

Model 1:
The proposed model structure considers the effect of ME as linearly related to the effort. The proposed model structure has three parameters a, b and c.

$$\text{Effort} = a\,(\text{DLOC})^{b} + c\,(\text{ME})$$
Our goal is to find the model parameters which are most suited to accurately estimate the software effort for project development. In Table 4, we show the actual measured effort and the estimated effort based on the proposed model 1 using the same dataset. The model parameters were estimated and the developed model was as follows:

Model 2:
Slightly better estimation capabilities were achieved using the developed model 1. This is why we decided to modify the model by adding a new bias parameter and to re-estimate the parameters of the resulting model, model 2, using GAs.
The proposed model 2 is given mathematically as follows:

$$\text{Effort} = a\,(\text{DLOC})^{b} + c\,(\text{ME}) + d \qquad (5)$$

The parameters a, b, c and d for model 2 were estimated using GAs as follows:

In Table 5, we show the actual measured effort and the estimated effort based on the proposed model 2. The tuning parameters for the GA evolutionary process, which include the search space for the model parameters, population size, crossover probability and mutation probability, are given in Table 6.
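The relationship between the three model structures can be sketched as follows. The function names and all parameter values are placeholders chosen for illustration and are not the estimates obtained in the study. The point of the sketch is that the structures are nested: model 2 with d = 0 reduces to model 1, and model 1 with c = 0 reduces to the COCOMO-type form.

```python
def cocomo_type(dloc, a, b):
    """Effort = a * DLOC^b."""
    return a * dloc ** b

def model_1(dloc, me, a, b, c):
    """Effort = a * DLOC^b + c * ME."""
    return a * dloc ** b + c * me

def model_2(dloc, me, a, b, c, d):
    """Effort = a * DLOC^b + c * ME + d (bias term d added)."""
    return a * dloc ** b + c * me + d

# Placeholder inputs and parameters, for illustration only.
dloc, me = 20.0, 30.0
a, b, c, d = 3.0, 0.8, -0.1, 2.0

print(cocomo_type(dloc, a, b))
print(model_1(dloc, me, a, b, c))
print(model_2(dloc, me, a, b, c, d))
```

Since each structure contains the previous one as a special case, a search over the larger parameter space can in principle fit the training data at least as well, although this does not by itself guarantee better results on the test projects.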

RESULTS
We developed two new model structures, as variations of the COCOMO model, to compute the effort required for each of the 18 projects. Our intention was to develop model structures which can generalize the effort computation for all projects under study.
Genetic Algorithms were used to estimate the COCOMO model parameters. Two models, model 1 and model 2, were provided. The prediction capabilities for the three models are shown in Table 7. From the table, it can be seen that taking into consideration the effect of ME helps to improve the computed VAF. The two proposed models successfully improved the performance of the estimated effort with respect to the VAF criterion.

CONCLUSION
In this study, we proposed two new model structures to estimate the software effort for projects sponsored by NASA using genetic algorithms. Modified versions of the famous COCOMO model were provided to consider the effect of methodology in effort estimation. The performance of the developed models was tested on the NASA software project data presented in [1]. The developed models were able to provide good estimation capabilities. We suggest the use of Genetic Programming (GP) techniques to build a suitable model structure for the software effort. GP may find a more advanced mathematical function of both the DLOC and ME such that the computed effort is more accurate.