Cross Validation Evaluation for Breast Cancer Prediction Using Multilayer Perceptron Neural Networks

Problem statement: The presence of metastasis in the regional lymph nodes is the most important factor in predicting prognosis in breast cancer. Many biomarkers have been identified that appear to relate to the aggressive behaviour of cancer. However, the nonlinear relation of these markers to nodal status and the existence of complex interactions between markers have prevented an accurate prognosis. Approach: The aim of this study is to investigate the effectiveness of a Multilayer Perceptron (MLP) for predicting breast cancer progression using a set of four biomarkers of breast tumors. The biomarkers include DNA ploidy, cell cycle distribution (G0G1/G2M), steroid receptors (ER/PR) and S-Phase Fraction (SPF). A further objective of the study is to explore the predictive potential of these markers in defining the state of nodal involvement in breast cancer. Two methods of outcome evaluation, viz. stratified and simple k-fold Cross Validation (CV), are studied in order to assess their accuracy and reliability for neural network validation. Criteria such as output accuracy, sensitivity and specificity are used for selecting the best validation technique and for evaluating the network outcome for different combinations of markers. Results: The results show that stratified 2-fold CV is more accurate and reliable than simple k-fold CV, as it obtains a higher accuracy and specificity and also provides a more stable network validation in terms of sensitivity. The best prediction results are obtained by using a single marker, SPF, which obtains an accuracy of 65%. Conclusion/Recommendations: Our findings suggest that MLP-based analysis provides an accurate and reliable platform for breast cancer prediction given that an appropriate design and validation method is employed.


INTRODUCTION
Breast cancer has been identified as the most widespread cancer amongst women and also the major cause of female cancer death all over the world (Etchells and Lisboa, 2006). An important factor influencing the breast cancer mortality rate is the efficacy of treatment intervention which in turn is influenced by the stage and accuracy of prognosis. Hence, accurate prognosis in patients with early stage breast cancer is of significant importance to reduce mortality rate.
Several prognostic factors including patient age, tumor size, tumor grade, DNA content (ploidy) and receptor status have been identified for nodal metastasis prediction with the hope to avoid axillary lymph node dissection (Lyman et al., 2005). However, no individual or combination of these prognostic factors has replaced nodal dissection for node status determination (Giuliano et al., 1997).
Amongst prognostic markers, those that can be obtained via minimally invasive methods are preferred for determining nodal status and survival prediction so as to minimize patient morbidity along with mortality. Several studies have investigated different prognostic factors in an effort to define the prognostic value of these markers and to find an optimal combination of markers which can be used as an accurate and reliable predictor for breast cancer prognosis. However, the complex interaction of these markers with nodal status and survival rate, together with the inter-relation between the markers, has prevented accurate predictions using these markers.
Multivariate statistical methods have been widely used to investigate the predictive significance of prognostic factors. These multivariate models mainly include logistic regression (Hosmer and Lemeshow, 2000). However, these methods have several inadequacies that cast doubt on their reliability. Concato et al. (1993) investigated the problems of multivariate analysis in medical research. The reported problems include overfitting of data, failure to consider the inter-relation between markers and unspecified methods of selection among candidate markers, all of which call for improvement in medical research that relies on these multivariate statistical methods. Multivariate regression methods are also prone to over-optimistic results, which lead to misleading interpretations of the prognostic value of the investigated markers (Altman and Lyman, 1998).
Another approach that has been widely used for cancer prognosis is the Artificial Neural Network (ANN) (Schwarzer et al., 2000; Ahmed, 2005; Kaur and Wasan, 2006; Ashidi et al., 2007). ANNs are parallel processing structures consisting of basic processing units (neurons) which are interconnected by weighted links. ANNs have the ability to learn patterns existing in data and hence perform classification and prediction for new data. There are different types of ANNs depending on their structure and learning process. The connections between neurons can be formed in different directions. In feed forward ANNs, all connections are set up in one direction, from the network's input towards the output. In addition, the learning process can be supervised or unsupervised depending on whether or not the input data is associated with known outputs during learning.
ANN has been confirmed as a robust method for the aim of cancer prognosis (Burke et al., 1994). It is also superior to conventional methods employed for breast cancer prediction such as Tumor, Node, Metastasis (TNM) staging system and logistic regression (Burke et al., 1997). One of the main advantages of ANNs over conventional methods is their ability in capturing the complex and nonlinear interaction between prognostic markers and the outcome to be predicted. They also enable taking into account the inter-relation between markers which can significantly improve the prognosis in oncology.
An ANN can have different structures based on the type of its input-output data and also its application. Among available structures, Multilayer Perceptron (MLP) has been widely used for the aim of cancer prediction and prognosis (Schwarzer et al., 2000). MLP is a class of feed forward neural networks which is trained in a supervised manner to become capable of outcome prediction for new data (Haykin, 2009).
In this study, three cellular markers, namely DNA ploidy, S-Phase Fraction (SPF) and cell cycle distribution, in addition to a molecular marker, the state of steroid receptors including Estrogen and Progesterone Receptors (ER/PR), have been employed for nodal status prediction in breast cancer. The aim of the study is to employ a MLP neural network as a platform to predict the state of nodal involvement based on the four cellular and molecular biomarkers. This study also investigates the predictive accuracy of individual biomarkers in order to define their impact on outcome prediction in breast cancer. In addition, the relation between the mentioned cellular and molecular markers will be explored. We will also illustrate the capability of MLP in capturing both the linear and nonlinear relationships between the above markers and breast cancer outcome. Finally, the efficiency of stratified and simple k-fold Cross Validation (CV) in validating the MLP outcome for cancer prediction is investigated.
The study is organized as follows: the next section explains the breast cancer dataset used in this study and the roles of the biomarkers. Materials and methods include the MLP structure employed for cancer prediction, the validation method for assessing the designed network and a brief description of Pearson's correlation coefficient, whose results are later used to compare with and validate the results obtained by the MLP. Following that, results and discussion are elaborated. Finally, the findings of the study are presented in the conclusion.

Breast cancer dataset:
The data utilised for nodal involvement analysis contain information on four cellular and molecular breast tumor biomarkers from 46 patients who had been diagnosed with a carcinoma or benign breast tumor. The biomarkers include DNA ploidy, cell cycle distribution (G0G1/G2M), Steroid Receptors (ER/PR) and S-Phase Fraction (SPF). Nodal status in terms of cancer metastasis to regional lymph nodes has been defined as an outcome for all 46 patients. DNA aneuploidy is a state in which abnormal sets of chromosomes exist within the nucleus and is considered an indicator of tumor malignancy. The degree of DNA ploidy is calculated based on the Integrated Nuclear Density (IND) measurement which is obtained by staining the aspirated tumor cells. Many studies have investigated the role of DNA ploidy of cancer cells in cancer prognosis. The results demonstrate that this marker is highly associated with relapse of the disease (Yuan et al., 1992), reduced survival time (Azua et al., 1997), metastasis to regional lymph nodes and early death (Gilchrist et al., 1993). In addition, aneuploidy has been identified as a significant prognostic biomarker for breast, prostate and endometrial cancer prognosis (Moureau-Zabotto et al., 2005; Suehiro et al., 2008; Pretorius et al., 2009).
However, some studies have found DNA ploidy uncorrelated with breast cancer prognosis (Naguib et al., 1999). Moreover, some studies suggest that DNA ploidy is a consequence of premature cells entering the S-phase, which would explain the close correlation between aneuploidy and the size of SPF. Nevertheless, Sherbet and Lakshami (Naguib and Sherbet, 2001) have found them totally uncorrelated.
The pattern of cell cycle distribution is defined by the G0G1/G2M ratio (ratio of the number of the cells in G0G1 phase over the number of the cells in G2M phase) which is measured by ICM (Anderson et al., 2003). The fraction of cycling cells in the tumor has proven to be an effective factor in the response of the carcinoma to chemotherapy (Remvikos et al., 1989). The size of the proliferative fraction is also known to be a good prognostic feature (Kallioniemi et al., 1988). Cell cycle distribution can be measured from DNA profiles derived from flow cytometry which is also considered as a reliable method for SPF measurement (Naguib et al., 1999). The reliability of SPF estimation depends upon the differentiation clarity of the G0G1 and G2M parts of the cell cycle distribution diagram.
Many studies have shown that the status of hormone receptors of breast cancer cells provides useful information for cancer prognosis and treatment (Anderson et al., 2003; Grey et al., 2003; Esteva and Hortobagyi, 2004). The steroid receptors considered in this study include Estrogen Receptor (ER) and Progesterone Receptor (PR). Estrogen is a hormone with growth stimulating ability in a variety of target tissues. It binds to estrogen receptors which are then transmitted to the nucleus where they instigate the transcription of responsive genes and lead to appropriate physiological function. Estrogen and progesterone hormones can initiate the transcription of some target genes related to cell differentiation and proliferation (Phippard et al., 1996). Tumors that are receptor positive respond well to treatment with anti-estrogens. The absence of ER in breast cancer is therefore considered a sign of poor prognosis, since these patients cannot benefit from anti-estrogen therapy. ER absence in breast cancer is caused by ER gene silencing resulting from hypermethylation (Grey et al., 2003). The role of PR positivity in breast cancer is less significant. Normally, ER positive cancers are also PR positive, but there would be a poor prognosis for PR positive tumors that are not ER positive.
The size of the SPF indicates the percentage of cells in the stage of DNA replication in the cell cycle and it is a validated marker for estimating the proliferative rate of tumor cells (Clark et al., 1989). SPF is also recognized as an independent prognostic factor in breast cancer (Bae et al., 2007; Gazic et al., 2008). A complete procedure of SPF measurement is described by Naguib et al. (1999).
Except for ER/PR, which takes discrete values, the other markers are continuous within different ranges. Nodal status is defined as either 0 or 1 for the case of no node involved or metastasis to the regional lymph nodes, respectively. Tables 1 and 2 show some descriptive statistics for continuous and discrete markers respectively.
All the mentioned markers are established as effective markers in breast cancer prognosis in the medical context. However, the efficiency of the combination of these markers and also their inter-relation is further investigated in this study. In addition, the data feature vectors for the two output groups are plotted in the form of scatter plots in Fig. 1. Each scatter plot in Fig. 1 shows two feature vectors on two axes with the two output groups shown by "x" and "o" for no nodal metastasis and nodal metastasis respectively.
The scatter plots in Fig. 1 show that the data feature vectors are not linearly separable in the 2-dimensional space.

MLP:
ANNs are a class of artificial intelligence methods commonly used for classification and pattern recognition. A MLP is a type of ANN which consists of a set of interconnected artificial neurons connected only in a forward manner to form layers. One input, one or more hidden and one output layer are the layers forming a MLP. A MLP with one hidden layer and its connections is illustrated in Fig. 2.
An artificial neuron is the basic processing element of a neural network and consists of a linear combiner followed by a transfer function. The neuron's output (o) is computed by passing the weighted sum of the neuron's inputs through a transfer function φ(.). This can be formulated in Eq. 1 as:
$$o = \varphi\left(\sum_{i=1}^{m} w_i \upsilon_i + b\right) \qquad (1)$$
where $\upsilon_i$ is the i-th external input, m is the total number of inputs of the neuron, $w_i$ is the weight of the connection linking the i-th input to the neuron and b is the bias of the neuron. A hyperbolic tangent transfer function has been chosen in this paper for its special properties such as symmetry and monotonicity. The hyperbolic tangent transfer function can be represented in Eq. 2 as:
$$\varphi(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2)$$
The simplest form of trainable neural network, first developed by Rosenblatt (1959), is composed of two layers of nodes, namely the input and output layers. A mapping between the input and output data can be established by assigning weights to the input numerical data during training. More complicated MLPs, which are commonly used, contain one or more hidden layers in addition to the input and output layers. These hidden layers enable the MLP to extract higher order statistics from a set of given data and hence capture the complex relationship between input and output data. Therefore, MLPs commonly consist of an input layer whose number of nodes is defined by the size of the input vector, one or more hidden layers which can have a variable number of nodes depending on the application and an output layer which has one or more nodes depending on the number of output classes. Connections between these layers are defined by weights which are assigned in a supervised learning process so that the neural network responds correctly to new data. This is done via a training algorithm, in which a cost function is computed by comparing the network's output with the desired output and is then minimized with respect to the network parameters.
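As an illustration of the neuron model and the layered forward connections described above, the following Python sketch computes the output of a single neuron with a hyperbolic tangent transfer function and chains such neurons into a network with one hidden layer. The function names, the use of a single bias per neuron and the example weights are assumptions made for illustration only; they are not the parameters of the network designed in this study.

```python
import math


def neuron_output(inputs, weights, bias):
    """Linear combiner followed by a hyperbolic tangent transfer function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of the inputs plus the neuron's bias.
    activation = sum(w * v for w, v in zip(weights, inputs)) + bias
    return math.tanh(activation)


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass through one hidden layer and a single output neuron."""
    hidden = [
        neuron_output(inputs, w, b)
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return neuron_output(hidden, output_weights, output_bias)


# Hypothetical example: four input markers, two hidden neurons, one output.
example_inputs = [0.2, -0.5, 1.0, 0.3]
example_hidden_weights = [[0.1, 0.4, -0.3, 0.2], [-0.2, 0.1, 0.5, -0.4]]
example_hidden_biases = [0.0, 0.1]
example_output = mlp_forward(
    example_inputs, example_hidden_weights, example_hidden_biases, [0.7, -0.6], 0.05
)
print(example_output)
```

Because the transfer function is a hyperbolic tangent, every neuron output in this sketch lies strictly between -1 and +1.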
In this study, the Scaled Conjugate Gradient (SCG) algorithm is employed as a supervised training algorithm for the MLP. The SCG algorithm, proposed by Moller (1993), belongs to the class of conjugate gradient optimization techniques applied for training feed forward neural networks. Conjugate gradient techniques are iterative optimization algorithms in which the minimum of an error function is located by proceeding in a direction on the error surface that is conjugate to the previous step. This is an advantage over standard back propagation, in which the algorithm proceeds only in a downward direction on the error surface and therefore one step is partially undone by the next step.
SCG, like other training algorithms for feed forward networks, consists of a forward and a backward pass. In the forward pass, an error is computed by comparing the network's output with the desired output, and this error is then fed to a cost function. A Mean Square Error (MSE) cost function is chosen in this work, defined in Eq. 3 as:
$$E = \frac{1}{N}\sum_{p=1}^{N}\sum_{j}\left(t_{pj} - o_{pj}\right)^{2} \qquad (3)$$
where the MSE cost function E is the mean of the squared error over the total number of patterns, denoted by N, and $t_{pj}$ and $o_{pj}$ are the desired output and the network's output respectively at output node j for the p-th input pattern.
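A minimal Python sketch of the MSE cost for a network with a single output node, as used later in this study, is given below. The function name and the example values are illustrative assumptions.

```python
def mean_squared_error(targets, outputs):
    """Mean of the squared differences between desired and network outputs."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and equally long")
    squared_errors = [(t - o) ** 2 for t, o in zip(targets, outputs)]
    return sum(squared_errors) / len(targets)


# Hypothetical desired outputs (nodal status) and network outputs for 4 patterns.
print(mean_squared_error([1.0, 0.0, 1.0, 0.0], [0.8, 0.1, 0.6, 0.3]))
```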
During the backward pass, the network parameters, i.e., the weights and biases, are updated using the second order partial derivatives of the cost function. The matrix of these derivatives is called the Hessian matrix and is computed in Eq. 4 as:
$$H = \frac{\partial^{2} E}{\partial W^{2}} \qquad (4)$$
where E is the cost function and the vector W indicates the network parameters. Using second order derivatives enables the network to predict the next input pattern more accurately. The Hessian matrix provides additional information related to the curvature of the cost function and hence results in faster and more accurate convergence to the minimum compared to first order techniques such as standard back propagation, which uses first order derivatives only. The update of the network parameters (i.e., weights and biases) is then performed by changing the weight vector length and direction by Eq. 5:
$$w_{l+1} = w_l + \alpha_l d_l \qquad (5)$$
where $\alpha_l$ and $d_l$ define the step size and search direction at step l respectively. The search direction at each step is chosen such that it does not have any component parallel to the previous search direction. The step size in each step is defined in Eq. 6 as:
$$\alpha_l = -\frac{d_l^{T} g_l}{d_l^{T} H d_l} \qquad (6)$$
where the error surface gradient at step l is defined as $g_l = \nabla E(w_l)$. In the conjugate gradient algorithm, the step size is found by performing a line-search rather than by evaluating the Hessian matrix explicitly (Bishop, 1995). However, the high computational cost of line-search is an issue in the conjugate gradient algorithm. In order to reduce this computational cost in SCG, $\alpha_l$ is computed by evaluating the product $H d_l$. This is viable by online estimation of the Hessian matrix eigenvectors (LeCun et al., 1993). In this approach, the product of the Hessian matrix with an arbitrary vector d is computed without computing the full Hessian in each step. To ensure that the Hessian in Eq. 6 is a positive definite matrix, it can be replaced by a modified version which is defined in Eq. 7 as:
$$\tilde{H} = H + \beta I \qquad (7)$$
where β is a positive coefficient defined such that the new Hessian $\tilde{H}$ is positive definite and I represents the identity matrix.
The training process consists of several passes of information through the network, called training iterations. Training completes only when one of the predefined stopping criteria is met. These criteria vary depending on the type of network and the training algorithm. In this study, a minimum gradient magnitude and a maximum number of iterations are employed in conjunction as the network's stopping criteria in order to avoid overfitting and to provide a good generalization performance for the network.
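The weight update, the Hessian-vector product and the two stopping criteria described above can be sketched in Python on a small quadratic error surface. This is a simplified conjugate gradient loop under stated assumptions, not the full SCG algorithm of Moller (1993): the error surface E(w) = 0.5 wᵀAw - bᵀw, the finite-difference approximation of the Hessian-vector product, the Fletcher-Reeves formula for the next search direction and the fixed coefficient `beta` are all choices made for illustration, whereas SCG adapts its scaling coefficient during training.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def grad(w, A, b):
    """Gradient of the assumed error surface E(w) = 0.5 w^T A w - b^T w."""
    n = len(w)
    return [sum(A[i][j] * w[j] for j in range(n)) - b[i] for i in range(n)]


def hessian_vector_product(w, d, A, b, eps=1e-6):
    """Approximate H d from two gradient evaluations, without forming H."""
    g0 = grad(w, A, b)
    g1 = grad([wi + eps * di for wi, di in zip(w, d)], A, b)
    return [(x1 - x0) / eps for x0, x1 in zip(g0, g1)]


def conjugate_gradient(w, A, b, beta=0.0, max_iter=50, min_grad=1e-8):
    g = grad(w, A, b)
    d = [-gi for gi in g]  # first search direction: steepest descent
    for _ in range(max_iter):  # stopping criterion: maximum iterations
        if dot(g, g) ** 0.5 < min_grad:  # stopping criterion: minimum gradient
            break
        Hd = hessian_vector_product(w, d, A, b)
        # Curvature along d using the modified Hessian H + beta * I.
        curvature = dot(d, Hd) + beta * dot(d, d)
        alpha = -dot(d, g) / curvature  # step size
        w = [wi + alpha * di for wi, di in zip(w, d)]  # weight update
        g_new = grad(w, A, b)
        gamma = dot(g_new, g_new) / dot(g, g)  # Fletcher-Reeves (assumption)
        d = [-gn + gamma * di for gn, di in zip(g_new, d)]
        g = g_new
    return w


# Hypothetical positive definite curvature matrix and linear term.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient([2.0, 1.0], A, b))
```

On this quadratic surface the loop approaches the minimizer of E, which solves Aw = b. Increasing `beta` shortens each step, which mirrors the role of the positive coefficient added to the Hessian.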

K-fold cross validation:
After training, the network's performance is evaluated by a test process through which the network's classification outcome is computed using a new set of data fed to the input layer. Hence, the available dataset is initially divided into two parts which will be used for training and test independently. Random division of the data into two parts is commonly used for the training/test data division. However, this might not result in a reliable evaluation of the network for a small dataset as a part of the data is only reserved for the test purpose. Moreover, the random division might bring about training/test datasets with different proportions of output classes. This especially happens in dataset with imbalanced output classes.
In k-fold CV, the dataset is divided into k independent folds, where k-1 folds are used to train the network and the remaining one is reserved for the test purpose. This procedure is then repeated until all folds have been used once as a test set. The final output of the network is then computed by averaging the accuracy obtained from each test set. We will refer to k-fold CV as "simple k-fold CV" to differentiate it from the stratified k-fold CV.
Stratified k-fold CV is a special type of k-fold CV where the data folds are chosen such that each fold contains nearly the same proportion of the output classes. Both stratified and simple k-fold CV are evaluated in this study using different numbers of data folds to find an optimum evaluation method for the in-hand dataset.
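The difference between the two partitioning schemes can be sketched in Python as follows. The function names and the class split of the example labels (30 node-negative and 16 node-positive cases out of 46) are hypothetical and serve only to illustrate the procedure; they are not the class proportions of the dataset used in this study.

```python
import random


def simple_k_fold(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds, ignoring labels."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_k_fold(labels, k, seed=0):
    """Deal the indices of each class into k folds so proportions are kept."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[(offset + position) % k].append(idx)
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical labels: 0 = no nodal involvement, 1 = nodal metastasis.
labels = [0] * 30 + [1] * 16
for name, folds in (("simple", simple_k_fold(len(labels), 2)),
                    ("stratified", stratified_k_fold(labels, 2))):
    positives = [sum(labels[i] for i in fold) for fold in folds]
    print(name, [len(fold) for fold in folds], positives)
```

In the stratified variant the number of positive cases per fold differs by at most one, whereas the simple variant leaves the class balance of each fold to chance.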
Correlation coefficient: Correlation coefficient is a measure of dependence between two variables. In this study, Pearson's correlation coefficient (r) is used as a measure of linear relationship between different markers and the cancer outcome. Pearson's correlation coefficient can be obtained for two variables A and B by normalizing their covariance with respect to their standard deviations $\sigma_A$ and $\sigma_B$ as in Eq. 8:
$$r = \frac{E\left[(A - \mu_A)(B - \mu_B)\right]}{\sigma_A \sigma_B} \qquad (8)$$
where $\mu_A$ and $\mu_B$ are the expected values of the two random variables A and B and E denotes the expected value of a random variable. Pearson's correlation coefficient assigns a number between -1 and +1 as the measure of linear dependence between variables. A positive value represents a positive linear relationship, a negative one implies a negative linear relationship and 0 suggests no linear relation between the variables.
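Pearson's correlation coefficient as defined above can be computed with the short Python sketch below, in which expected values are replaced by sample means. The function name and the example vectors are illustrative assumptions and are not taken from the dataset.

```python
import math


def pearson_r(a, b):
    """Covariance of a and b normalized by their standard deviations."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (std_a * std_b)


# Hypothetical marker values: the relation is exactly quadratic, yet r is zero.
marker = [-2.0, -1.0, 0.0, 1.0, 2.0]
outcome = [x ** 2 for x in marker]
print(pearson_r(marker, outcome))
```

The example illustrates why a coefficient near zero rules out only a linear relation: the outcome depends entirely on the marker, but through a nonlinear function.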

RESULTS
The designed MLP in this study consists of an input layer and one hidden layer, each with a variable number of nodes depending on the number of input markers, and an output layer with one neuron. The network is fed with a different combination of markers in each run to investigate the predictive significance of each marker. Hence, the number of input neurons is defined by the number of markers and the number of hidden neurons is optimized for each marker combination. The network is then trained using the SCG algorithm and validated with k-fold CV.
The network's outcome is classified into four groups depending on the desired output. A True Positive (TP) outcome denotes a cancer case classified correctly while a False Negative (FN) implies a cancer case incorrectly classified as normal. Accordingly, True Negative (TN) and False Positive (FP) stand for the normal cases classified correctly and incorrectly respectively. The network is thus evaluated by computing its accuracy, sensitivity and specificity, defined in Eq. 9-11 as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (10)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (11)$$
Stratified and simple k-fold CV are first run on the designed network to predict the outcome using all input markers. These results are then analyzed to choose the best validation method with which to further investigate the network's prediction accuracy and the markers' significance in outcome prediction.
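The three evaluation criteria can be computed from binary desired outputs and binary network decisions as in the following Python sketch. The function names and the example data are illustrative assumptions, and the sketch takes already thresholded predictions, since the rule used to convert the network output to a class is not specified here.

```python
import math


def confusion_counts(targets, predictions):
    """Count TP, TN, FP and FN for binary labels (1 = cancer, 0 = normal)."""
    tp = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(targets, predictions) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(targets, predictions) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return NaN when a criterion is undefined (no cases of that class)."""
    return numerator / denominator if denominator else math.nan


def evaluate(targets, predictions):
    tp, tn, fp, fn = confusion_counts(targets, predictions)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
    }


# Hypothetical test fold with two cancer cases and three normal cases.
print(evaluate([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

If a test fold contains cases of only one class, one of sensitivity or specificity has a zero denominator and is reported as undefined in this sketch.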

Results of K-fold cross validation analysis:
The output results of the designed network using different numbers of folds for stratified and simple k-fold CV are illustrated in Fig. 3-5. Considering the network accuracy, sensitivity and specificity obtained with the different stratified and simple k-fold CVs illustrated in Fig. 3-5, stratified CV is preferred over simple CV as it obtains better and more reliable results. Moreover, investigating the output results for different values of k shows that 2-fold CV is a better choice for network validation with the in-hand dataset. Hence, the MLP results are evaluated using a stratified 2-fold CV.

Results of correlation coefficient and MLP analysis:
The results for Pearson's correlation coefficient computed for all 2-member possible combinations of the set including the input markers and the output are presented in Table 3. The cross section of each row and column in Table 3 shows the coefficient between the associated variables. The table illustrates a symmetric matrix with a diagonal of 1 as the Pearson's correlation coefficient is the same between variable A and B and vice versa and is 1 for two identical variables. Results from Table 3 suggest significant linear relation between DNA ploidy and ER/PR (p = 0.05). The degree of linear dependence of SPF and DNA ploidy (p = 0.06) and G0G1/G2M and SPF (p = 0.07) is also noticeable. Nonetheless, there is no significant linear relation between other markers and the output (p>0.1). These results however, do not necessarily provide any indication about the existence of any nonlinear interaction between the different markers and the output.
The MLP results are obtained using different combinations of the mentioned four markers in the form of 3, 2 and 1-member marker sets and also for the full marker set. The best classification results based on inputs including groups of 4, 3, 2 and 1 biomarkers are included in Table 4. The first column in Table 4 shows the markers used in the combination while the other columns represent the obtained sensitivity, specificity and accuracy in percentage.
Results in Table 4 show that the prediction accuracy obtained using all markers remains virtually unchanged when a 3 or 2-marker set is used instead. This can be explained by the interaction between the markers. Removing DNA ploidy from the set of all markers results in the same accuracy, sensitivity and specificity. In addition, removing SPF from the set including ER/PR, SPF and G0G1/G2M results in a higher specificity at the cost of reduced sensitivity, but the accuracy remains unchanged.
From Table 3, no significant linear relation could be found between the individual markers and the output. However, in spite of the lack of linear relation between them, the higher predictive accuracy provided by SPF alone compared to other combinations proves the existence of strong nonlinear relation between SPF and output captured by the MLP.

DISCUSSION
A good deal of research conducted in the field of breast cancer prognosis has led to the identification of many new prognostic markers. However, besides exploring novel markers, finding the relationship between the new markers and those previously used, along with the additional information they can provide, is of great importance. Therefore, a reliable prediction system that is capable of predicting cancer progression on the basis of the tumor markers and can also define the predictive accuracy of these markers is in high demand. In the search for the best prediction models, many research studies have confirmed ANN as a good modeling approach for cancer diagnosis and prognosis (Hudson and Cohen, 2000).
This study has presented an artificial neural network based method to define the predictive accuracy of the features or subsets of features in breast cancer prognosis in terms of nodal status prediction. The final network structure is a three-layered network trained using a SCG algorithm. Although a single perceptron can perform nonlinear classification, there is no evidence that it can realize the optimal decision boundary and it has poor ability to generalize to unseen data. On the other hand, MLP has been proven to realize the optimal decision boundary and has the ability to generalize well to unseen data (Hornik et al., 1989). Finally, the designed network is evaluated using different numbers of folds in stratified and simple k-fold CV.
The results show that stratified 2-fold CV is a more accurate and reliable method, as it obtains a higher accuracy and specificity and also provides a more stable network validation in terms of sensitivity. This can be explained by the same proportion of the output data existing in each group (fold) in stratified CV. When simple CV is used to partition the data into k folds, one fold may contain only one output class. This gives rise to biased output accuracy, as the network is tested with only one group of outputs in the test set. This is rectified in stratified CV by having a balanced proportion of output groups in each fold.
The low variance and high accuracy of stratified 2-fold CV in small sample sizes has been confirmed for k-nearest neighbor classifiers (Weiss, 1991). This is also observed for the MLP used for the breast cancer data in this study, as the stratified 2-fold CV obtains higher accuracy and specificity compared to simple CV and to other numbers of folds in k-fold CV.
In addition, stratified CV shows more consistent results compared to simple CV, especially for sensitivity. Although the sensitivity achieved by the simple 2-fold CV is higher than that of stratified 2-fold CV, the latter is chosen as it is more reliable.
All three marker combinations including 4, 3 and 2 markers include ER/PR. This shows the important role of including ER/PR as an individual marker in nodal involvement prediction. Amongst 3-marker input combinations, the arrangement including ER/PR, SPF and cell cycle distribution results in the best output accuracy, which indicates the efficiency of this pattern for accurate prediction of nodal involvement. Between 2-marker combinations of ER/PR with other markers, the amalgamation with steroid receptors ends in the same accuracy achieved in the case of including all 4 biomarkers in the input, which verifies the previous assumption about the efficiency of this combination for accurate prediction.
Pearson's correlation coefficient shows almost no linear relation between G0G1/G2M and nodal status outcome. ER/PR and G0G1/G2M are also hardly correlated linearly, based on the correlation coefficient results. However, the combination including ER/PR and G0G1/G2M provides a prediction as accurate as those results obtained by using all markers. These findings confirm the ability of the designed MLP in capturing nonlinear relations between these markers and the nodal status outcome.
Leaving DNA ploidy out of the network inputs does not cause any variation in classification accuracy. This can be explained by the close relation between G0G1/G2M and DNA ploidy. Since DNA ploidy is determined based on the percentage of cells being in the G0G1 phase of the cell cycle, it can be considered as an aspect of cell cycle distribution. Therefore, the inclusion of cell cycle distribution seems to compensate for the lack of DNA content information. It is worth noting, however, that the best prediction results are obtained by using only one marker, SPF. This confirms the predictive significance of this marker and also the negative correlation of markers in some cases, which results in a lower predictive outcome using all the available markers.

CONCLUSION
This study presents an evaluation of four cellular and molecular breast cancer markers for the purpose of nodal status prediction using a MLP neural network.
The main aim of the study is to investigate the ability of the neural network to capture the nonlinear interaction of these markers and nodal status in breast cancer. We have also assessed the effectiveness of stratified and simple k-fold CV for MLP outcome evaluation for a breast cancer dataset containing a limited number of samples. The results confirm the superiority of stratified 2-fold CV over simple k-fold CV, especially for a limited number of samples. The ability of the neural network to extract the complex patterns existing in breast cancer tumor markers is further confirmed in this study.