Cerebrovascular Accident Attack Classification Using Multilayer Feed Forward Artificial Neural Network with Back Propagation Error



INTRODUCTION
The word stroke (Cerebrovascular accident attack) refers to a clinical syndrome of presumed vascular origin, typified by rapidly developing signs of focal or global disturbance of cerebral functions lasting more than 24 h or leading to death.
Usually, when there is an attack, the chances of successful treatment depend essentially on early diagnosis. In the early stage, many cases of stroke are curable if properly managed. In the case of a false or wrong diagnosis, precious time is lost, scarce health and social services resources are wasted and the chances of recovering from the attack diminish, which may eventually lead to the death of the patient. Stroke has been categorized by Mosalov et al. (2007) as ischemic stroke, hemorrhagic stroke and subarachnoid hemorrhage, and the medical treatment varies in each case. In practice, the proportion of medical errors in diagnosing the stroke type is 20-45%, even for experienced doctors. According to Jehangir and Rehman (2005), differentiating cerebral infarction from cerebral hemorrhage is the most important first step in the management of acute stroke, as the clinical management of the two disorders differs substantially. In most developed countries, diagnosis is easily obtained by CT scanning, which allows the accurate distinction of hemorrhagic and ischemic types. However, quick access to CT scanning is not available in every country and hospital. It is well known that some clinical data may suggest a hemorrhagic or ischemic stroke, even though no data are specific enough to allow a reliable diagnosis.
A number of scoring systems based on clinical data, which determine the relative likelihood of infarction or hemorrhage, were developed and tested over the last decade. Although the clinical diagnoses made using these scores seem more accurate than those made by physicians, they present several problems. The scope of neurovisualization methods in stroke diagnosis is also limited. Thus, the development of intelligent computer support systems could help to decrease the proportion of medical errors. In this study, an attempt is made to apply Neural Networks to the classification of Cerebrovascular Accident Attack (stroke) in patients. A Neural Network has the ability to mimic this type of decision-making process and to use a knowledge base of information and a training set of practice cases to learn to diagnose diseases.
The behavior of an artificial neuron is inspired by the assumed behavior of a real neuron in organic networks (Veelenturf, 1995). A simplified model of a real neuron is composed of a cell body or soma, a set of fibers entering the cell body, called the dendrites, and one special fiber leaving the soma, called the axon. The simplified model of a neuron can be simulated by an artificial neuron, Fig. 1 (Krose and Smagt, 1996; Mitchell et al., 1990). The perceptron is the basic processing element. It has inputs that may come from the environment or may be the outputs of other perceptrons.
Associated with each input $x_j \in \mathbb{R}$, $j = 1, 2, \ldots, n$, is a connection weight or synaptic weight $w_j \in \mathbb{R}$, and the output $y$ in the simplest case is a weighted sum of the inputs (Eq. 1):
$$y = \sum_{j=1}^{n} w_j x_j + w_0 \qquad (1)$$
Here $w_0$ is the intercept value that makes the model more general; it is generally modeled as the weight coming from an extra bias unit, $x_0$, which is always +1.
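As a minimal sketch of the weighted sum described above, the following Python fragment computes the output of one linear unit in two equivalent ways: with an explicit intercept, and with the intercept stored as the weight of an extra bias input fixed at +1. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def weighted_sum(inputs, weights, w0):
    """Return y = sum_j w_j * x_j + w0 for one linear unit."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + w0


def weighted_sum_with_bias_unit(inputs, extended_weights):
    """Same output, with w0 stored as extended_weights[0] and x0 = +1."""
    extended_inputs = [1.0] + list(inputs)
    return sum(w * x for w, x in zip(extended_weights, extended_inputs))


x = [0.5, -1.0, 2.0]   # hypothetical input values
w = [0.4, 0.3, -0.2]   # hypothetical connection weights
w0 = 0.1               # hypothetical intercept

y_explicit = weighted_sum(x, w, w0)
y_bias_unit = weighted_sum_with_bias_unit(x, [w0] + w)
```

Both calls evaluate the same expression, which is why the intercept can be treated as just another weight.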

MATERIALS AND METHODS
Multi-layer perceptron artificial neural networks with the error back-propagation method are feed-forward nets with one or more layers of nodes between the input and output nodes. These additional layers contain hidden units or nodes that are not directly connected to both the input and output nodes. The multilayer perceptron with back-propagation learning is perhaps the most common paradigm for supervised neural network computing to date. This has been observed in the medical imaging area as well as in many other pattern recognition areas.
ANNs can be viewed as weighted directed graphs in which artificial neurons are nodes and directed edges with weights are connections between neuron outputs and neuron inputs (Jain et al., 1996). An artificial network consists of the following components, as discussed in Rumelhart and McClelland (1986).

Processing units:
Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i), which receive data from outside the neural network; output units (indicated by an index o), which send data out of the neural network; and hidden units (indicated by an index h), whose input and output signals remain within the neural network (Ethem, 2004; Mesut and Bob, 2004). During operation, units can be updated either synchronously or asynchronously (Krose and Smagt, 1996; Haykin, 1999).

Connections between units:
In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit $k$ is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term $\theta_k$ (Eq. 2):
$$s_k(t) = \sum_j w_{jk}(t)\, y_j(t) + \theta_k(t) \qquad (2)$$
The contribution for positive $w_{jk}$ is considered as an excitation and for negative $w_{jk}$ as an inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs.
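The total input to a unit can be sketched in a few lines of Python. The helper names below (`net_input`, `split_contributions`) and the numbers are hypothetical; the second helper only separates the terms with positive weights (excitation) from those with negative weights (inhibition), under the additive rule described above.

```python
def net_input(outputs, weights_to_k, theta_k):
    """Total input s_k = sum_j w_jk * y_j + theta_k for a single unit k."""
    return sum(w * y for w, y in zip(weights_to_k, outputs)) + theta_k


def split_contributions(outputs, weights_to_k):
    """Return (excitation, inhibition): sums of the terms with w_jk > 0 and w_jk < 0."""
    terms = [(w, w * y) for w, y in zip(weights_to_k, outputs)]
    excitation = sum(term for w, term in terms if w > 0)
    inhibition = sum(term for w, term in terms if w < 0)
    return excitation, inhibition


y = [1.0, 0.5, 0.25]     # hypothetical outputs y_j of the connected units
w_k = [0.8, -0.4, 0.2]   # hypothetical weights w_jk into unit k
theta_k = -0.1           # hypothetical bias term of unit k

s_k = net_input(y, w_k, theta_k)
excitation, inhibition = split_contributions(y, w_k)
```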

Activation and output rules:
We also need a rule which gives the effect of the total input on the activation of the unit. We need a function $f_k$ which takes the total input $s_k(t)$ and the current activation $y_k(t)$ and produces a new value of the activation of the unit $k$ (Eq. 3):
$$y_k(t+1) = f_k\big(y_k(t),\, s_k(t)\big) \qquad (3)$$
Often, the activation function is a non-decreasing function of the total input of the unit (Eq. 4):
$$y_k(t+1) = f_k\big(s_k(t)\big) = f_k\Big(\sum_j w_{jk}(t)\, y_j(t) + \theta_k(t)\Big) \qquad (4)$$
although activation functions are not restricted to non-decreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sign function), a linear or semilinear function, or a smoothly limiting threshold (Joarder et al., 2006; Tom, 1997).
Medical doctors use a combination of a patient's case history and current symptoms to reach a diagnosis when a patient is ill. In order to recognize the combination of symptoms and history that points to a particular disease, the doctor's brain accesses memory of previous patients, as well as information that has been learned from books or other doctors. One of the most important problems of medical diagnosis is the subjectivity of the specialist, in particular in pattern recognition activities; that is, the final diagnosis is closely related to the experience of the professional. This is due to the fact that the result does not depend on a systematized solution but on the interpretation of the patient's signal. It was highlighted in Brause (2001) that almost all physicians are confronted during their training with the task of learning to diagnose. There, they have to solve the problem of deducing certain diseases or formulating a treatment based on more or less specified observations and knowledge.
An ANN is an artificial intelligence tool that can identify arbitrary nonlinear multiparametric discriminant functions directly from clinical data. The use of ANNs has gained increasing popularity for applications where the dependency between dependent and independent variables is either unknown or very complex. The applicability of ANNs to complex relationships makes them highly attractive for the study of complex medical decision making. Recent applications include (Mesut and Bob, 2004) diagnosis, staging and progression of prostate cancer, progression of benign prostate hyperplasia and bladder cancer recurrence in TA/T1 bladder cancers.

RESULTS AND DISCUSSION
Predictive models are used in a variety of medical domains for diagnostic and prognostic tasks. These models are built from "experience", which constitutes data acquired from actual cases. The data can be preprocessed and expressed as a set of rules, as is often the case in knowledge-based expert systems, or serve as training data for statistical and machine learning models. Among the options in the latter category, the most popular models in medicine are logistic regression and artificial neural networks (Mesut and Bob, 2004). ANNs are nonlinear regression computational devices that have been used for over 45 years in classification and survival prediction in several biomedical systems, including colon cancer. Multilayer feed-forward neural networks have been widely used for financial forecasting due to their ability to correctly classify and predict the dependent variable (Vellido et al., 1999). Back propagation is by far the most popular training algorithm used to perform learning for multilayer feed-forward neural networks (Russell and Norving, 1995).
According to Ay et al. (2010), there is currently no instrument to stratify patients presenting with ischemic stroke according to early risk of recurrent stroke. However, according to Communications and Public Liaison (2008), the symptoms of stroke are distinct because they happen quickly: for instance, sudden numbness or weakness of the face, arm or leg (especially on one side of the body); sudden confusion, trouble speaking or understanding speech; sudden trouble seeing in one or both eyes; sudden trouble walking, dizziness, loss of balance or coordination; and sudden severe headache with no known cause. It was therefore suggested that the best treatment for stroke is prevention. Several risk factors increase the chance of having a stroke, including high blood pressure, heart disease, smoking, diabetes and high cholesterol.
Cerebrovascular Accident is an acute focal neurological deficit resulting from cerebrovascular disease. It is difficult to be sure clinically about the type of stroke because in the majority of cases there are no specific differentiating features. However, findings have shown that features such as sudden onset of coma or a changing state of consciousness with severe headache, vomiting and meningeal irritation could suggest an intracranial bleed.
Similarly, in cerebral infarction patients usually present with sudden onset of stroke with lateralizing neurological deficit (hemiparesis, aphasia, homonymous hemianopia), with or without clinically detectable risk factors such as hypertension, atrial fibrillation, rheumatic heart disease or recent myocardial infarction (Guieis and Perez, 1988).
Algorithm for Network Training:
A feed-forward network has a layered structure. Each layer consists of units which receive their input from units in the layer directly below and send their output to units in the layer directly above. There are no connections within a layer. The $N_i$ inputs are fed into the first layer of $N_{h,1}$ hidden units. The input units are merely 'fan-out' units; no processing takes place in these units. The activation of a hidden unit is a function $F_1$ of the weighted inputs plus a bias. The output of the hidden units is distributed over the next layer of $N_{h,2}$ hidden units, and so on until the last layer of hidden units, whose outputs are fed into a layer of $N_o$ output units. Although back-propagation can be applied to networks with any number of layers, in most applications a feed-forward network with a single layer of hidden units is used, with a sigmoid activation function for the units (Hornik et al., 1989; Funahashi, 1989; Cybenko, 1989; Hartman et al., 1990).
We consider a general three-layer network with $n_1$ neurons in the first layer, $n_2$ neurons in the second layer and $n_3$ neurons in the third layer, as presented in Fig. 2.

Gradient descent adaptation method:
For any output neuron we can derive the relation between the output vector $y$ and the input vector $x = [x_1, x_2, \ldots, x_{n_0}]$.
For the value of the $i$th output $y_{3i}$ in the third layer we have (Eq. 5):
$$y_{3i} = f_{3i}(s_{3i}) \qquad (5)$$
with $f_{3i}$ the transfer function of the particular output neuron and $s_{3i}$ the weighted input of that neuron, defined by (Eq. 6):
$$s_{3i} = \sum_{j=1}^{n_2} w_{3i,j}\, y_{2j} + w_{3i,0} \qquad (6)$$
with $w_{3i,j}$ the weight in the connection between the $i$th neuron in the third layer and the output $y_{2j}$ of neuron $j$ in the second layer, and $w_{3i,0}$ the threshold weight of the particular output neuron.
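For the value of the $j$th output $y_{2j}$ in the second layer we obtain similarly (Eq. 7 and 8):
$$y_{2j} = f_{2j}(s_{2j}) \qquad (7)$$
$$s_{2j} = \sum_{k=1}^{n_1} w_{2j,k}\, y_{1k} + w_{2j,0} \qquad (8)$$
and for the $k$th output $y_{1k}$ in the first layer (Eq. 9 and 10):
$$y_{1k} = f_{1k}(s_{1k}) \qquad (9)$$
$$s_{1k} = \sum_{m} w_{1k,m}\, x_m + w_{1k,0} \qquad (10)$$
with $x_m$ the $m$th input value.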

Definition:
A simple perceptron is a computing unit with threshold $\theta$ which, when receiving the $n$ real inputs $x_1, x_2, \ldots, x_n$ through edges with the associated weights $w_1, w_2, \ldots, w_n$, outputs 1 if the inequality $\sum_{j=1}^{n} w_j x_j \geq \theta$ holds and otherwise 0 (Raul, 1996).
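The definition can be illustrated with a short Python sketch. The weights and threshold below are hypothetical values chosen so that the unit realizes the logical AND of two binary inputs; they are not taken from the study.

```python
def simple_perceptron(inputs, weights, theta):
    """Output 1 if sum_j w_j * x_j >= theta, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0


# Hypothetical parameters: with both weights 1 and threshold 2,
# the inequality holds only when both inputs are 1.
and_weights = [1.0, 1.0]
and_theta = 2.0

truth_table = {
    (a, b): simple_perceptron([a, b], and_weights, and_theta)
    for a in (0, 1)
    for b in (0, 1)
}
```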
By substitution we obtain from the equations above for the relation between the $i$th output and the inputs $x_1, x_2, \ldots, x_m$ of the network (Eq. 11):
$$y_{3i} = f_{3i}\Big(\sum_j w_{3i,j}\, f_{2j}\Big(\sum_k w_{2j,k}\, f_{1k}\Big(\sum_m w_{1k,m}\, x_m + w_{1k,0}\Big) + w_{2j,0}\Big) + w_{3i,0}\Big) \qquad (11)$$
with the summation over the number of neurons in the particular layer.
By introducing a dummy neuron in layers 2 and 1 with constant transfer functions $f_{20}(\cdot) = 1$ and $f_{10}(\cdot) = 1$, and by adding a constant additional input $x_0 = 1$ to the net, we can incorporate the thresholds $w_{pq,0}$ of the neurons in the weighted input of the neurons by the terms $w_{3i,0} f_{20}(\cdot)$, $w_{2j,0} f_{10}(\cdot)$ and $w_{1k,0}\, x_0$ (Veelenturf, 1995).
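The nested relation between inputs and outputs, and the effect of absorbing the thresholds into the weights through a constant input of 1, can be checked numerically. The sketch below assumes a sigmoid transfer function for every neuron and uses hypothetical weights for a small 2-2-2-1 network; the function names are illustrative only.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def layer_with_thresholds(inputs, weights, thresholds):
    """Outputs f(sum_m w[k][m] * x[m] + threshold[k]) for each neuron k."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + w0)
        for row, w0 in zip(weights, thresholds)
    ]


def layer_with_dummy_input(inputs, extended_weights):
    """Same layer, with the threshold in column 0 and a constant input of 1."""
    extended_inputs = [1.0] + list(inputs)
    return [
        sigmoid(sum(w * x for w, x in zip(row, extended_inputs)))
        for row in extended_weights
    ]


# Hypothetical (weights, thresholds) for each of the three layers.
layers = [
    ([[0.2, -0.4], [0.7, 0.1]], [0.05, -0.3]),   # first layer
    ([[0.5, -0.6], [-0.1, 0.9]], [0.2, 0.0]),    # second layer
    ([[1.2, -0.8]], [-0.15]),                    # third (output) layer
]
x = [0.6, -1.0]  # hypothetical network input

y_thresholds = x
y_dummy = x
for weights, thresholds in layers:
    y_thresholds = layer_with_thresholds(y_thresholds, weights, thresholds)
    # Prepend each threshold weight so it multiplies the constant input 1.
    extended = [[w0] + row for row, w0 in zip(weights, thresholds)]
    y_dummy = layer_with_dummy_input(y_dummy, extended)
```

Both loops compute the same network output, which is the reason the thresholds can be handled as ordinary weights during training.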
Theorem 1: A function $g_w: \mathbb{R}^n \to \mathbb{R}$ that can be realized by a single-neuron perceptron with transfer function $f$ can go exactly through the samples of the data set $[x_j, t(x_j)] \in D$ iff, for every element $[x_j, t(x_j)]$ from $D$, the vector (Eq. 12) is obtained from $n+1$ elements $[x_j, t(x_j)]$ from the data set $D$.
The sigmoid function is the commonly used transfer function (Russell and Norving, 1995); it is given by Eq. 13-15. Activation can take any real value, including negative values, if the particular output activation function supports this.
In the supervised learning paradigm (Jain et al., 1996), the network is given a desired output for each input pattern. During the learning process, the actual output $y$ generated by the network may not be equal to the desired output $d$. The basic principle of the error-correction learning rule is to use the error signal to modify the connection weights so as to gradually reduce this error.
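As an illustration of the sigmoid transfer function, the sketch below assumes the standard logistic form $f(s) = 1/(1 + e^{-s})$, whose derivative can be written as $f(s)(1 - f(s))$; this particular form is an assumption here. The closed-form derivative is compared with a central finite difference at a few hypothetical points.

```python
import math


def sigmoid(s):
    """Standard logistic function (assumed form of the sigmoid)."""
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_derivative(s):
    """Derivative expressed through the function value: f(s) * (1 - f(s))."""
    y = sigmoid(s)
    return y * (1.0 - y)


def numerical_derivative(f, s, h=1e-6):
    """Central finite-difference approximation of df/ds."""
    return (f(s + h) - f(s - h)) / (2.0 * h)


# (point, closed-form derivative, finite-difference derivative)
checks = [
    (s, sigmoid_derivative(s), numerical_derivative(sigmoid, s))
    for s in (-2.0, 0.0, 1.5)
]
```

Expressing the derivative through the output itself is convenient in back-propagation, because the output is already available from the forward pass.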
Theorem 2: Given a single-neuron perceptron that can realize functions $g_w: \mathbb{R}^n \to \mathbb{R}$, the weight vector $w$ of the function $g_w$ with a minimal MSE for a given data set $D$ can be determined from a set of linear equations if:
• … for all $x_i$ of $D$ is 'small'
• The derivative $dE/ds = 0$ only for its minimum
• The transfer function $f$ of the neuron has an inverse $f^{-1}$
Theorem 3: The MSE of a multi-layer network will decrease if a weight $w_i$ is changed according to $\Delta w_i = -\varepsilon\, \partial E / \partial w_i$ for some $\varepsilon > 0$ sufficiently small.

Theorem 4:
The MSE of a multi-layer network will decrease if the weight vector $w$ is changed according to:
$$\Delta w = -\varepsilon\, \nabla E(w)$$
for some $\varepsilon > 0$ sufficiently small, with $\nabla E(w)$ the gradient vector of $E$ with respect to $w$. The proportionality constant $\varepsilon$ is called the learning rate. The weight change prescribed in the above theorem is frequently called the learning rule based on the gradient descent method of the MSE (Veelenturf, 1995).
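The learning rule of Theorem 4 can be tried out on the smallest possible case. The sketch below is an illustration under stated assumptions, not the procedure used in the study: a single sigmoid neuron, a hypothetical four-sample data set (the logical OR), the MSE taken as the mean of the squared errors, and a hypothetical learning rate of 0.5.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def predict(w, x):
    """Single sigmoid neuron; w[0] is the threshold weight."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def mse(w, data):
    """Mean of the squared errors over all (input, target) pairs."""
    return sum((t - predict(w, x)) ** 2 for x, t in data) / len(data)


def mse_gradient(w, data):
    """Analytic gradient of the MSE with respect to every weight."""
    grad = [0.0] * len(w)
    for x, t in data:
        y = predict(w, x)
        # d/ds of (t - y)^2 / N, with y = sigmoid(s) and dy/ds = y * (1 - y).
        common = -2.0 * (t - y) * y * (1.0 - y) / len(data)
        grad[0] += common
        for i, xi in enumerate(x, start=1):
            grad[i] += common * xi
    return grad


def gradient_step(w, data, epsilon):
    """Apply delta_w = -epsilon * grad E(w)."""
    return [wi - epsilon * gi for wi, gi in zip(w, mse_gradient(w, data))]


# Hypothetical learning set: logical OR of two binary inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w = [0.1, -0.2, 0.3]  # hypothetical initial weights
errors = [mse(w, data)]
for _ in range(50):
    w = gradient_step(w, data, epsilon=0.5)
    errors.append(mse(w, data))
```

The list `errors` records the MSE after each step, so the effect of the learning rate can be inspected directly by changing `epsilon`.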
Given some finite learning set $L \subseteq D$ of $N$ pairs $[x_i, t(x_i)]$, let $U$ be the set of input vectors in $L$. We want to have for each input vector $x_i \in U$ some target output vector $t(x_i)$. Let $t_j(x_i)$ be the target value of the $j$th output neuron for input vector $x_i$ and let $y_j(x_i)$ be the actual output of the $j$th neuron in the output layer of the neural network. Then the sum square error from using the Adaline on all training patterns is given by Eq. 16 and 17, with $N$ the number of samples in the learning set $L$, $n(x_i)$ the number of times input $x_i$ occurs in $U$ and $n_3$ the number of output neurons in the third layer (Veelenturf, 1995). This expression is called the energy or error function of the net.
The error $E$ is a function of all weights in the net. For a given finite learning set $L$ and some initial random distribution of the weights, the MSE will in general be large. During a so-called learning period, we want to change the weights step by step such that $E$ decreases to its global minimum. Only when the structure of the neural net is able to realize the learning set $L$ exactly will the minimum of $E$ be zero. For an infinitesimal increment $\Delta w_j$ of the weight $w_j$ in the connection of some neuron somewhere in the net, Eq. 18 holds.

Weight adjustment variants:
With back propagation, there are a number of other weight adjustment strategies that can speed up learning or avoid becoming trapped in local minima. One of these involves a momentum term, where a portion of the last weight change is applied to the current weight adjustment round. The variant is based on the following weight adjustment (Eq. 19):
$$w_{ij} \leftarrow w_{ij} + \varepsilon E u_i \qquad (19)$$
For the given error $E$ and activation (or cell output) $u_i$, we multiply by a learning rate $\varepsilon$ and add this to the current weight. The result is a minimization of the error at this cell, while moving the output cell activation closer to the expected output.
The difference is that a portion of the last weight change (identified as $\Delta w_{ij}$) is accumulated using a small momentum multiplier $m$ (Eq. 20), with $\varepsilon$ the learning rate, $E$ the error and $u_i$ the cell output:
$$w_{ij} \leftarrow w_{ij} + \varepsilon E u_i + m\, \Delta w_{ij} \qquad (20)$$
The last weight change is stored using Eq. 21:
$$\Delta w_{ij} = \varepsilon E u_i \qquad (21)$$
The adaptation of weight $w_j$ can be obtained using Theorem 1 (Eq. 22). According to Theorem 2, the learning rule implies Eq. 23 for the adaptation of the weights connecting neuron $j$ in the second layer to neuron $i$ in the third layer, and Eq. 24 for the adaptation of the weights connecting neuron $k$ in the first layer to neuron $j$ in the second layer. If we represent all the outputs of the $n_1$ neurons in the first layer, including the constant component $x_0 = 1$, with a vector $\hat{x}$, then we can write, with the use of the extended weight vector $\hat{w}_{2j} = [w_{2j,0}, w_{2j,1}, \ldots, w_{2j,n_1}]$, Eq. 25 for the adaptation of the weight vector $w_{2j}$ of output neuron $j$. The adaptation of the weights connecting the network input to the first layer is given by Eq. 26. With a separate notation for the error of the output neuron for an input $x_i$, we can thus write the following theorem (Veelenturf, 1996).
Theorem 5: The adaptation of the weight vector $w$ of a neuron in the second layer of a two-layer perceptron, in order to minimize the MSE for a given set of target values:
For the adaptation of the weights connecting the net input with neuron $k$ in the first layer, we obtain Eq. 27 according to Theorem 2. The sum of products over $j$ can be considered as the error (back-propagated from the output neurons) of neuron $k$ in the first layer. So with Eq. 28 we can write Eq. 29. If we represent all the inputs of the network, including the constant component, with a vector, then we can write, with the use of the extended weight vector, the following theorem for the adaptation of the weight vector of input neuron $k$.
Theorem 6: The adaptation of the weight vector of a neuron in the first layer of a two-layer perceptron, in order to minimize the MSE for a given set of target values for a given finite set $U$ of input vectors, is:
We define the internal learning rate for an input vector for the weights of neuron $k$ in the layer as:
The hidden layer of the network consists of 10 nodes. A good method to find out how many hidden neurons are needed is to train the network several times (10+), take the average accuracy at the end and then use this accuracy as a measure to compare different architectures. The weights between the layers are randomly initialized to small values in the range [-0.5, 0.5]. Technically, the weights can be set to any fixed value, since the weight updates will eventually correct them, but this may be inefficient. The whole point of setting the initialization range is to reduce the number of epochs required for training: weights close to the required ones will need fewer changes than weights that differ greatly. The learning rate was set between 0.1 and 0.9, while the epoch was set at 10.
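The training setup described above (16 inputs, 10 hidden nodes, one output, weights drawn from [-0.5, 0.5], a fixed number of epochs) can be sketched in plain Python. This is a hedged illustration only: the data are synthetic placeholders rather than patient records, the update is a plain online back-propagation step without momentum, the logistic sigmoid is assumed for every unit, and the learning rate of 0.5 is one hypothetical value from the stated range.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def init_weights(n_out, n_in, rng):
    """n_out rows of n_in + 1 weights in [-0.5, 0.5]; index 0 is the threshold weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_output(weights, inputs):
    return [sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
            for row in weights]


def train_epoch(hidden, output, samples, rate):
    """One pass of online back-propagation; returns the mean squared error."""
    total = 0.0
    for x, target in samples:
        h = layer_output(hidden, x)
        y = layer_output(output, h)[0]
        total += (target - y) ** 2
        # Error term of the single output unit.
        delta_out = (target - y) * y * (1.0 - y)
        # Error back-propagated to each hidden unit (uses pre-update weights).
        delta_hid = [delta_out * output[0][j + 1] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        output[0][0] += rate * delta_out
        for j, hj in enumerate(h):
            output[0][j + 1] += rate * delta_out * hj
        for j, row in enumerate(hidden):
            row[0] += rate * delta_hid[j]
            for m, xm in enumerate(x):
                row[m + 1] += rate * delta_hid[j] * xm
    return total / len(samples)


rng = random.Random(0)
# Synthetic placeholder data: 16 binary flags, label 1 if at least 8 are set.
samples = []
for _ in range(60):
    x = [float(rng.randint(0, 1)) for _ in range(16)]
    samples.append((x, 1.0 if sum(x) >= 8 else 0.0))

hidden = init_weights(10, 16, rng)   # 10 hidden nodes, 16 inputs
output = init_weights(1, 10, rng)    # one output node
history = [train_epoch(hidden, output, samples, rate=0.5) for _ in range(10)]
```

Repeating the run with different seeds and hidden-layer sizes, and averaging the final entries of `history`, is one way to carry out the architecture comparison suggested above.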
Classically we split the data set into three parts, with 60, 20 and 20% for the training, generalization and validation data sets respectively:
• The training data is what we use to train the network and update the weights, so it must be the largest chunk
• The generalization data is run through the NN at the end of each epoch to see how well the network handles unseen data
• The validation data is run through the neural network once training has completed (Table 1)
The deciding rule to stop training is to stop once a set number of epochs have elapsed.
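The 60/20/20 split can be sketched as follows. The function name `split_dataset`, the fixed seed and the use of integer record identifiers in place of real records are assumptions made for the example.

```python
import random


def split_dataset(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the records and split them into training, generalization and validation sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_gen = int(round(fractions[1] * len(shuffled)))
    training = shuffled[:n_train]
    generalization = shuffled[n_train:n_train + n_gen]
    # The remainder goes to validation so that no record is dropped.
    validation = shuffled[n_train + n_gen:]
    return training, generalization, validation


records = list(range(100))  # hypothetical identifiers for 100 records
training, generalization, validation = split_dataset(records)
```

Shuffling before splitting avoids any ordering in the records leaking into a single subset.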

CONCLUSION
In this research study, the application of Artificial Neural Networks to the classification of Cerebrovascular Accident Attack in a patient is considered. The theory behind this classification using three-layer feed-forward artificial neural networks with error back-propagation has been described. Data were collected for 100 records (60 males and 40 females) of patients from the Federal Medical Centre, Owo, Nigeria, and the Artificial Neural Network classifier was trained using the back-propagation algorithm with a flexible sigmoid activation function and one hidden layer, with 16 input nodes representing stroke onset symptoms at the input layer, 10 nodes at the hidden layer and one node at the output layer representing the type of attack. The learning rate was set between 0.1 and 0.9, while the epoch was set at 10. The network was successfully trained with 150 epochs and an MSE of 0.0698843.