ITERATIVE DICHOTOMISER-3 ALGORITHM IN DATA MINING APPLIED TO DIABETES DATABASE

In this study, eight major factors playing a significant role in diabetes among the Pima Indian population are analyzed. Real-time data are taken from the large dataset of the National Institute of Diabetes and Digestive and Kidney Diseases. The data are analyzed by the logistic regression method, using SPSS 7.5 statistical software, to isolate the most significant of the eight factors. The significant factors are then applied to a decision tree technique, the Iterative Dichotomiser-3 (ID3) algorithm, which leads to significant conclusions about diabetes, a disorder that poses a great threat to mankind in the coming era. Combining data mining techniques with medical database research can lead to life-saving conclusions for physicians at critical times.


INTRODUCTION
ID3 begins by choosing a random subset of the training instances. This subset is called the window. The procedure builds a decision tree that correctly classifies all instances in the window. The tree is then tested on the training instances outside the window. If all the instances are classified correctly, the procedure halts. Otherwise, it adds some of the incorrectly classified instances to the window and repeats the process. This iterative strategy is empirically more efficient than considering all instances at once. In building a decision tree, ID3 selects the feature that minimizes the entropy function given below and thus best discriminates among the training instances.

Data were collected from 768 females of Pima Indian origin who were tested for the presence of diabetes mellitus, of whom 268 were found to be positive. A sample of 336 records was selected after deleting the records with zero values (Quinlan, 1986).

In data mining, a decision tree is a predictive model; that is, a mapping of observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification tree or regression tree. In these tree structures, leaves represent classifications (Ankerst et al., 1999) and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning or, more loosely, decision trees. In decision theory and decision analysis, a decision tree is a graph or model of decisions and their possible consequences, including chance event outcomes, resource costs and utility. It can be used to create a plan to reach a goal. Decision trees are constructed to help with making decisions (Almuallim, 1996). A decision tree is a special form of tree structure and a descriptive means for calculating conditional probabilities.

Decision Trees in Data Mining
Decision tree learning is a common method used in data mining. Each interior node corresponds to a variable; an arc to a child represents a possible value of that variable. A leaf represents a possible value of the target variable given the values of the variables represented by the path from the root. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner. The recursion is completed when splitting is no longer feasible, or when a single classification can be applied to every element of the derived subset. A random forest classifier uses a number of decision trees in order to improve the classification rate. In data mining (Brodley and Utgoff, 1995), trees can also be described as the synergy of mathematical and computing techniques that aids the description, categorization and generalization of a given set of data. Data come in records of the form $(x, y) = (x_1, x_2, x_3, \ldots, x_k, y)$. The dependent variable, $y$, is the variable that we are trying to understand, classify or generalize. The other variables $x_1, x_2, x_3, \ldots$ are the variables that help us make predictions.
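The node, arc and leaf structure described above can be illustrated with a minimal Python sketch. The names `Node`, `classify` and `toy_tree`, as well as the variable names and values in the toy tree, are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Node:
    """Interior node: tests one variable; each arc is one value of that variable."""
    variable: str
    children: Dict[str, "Tree"] = field(default_factory=dict)


# A leaf is simply the predicted value of the target variable.
Tree = Union[Node, str]


def classify(tree: Tree, record: Dict[str, str]) -> str:
    """Follow the arcs that match the record's values until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.children[record[tree.variable]]
    return tree


# Hypothetical hand-built tree (illustrative values only, not the paper's result).
toy_tree = Node("age", {
    "middle": "A",
    "young": Node("plasma", {"high": "A", "normal": "B"}),
})

print(classify(toy_tree, {"age": "young", "plasma": "normal"}))
```

Each path from the root to a leaf corresponds to one conjunction of attribute values, which is why a learned tree can also be read as a set of classification rules.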

Information Entropy by Claude Shannon
In information theory, the Shannon entropy (Ankerst et al., 1999), or information entropy, is a measure of the uncertainty associated with a random variable. It can be interpreted as the average shortest message length, in bits, that can be sent to communicate the true value of the random variable to a recipient. This represents a fundamental mathematical limit on the best possible lossless data compression of any communication: the shortest average number of bits that can be sent to communicate one message out of all the possibilities is the Shannon entropy (Shannon, 1948). The formula for computing the entropy is given in the following definition.

Definition of Information Entropy
The information entropy of a discrete random variable $X$ that can take the possible values $\{x_1, \ldots, x_n\}$ is defined to be

$$H(X) = E[I(X)] = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i),$$

where $I(X)$ is the information content of $X$, which is itself a random variable, and $p(x_i) = P(X = x_i)$ is the probability mass function of $X$.

Iterative Dichotomiser-3 (ID3) is an algorithm used to generate a decision tree. It is a heuristic, however, and does not always produce the smallest tree. Occam's razor, the preference for smaller trees, is formalized using the concept of information entropy: at each node, ID3 selects the feature that minimizes the entropy function.
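As a small illustration of the definition, the following Python sketch computes the entropy of a probability mass function. The function name `shannon_entropy` and its `base` parameter are assumptions made for this example; base 2 gives the result in bits.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float], base: float = 2.0) -> float:
    """Entropy of a probability mass function; terms with p = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log(p, base) for p in probs if p > 0)


print(shannon_entropy([0.5, 0.5]))   # fair coin: one bit of uncertainty
print(shannon_entropy([1.0]))        # certain outcome: no uncertainty
```

A uniform distribution maximizes the entropy, while a distribution concentrated on a single value has zero entropy, which is the situation ID3 aims for in each leaf.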

LOGISTIC REGRESSION OUTPUTS FROM SPSS 7.5
The logistic regression method (Tsien et al., 1998) was applied to the Pima Indian diabetes database, using SPSS 7.5 software, to bring out the significant factors, such as age and obesity, in the cause of the diabetes disorder. These factors are fuzzified to form a sample decision tree by the ID3 algorithm.

Analysis of Logistic Regression Results
After dependent variable encoding, and from the observations of Table 1, we find that the following factors play a significant role in the cause of diabetes: Age, Obesity, PDF and Plasma level.

The ID3 procedure (Almuallim, 1996) then consists of the following steps:

Step 1: Select a random subset W (called the "window") from the training set.

Step 2: Build a decision tree for the current window. Select the best feature, that is, the one which minimizes the entropy function H:

$$H = -\sum_{i} p_i \log p_i$$

where $p_i$ is the probability associated with the $i$-th class (optimal values are available and the optimum entropy may be found by discrete probabilistic methods). The entropy is calculated for each value of the feature. The sum of these entropies, weighted by the probability of each value, is the entropy for the feature. Categorize the training instances into subsets by this feature. Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied.

Step 3: Scan the entire training set for exceptions to the decision tree.

Step 4: If exceptions are found, insert some of them into W and repeat from Step 2. The insertion may be done either by replacing some of the existing instances in the window or by augmenting it with the new exceptions. In practice, a statistical criterion can be applied to stop the tree from growing as long as most of the instances are classified correctly (Fig. 2).
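The window strategy of the procedure above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the names `window_training`, `build_lookup`, `window_size` and `add_per_round` are hypothetical, the toy data are invented, and a simple lookup-table learner stands in for the decision tree builder so that the loop itself stays short. The window is augmented with exceptions rather than having existing instances replaced.

```python
import random
from collections import Counter


def build_lookup(window):
    """Stand-in learner (NOT a decision tree): memorizes the window and
    falls back to the window's most frequent class for unseen instances."""
    table = {}
    for features, label in window:
        table.setdefault(features, label)
    default = Counter(label for _, label in window).most_common(1)[0][0]
    return lambda features: table.get(features, default)


def window_training(data, build, window_size, add_per_round, seed=0):
    """Train on a growing window until no training instance is misclassified."""
    rng = random.Random(seed)
    window = rng.sample(list(data), window_size)       # select a random window
    while True:
        model = build(window)                           # build a model for the window
        exceptions = [inst for inst in data             # scan the whole training set
                      if model(inst[0]) != inst[1]]
        if not exceptions:
            return model, window
        new = [e for e in exceptions if e not in window][:add_per_round]
        if not new:                                     # e.g. contradictory data
            return model, window
        window.extend(new)                              # augment the window and repeat


# Hypothetical toy data: ((age, plasma), class)
toy_data = [
    (("young", "high"), "A"),
    (("young", "normal"), "B"),
    (("middle", "high"), "A"),
    (("middle", "normal"), "A"),
]

model, window = window_training(toy_data, build_lookup, window_size=2, add_per_round=1)
print(len(window), "instances in the final window")
```

Any learner that returns a callable classifier could be passed as `build`; a real ID3 tree builder would take the place of `build_lookup`.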

Pseudo Code for ID3 Algorithm
(Grefenstette et al., 1990)

function ID3 (I, O, T) {
    /* I is the set of input attributes
     * O is the output attribute
     * T is a set of training data
     *
     * function ID3 returns a decision tree */
    if (T is empty) {
        return a single node with the value "wrong";
    }
    if (all records in T have the same value for O) {
        return a single node with that value;
    }
    if (I is empty) {
        return a single node with the value of the most frequent value of O in T;
    }
    /* case where we can't return a single node */
    compute the information gain for each attribute in I relative to T;
    let X be the attribute with largest Gain(X, T) of the attributes in I;
    let {x_j | j = 1, 2, .., m} be the values of X;
    let {T_j | j = 1, 2, .., m} be the subsets of T when T is partitioned according to the value of X;
    return a tree with the root node labelled X and arcs labelled x_1, x_2, x_3 ...
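A runnable Python sketch of this pseudo code is given below. Because the pseudo code is cut off after the arc labels, the recursive call on each subset with the chosen attribute removed is an assumption (it is the usual continuation of ID3). The names `entropy`, `gain`, `id3` and the toy records are hypothetical. Choosing the attribute with the largest gain is equivalent to choosing the one with the minimum weighted entropy, since the gain is the entropy of T minus that weighted entropy.

```python
import math
from collections import Counter


def entropy(records, target):
    """Entropy (in bits) of the target attribute over a list of records."""
    n = len(records)
    counts = Counter(r[target] for r in records)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def gain(records, attr, target):
    """Information gain of splitting the records on attr."""
    n = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(records, target) - remainder


def id3(attrs, target, records):
    """Return a leaf value or a nested dict {attribute: {value: subtree}}."""
    if not records:
        return "wrong"
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(records, a, target))
    rest = [a for a in attrs if a != best]
    # Assumed continuation: recurse on each subset with the chosen attribute removed.
    return {best: {value: id3(rest, target, [r for r in records if r[best] == value])
                   for value in dict.fromkeys(r[best] for r in records)}}


# Hypothetical toy records (not the study's sample data).
toy_records = [
    {"age": "middle", "plasma": "high", "cls": "A"},
    {"age": "middle", "plasma": "normal", "cls": "A"},
    {"age": "middle", "plasma": "normal", "cls": "A"},
    {"age": "young", "plasma": "high", "cls": "A"},
    {"age": "young", "plasma": "normal", "cls": "B"},
]

print(id3(["age", "plasma"], "cls", toy_records))
```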

Application of ID3 Algorithm in Samples of Pima Indian Diabetes Database
The ID3 algorithm builds a decision tree for classifying the following objects (Nunez, 1991):
Objects: Sample data of Pima Indian Diabetes Database
Class A: Acquired diabetes
Class B: Non-diabetic
First, we calculate the entropy for each attribute.

PLASMA:

$$H = \frac{5}{6}\left(-\frac{3}{5}\log\frac{3}{5} - \frac{2}{5}\log\frac{2}{5}\right) + \frac{1}{6}\left(-\frac{1}{1}\log\frac{1}{1}\right) = 0.56$$

Thus, from Table 2, we select the attribute AGE as the first decision node since it is associated with the minimum entropy. This node has two branches: young and middle. Under the branch middle, only Class A objects fall, and hence no further discrimination is needed. Under the branch young, we need another attribute to make further distinctions, so we calculate the entropy for the other two attributes under this branch (Quinlan, 1987).
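The PLASMA entropy can be checked numerically with a short Python sketch. The function name `weighted_entropy` and the representation of each branch as a tuple of class counts are assumptions for this example; the counts (3, 2) and (1,) are read off from the fractions in the expression. The stated value of 0.56 is reproduced when the logarithm is taken as the natural logarithm; with base 2 the same expression evaluates to about 0.81 bits.

```python
import math


def weighted_entropy(branch_counts, log=math.log):
    """Entropy of a feature: per-branch class entropy weighted by branch size."""
    total = sum(sum(counts) for counts in branch_counts)
    h = 0.0
    for counts in branch_counts:
        n = sum(counts)
        branch_h = sum(-(c / n) * log(c / n) for c in counts if c > 0)
        h += n / total * branch_h
    return h


# One branch holds 5 of the 6 objects (3 of one class, 2 of the other);
# the other branch holds the remaining object.
plasma = [(3, 2), (1,)]

print(round(weighted_entropy(plasma), 2))             # natural log -> 0.56
print(round(weighted_entropy(plasma, math.log2), 2))  # base 2      -> 0.81
```

The choice of logarithm base only rescales every entropy by the same constant, so it does not change which attribute has the minimum entropy.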

CONCLUSION
The data mining method using logistic regression implies that Age, Obesity, PDF and Plasma level are the factors to be watched for the onset of diabetes mellitus. The ID3 algorithm applied to the sample database gives a decision tree prediction based on these major factors of diabetes. On a small scale, this paper tries to bring out the dominant factors by applying the Iterative Dichotomiser-3 (ID3) algorithm of data mining. The sample data are chosen from a diabetes database because this pancreatic disorder poses a growing threat to mankind in the coming era. The same idea can be applied to any disease database, with larger samples, to bring out more useful diagnostic findings before complications affect the human population.