Predicting Missing Attribute Values Using k-Means Clustering



INTRODUCTION
Missing attribute values are variables without observations or questions without answers. Even a small amount of missing data can cause serious problems and may lead to wrong conclusions. There are several techniques for assigning values to missing items, but none is absolutely better than the others. Different situations require different solutions; the only really good solution to the missing data problem is not to have any. In Grzymala-Busse and Hu (2001), nine approaches to filling in missing attribute values were introduced, such as selecting the most common attribute value, assigning all possible values of the attribute restricted to the given concept, ignoring examples with unknown attribute values, treating missing attribute values as special values, and the event-covering method. Experiments on ten data sets were conducted to compare their performance. A closest fit approach was also proposed, which compares the vectors of all the attribute pairs from a preterm birth data set and assigns the value from the most similar pair to the missing value. In a more recent effort (Grzymala-Busse, 2005), four interpretations of the meaning of missing attribute values, such as "lost" values and "do not care" values, are discussed, and different approaches from rough set theory are demonstrated for selecting values under each interpretation. Grzymala-Busse and Hu (2001) performed computational studies on medical data, where unknown attribute values were replaced using probabilistic techniques. Greco et al. (1999) used a specific definition of the discernibility relation to analyze unknown attribute values in multicriteria decision problems. Stefanowski and Tsoukias (2001) presented two different semantics for incomplete information, "missing values" and "absent values", and introduced two generalizations of rough set theory to handle these situations.
Nakata and Sakai (2005) examined methods based on valued tolerance relations. They proposed a correctness criterion for extending the conventional rough-set-based methods to handle missing values.
In real-world data sets, missing attribute values are very common. They may arise at the time of data collection, from redundant diagnostic tests, from unknown data and so on. A common approach is to discard all data containing missing values, but this cannot fully preserve the characteristics of the original data. Before assigning values to missing attributes, we must understand the background knowledge and context of the data, which helps in finding the best approach for handling the missing values. Several approaches for dealing with missing attribute values have been proposed in past years. The nine approaches discussed in Grzymala-Busse and Hu (2001) are as follows.

Most Common Attribute Value (MCAV):
It is one of the simplest methods to deal with missing attribute values. The value of the attribute that occurs most often is selected to be the value for all the unknown values of the attribute.

Concept Most Common Attribute Value (CMCAV):
The most common attribute value method does not pay any attention to the relationship between attributes and a decision. The concept most common attribute value method is a restriction of the first method to the concept, i.e., to all examples with the same decision value as the example with the missing attribute value. This time the value of the attribute that occurs most often within the concept is selected to be the value for all the unknown values of the attribute. This method is also called the maximum relative frequency method, or the maximum conditional probability method (given the concept).
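
The difference between MCAV and CMCAV can be illustrated with a minimal sketch. The `"?"` marker for a missing value, the list-of-lists table layout with the decision in the last column, and the function names are assumptions made for this illustration only.

```python
from collections import Counter

MISSING = "?"  # assumed marker for a missing attribute value


def most_common(values):
    """Return the most frequent known value, or None if every value is missing."""
    known = [v for v in values if v != MISSING]
    return Counter(known).most_common(1)[0][0] if known else None


def impute_mcav(rows, attr):
    """MCAV: fill missing values in column `attr` with the overall mode."""
    mode = most_common(row[attr] for row in rows)
    filled = []
    for row in rows:
        new_row = list(row)
        if new_row[attr] == MISSING:
            new_row[attr] = mode
        filled.append(new_row)
    return filled


def impute_cmcav(rows, attr, decision=-1):
    """CMCAV: fill missing values with the mode among rows of the same concept."""
    modes = {
        label: most_common(r[attr] for r in rows if r[decision] == label)
        for label in {row[decision] for row in rows}
    }
    filled = []
    for row in rows:
        new_row = list(row)
        if new_row[attr] == MISSING:
            new_row[attr] = modes[row[decision]]
        filled.append(new_row)
    return filled


# Toy table: temperature, cough, decision (hypothetical data).
rows = [
    ["high", "yes", "flu"],
    ["high", "no", "flu"],
    ["?", "yes", "flu"],
    ["normal", "no", "healthy"],
    ["normal", "no", "healthy"],
    ["normal", "yes", "healthy"],
    ["?", "no", "healthy"],
]
```

On this toy table, MCAV fills both gaps with "normal" (the overall mode), whereas CMCAV fills the "flu" row with "high" and the "healthy" row with "normal".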

C4.5:
This method is based on entropy and on splitting the example with missing attribute values across all concepts.

Method of Assigning All Possible Values of the Attribute (APV):
In this method, an example with a missing attribute value is replaced by a set of new examples, in which the missing attribute value is replaced by all possible values of the attribute. If we have some examples with more than one unknown attribute value, we will do our substitution for one attribute first and then do the substitution for the next attribute until all unknown attribute values are replaced by new known attribute values.
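
A minimal sketch of the expansion follows, assuming a `"?"` marker for missing values and a hypothetical `domains` mapping from column index to the list of possible values. It enumerates all combinations at once with `itertools.product`, which yields the same set of examples as substituting one attribute after another.

```python
from itertools import product

MISSING = "?"  # assumed marker for a missing attribute value


def expand_all_possible_values(row, domains):
    """Replace one example by a list of examples covering every possible value.

    `domains` maps a column index to the possible values of that attribute.
    """
    missing_cols = [i for i, v in enumerate(row) if v == MISSING]
    if not missing_cols:
        return [list(row)]
    expanded = []
    for combo in product(*(domains[i] for i in missing_cols)):
        new_row = list(row)
        for col, value in zip(missing_cols, combo):
            new_row[col] = value
        expanded.append(new_row)
    return expanded


# One example with two unknown attributes (hypothetical data).
example = ["?", "yes", "?", "flu"]
domains = {0: ["high", "normal"], 2: ["mild", "moderate", "severe"]}
new_examples = expand_all_possible_values(example, domains)
print(len(new_examples))  # 6
```

The number of generated examples is the product of the domain sizes of the missing attributes, which is why this method can enlarge the table considerably.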

Method of Assigning All Possible Values of the Attribute Restricted to the Given Concept (APVRC):
The method of assigning all possible values of the attribute does not take the concept into account. This method is a restriction of the method of assigning all possible values of the attribute to the concept indicated by the example with the missing attribute value.

Method of Ignoring Examples with Unknown Attribute Values (IGNORE):
This method is the simplest: just ignore the examples which have at least one unknown attribute value and then use the rest of the table as input to the successive learning process.

Event-Covering Method (EC):
This method is also a probabilistic approach to fill in the unknown attribute values. By event-covering we mean covering or selecting a subset of statistically interdependent events in the outcome space of variable-pairs, disregarding whether or not the variables are statistically independent.

A Special LEM2 Algorithm (LEM2):
A special version of LEM2 that works for unknown attribute values omits the examples with unknown attribute values when building the block for that attribute. Then, a set of rules is induced by using the original LEM2 method.

Method of Treating Missing Attribute Values as Special Values (SPECIAL):
In this method, we deal with the unknown attribute values using a totally different approach: rather than trying to find some known attribute value as its value, we treat "unknown" itself as a new value for the attributes that contain missing values and treat it in the same way as other values.

In this study, k-means clustering is proposed for assigning missing attribute values. The core idea is based on assigning all possible values. Each candidate value is assigned in turn and the dataset is clustered using k-means. The clustering is then checked to see whether the instance with the missing value is placed in the correct cluster; if so, the assigned value is made permanent. Otherwise, the next value is assigned. If no possible value places the instance in the correct cluster, the best-fit value is assigned to the missing attribute. If an instance has more than one missing attribute value, all possible combinations are checked. Once values have been assigned to all missing attributes, feature selection is performed with Bees Colony Optimization (BCO) as discussed in Suguna and Thanushkodi (2010a), and the improved Genetic KNN (Suguna and Thanushkodi, 2010b) is applied to measure classification performance. The rest of the study is organized as follows: the preceding text describes the existing approaches to be compared, and the following text presents the proposed k-means clustering approach for assigning values to missing attributes. The experiments are conducted on different datasets from the medical domain, the results are presented, and the study concludes with a discussion of the performance of the proposed method.

K-means clustering for missing attribute value prediction:
One of the most popular clustering techniques is the k-means clustering algorithm (Pavan et al., 2010; Jaradat et al., 2009). Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e., the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose center is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strengths of k-means are its runtime, which is linear in the number of data elements, and its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, the number of clusters is kept equal to the number of classes.
1. Choose a number of clusters k
2. Initialize cluster centers μ_1, …, μ_k
   a. Pick k data points and set the cluster centers to these points, or
   b. Randomly assign points to clusters and take the means of the clusters
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster
4. Re-compute the cluster centers (mean of the data points in each cluster)
5. Stop when there are no new re-assignments

The instances having missing attribute values are first separated from the original dataset. This gives two datasets, denoted F and M. Dataset F contains the instances in which all attribute values are filled; dataset M contains all the instances having missing attributes. The instances from M are taken one by one and their missing attributes are filled with possible values. The filled instance is then added to dataset F and k-means clustering is applied to F. From the resulting clusters, the newly added instance is checked to see whether it has been clustered in the correct class. If it is in the correct cluster, the assigned value is made permanent and the procedure continues with the next instance in M. If it is in the wrong cluster, the next possible value is assigned and checked, until a value is found that puts the instance in the correct cluster.
At the end of each clustering step the quality of the clustering is measured with the entropy value. There are many different quality measures, and the performance and relative ranking of different clustering algorithms can vary substantially depending on which measure is used. However, if one clustering algorithm performs better than other clustering algorithms on many of these measures, then we can have some confidence that it is truly the best clustering algorithm for the situation being evaluated.
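
The clustering loop and the acceptance test can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: attributes are numeric, a cluster counts as "correct" when the majority class of the complete instances in it equals the class of the filled instance, and the optional `init` argument (not part of the original description) fixes the starting centers so that the example is reproducible. All function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Plain k-means; returns (assignments, centers).

    `init` optionally fixes the starting centers; otherwise k data
    points are sampled at random.
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to its closest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Stop when no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


def impute_by_clustering(full, full_labels, instance, label, attr,
                         candidates, k, init=None):
    """Return the first candidate that places the instance in a cluster whose
    majority class (among complete instances) equals `label`, else None."""
    for value in candidates:
        trial = list(instance)
        trial[attr] = value
        assignments, _ = kmeans(full + [trial], k, init=init)
        cluster = assignments[-1]
        peers = [l for l, a in zip(full_labels, assignments[:-1]) if a == cluster]
        if peers and max(set(peers), key=peers.count) == label:
            return value
    return None  # the caller falls back to the entropy-based best fit


F = [[0, 0], [1, 1], [2, 0], [100, 0], [101, 1], [102, 0]]
labels = ["a", "a", "a", "b", "b", "b"]
value = impute_by_clustering(F, labels, [None, 1], "b", attr=0,
                             candidates=[1, 101], k=2,
                             init=[[0, 0], [100, 0]])
print(value)  # 101
```

In this toy run the first candidate (1) puts the instance among the class "a" points and is rejected; the second candidate (101) puts it among the class "b" points and is accepted, so no further candidates need to be tried.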
We use entropy as a measure of the quality of the clusters (with the caveat that the best entropy is obtained when each cluster contains exactly one data point). Let CS be a clustering solution. For each cluster, the class distribution of the data is calculated first, i.e., for cluster j we compute $p_{ij}$, the "probability" that a member of cluster j belongs to class i. Then, using this class distribution, the entropy of each cluster j is calculated using the standard formula:

$$E_j = -\sum_{i} p_{ij} \log p_{ij}$$

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster:

$$E_{CS} = \sum_{j=1}^{m} \frac{n_j E_j}{n}$$

Where:
n_j = The size of cluster j
m = The number of clusters
n = The total number of data points

From the second iteration onward, the current entropy value is compared with the previous one. If the current entropy is less than the previous one, the presently assigned value is considered the best fit for the missing attribute; otherwise the previously assigned value is kept as the best fit. If the correct cluster is not found for any possible value, the best-fit value is assigned to the missing attribute. If an instance has more than one missing attribute value, all the possible combinations are checked. After the missing attributes have been assigned, feature selection is performed with Bees Colony Optimization (BCO) and the improved Genetic KNN is applied to measure classification performance.
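
The size-weighted entropy of a clustering can be computed with a few lines of code. The sketch below assumes the natural logarithm (the text does not fix the base) and uses hypothetical function names; `assignments[i]` is the cluster index and `labels[i]` the class of data point i.

```python
from collections import Counter
from math import log


def cluster_entropy(labels):
    """Entropy of the class distribution inside one cluster."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def total_entropy(assignments, labels):
    """Sum of the cluster entropies, each weighted by cluster size / n."""
    clusters = {}
    for cluster, label in zip(assignments, labels):
        clusters.setdefault(cluster, []).append(label)
    n = len(labels)
    return sum(len(m) / n * cluster_entropy(m) for m in clusters.values())


pure = total_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])   # each cluster holds one class
mixed = total_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])  # each cluster is a 50/50 mix
```

A clustering whose clusters each contain a single class has entropy zero, while the evenly mixed clustering above has entropy log 2, so lower values indicate a better fit when candidate values are compared.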

Bee Colony Based Reduct (BeeRSAR):
Nature inspires researchers to develop models for solving their problems. Optimization is one field in which these models are frequently developed and applied. Typical examples of nature-inspired optimization algorithms are the Genetic Algorithm, which simulates natural selection and genetic operators; the Particle Swarm Optimization algorithm, which simulates flocks of birds and schools of fish; the Artificial Immune System, which simulates the cell masses of the immune system; the ACO algorithm, which simulates the foraging behaviour of ants; and the Artificial Bee Colony algorithm, which simulates the foraging behaviour of honeybees.
The Artificial Bee Colony (ABC) algorithm for real parameter optimization is a recently introduced optimization algorithm that simulates the foraging behaviour of a bee colony for unconstrained optimization problems (Karaboga and Basturk, 2008). For solving constrained optimization problems, a constraint handling method was incorporated into the algorithm (Srichandum and Rujirayanyong, 2010).
Algorithm: Bee Colony Optimization
1. Initialize the food source positions
2. Each employed bee produces a new food source in her food source site and exploits the better source
3. Each onlooker bee selects a source depending on the quality of her solution, produces a new food source in the selected source site and exploits the better source
4. Determine the source to be abandoned and allocate its employed bee as a scout for searching new food sources
5. Memorize the best food source found so far
6. Repeat steps 2-5 until the stopping criterion is met

In a real bee colony, some tasks are performed by specialized individuals. These specialized bees try to maximize the amount of nectar stored in the hive through an efficient division of labour and self-organization. The minimal model of swarm-intelligent forage selection in a honey bee colony that the ABC algorithm adopts consists of three kinds of bees: employed bees, onlooker bees and scout bees. Half of the colony comprises employed bees and the other half comprises onlooker bees. Employed bees are responsible for exploiting the nectar sources explored before and for giving information to the waiting bees (onlooker bees) in the hive about the quality of the food source site they are exploiting. Onlooker bees wait in the hive and decide which food source to exploit depending on the information shared by the employed bees. Scouts search the environment for a new food source, either at random or depending on an internal motivation or possible external clues. The main steps of the ABC algorithm simulating these behaviours are given in the algorithm above. This procedure can be applied to feature reduction: the bees select feature subsets at random, their fitness is calculated and the best subset is found at each iteration. The procedure is repeated for a number of iterations to find the optimal subset.
In the first step of the algorithm, each employed bee produces a feature subset at random. Consider a conditional feature set C containing N features, and let p bees be chosen as the population size. Half of this population is treated as employed bees and the remainder as onlooker bees. For each employed bee, N random numbers between 1 and N are generated. The feature subset is constructed from these numbers by rounding each one down to an integer and then keeping only the unique values. For example, consider the random numbers {1.45, 1.76, 3.33, 1.01}, where N = 4. Rounding down gives {1, 1, 3, 1}, and extracting the unique numbers gives {1, 3}, which represents the feature subset consisting of the 1st and 3rd features only.
In the second step of the algorithm, for each employed bee, whose total number equals half of the number of food sources, a new source is produced by:

$$v_{ij} = x_{ij} + \phi_{ij}\,(x_{ij} - x_{kj})$$

Where:
ϕ_ij = A uniformly distributed real random number within the range [-1, 1]
k = The index of the solution chosen randomly from the colony (k = int(rand * N) + 1)
j = 1, ..., D
D = The dimension of the problem

After producing v_i, this new solution is compared with the solution x_i and the employed bee exploits the better source. In the third step of the algorithm, an onlooker bee chooses a food source with probability p_i and produces a new source in the selected food source site. As for the employed bee, the better source is chosen for exploitation. The indiscernibility relation is calculated for each feature subset as the objective value f_i. This value has to be maximized. From this objective value, the fitness value is calculated for each bee, following the standard ABC formulation, as:

$$fit_i = \begin{cases} \dfrac{1}{1 + f_i} & \text{if } f_i \ge 0 \\ 1 + |f_i| & \text{if } f_i < 0 \end{cases}$$

The probability is calculated from the fitness values using the following equation:

$$p_i = \frac{fit_i}{\sum_{n} fit_n}$$

where fit_i is the fitness of the solution x_i and the sum is taken over all food sources.
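
The subset construction and the ABC update rules can be sketched in a few functions. This is a hedged illustration: it rounds down, which is what reproduces the worked example {1.45, 1.76, 3.33, 1.01} to {1, 3}; the fitness mapping and the neighbour rule are the standard ABC forms (Karaboga and Basturk, 2008), assumed here to match the method; and all function names are hypothetical.

```python
import math
import random


def subset_from_draws(draws):
    """Round each random number down and keep the unique feature indices."""
    return sorted({math.floor(d) for d in draws})


def random_feature_subset(n_features, rng):
    """Draw N random numbers between 1 and N and turn them into a subset."""
    draws = [rng.uniform(1, n_features) for _ in range(n_features)]
    return subset_from_draws(draws)


def neighbour_source(x_i, x_k, j, rng):
    """Standard ABC move: v_ij = x_ij + phi * (x_ij - x_kj), phi in [-1, 1]."""
    phi = rng.uniform(-1.0, 1.0)
    v = list(x_i)
    v[j] = x_i[j] + phi * (x_i[j] - x_k[j])
    return v


def fitness(f):
    """Standard ABC fitness mapping of an objective value f (assumed form)."""
    return 1.0 / (1.0 + f) if f >= 0 else 1.0 + abs(f)


def selection_probabilities(fits):
    """Probability of each food source: its fitness over the total fitness."""
    total = sum(fits)
    return [fit / total for fit in fits]


print(subset_from_draws([1.45, 1.76, 3.33, 1.01]))  # [1, 3]
```

An onlooker bee would then pick a food source by sampling from `selection_probabilities`, so sources with higher fitness are visited more often.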
After all onlookers are distributed to the sources, the sources are checked to see whether they are to be abandoned. If the number of cycles for which a source cannot be improved is greater than a predetermined limit, the source is considered to be exhausted. The employed bee associated with the exhausted source becomes a scout and makes a random search in the problem domain which, in the standard ABC formulation, is given by the following equation:

$$x_{ij} = x_j^{\min} + \mathrm{rand}(0,1)\,(x_j^{\max} - x_j^{\min})$$

where $x_j^{\min}$ and $x_j^{\max}$ are the lower and upper bounds of dimension j.
The pseudocode of our proposed method is given as:

Improved KNN classification based on genetic algorithm:
In the pattern recognition field, KNN is one of the most important non-parametric algorithms and it is a supervised learning algorithm (Eskandarinia et al., 2010; Lee et al., 2011; Saaid et al., 2009). The classification rules are generated by the training samples themselves without any additional data. The KNN classification algorithm predicts the category of a test sample from the K training samples that are its nearest neighbors and assigns it to the category with the largest category probability. The process by which the KNN algorithm classifies a sample X is:
• Suppose that there are j training categories C_1, C_2, …, C_j and that the total number of training samples is N; after feature reduction, each sample becomes an m-dimensional feature vector
• Represent sample X in the same feature vector form (X_1, X_2, …, X_m) as all training samples
• Calculate the similarities between all training samples and X. Taking the i-th sample d_i = (d_i1, d_i2, …, d_im) as an example, the similarity SIM(X, d_i) is:

$$SIM(X, d_i) = \frac{\sum_{l=1}^{m} X_l\, d_{il}}{\sqrt{\sum_{l=1}^{m} X_l^2}\,\sqrt{\sum_{l=1}^{m} d_{il}^2}}$$

• Choose the k samples with the largest of the N similarities SIM(X, d_i), (i = 1, 2, …, N), and treat them as the KNN collection of X. Then calculate the probability that X belongs to each category with the following formula:

$$P(X, C_j) = \sum_{d_i \in KNN} SIM(X, d_i)\; y(d_i, C_j)$$

where y(d_i, C_j) is a category attribute function, which satisfies:

$$y(d_i, C_j) = \begin{cases} 1 & \text{if } d_i \in C_j \\ 0 & \text{if } d_i \notin C_j \end{cases}$$

• Judge sample X to belong to the category which has the largest P(X, C_j)

In this study, the Genetic Algorithm (GA) is combined with the K-Nearest Neighbor (KNN) algorithm, giving Genetic KNN (GKNN), to overcome the limitations of traditional KNN. In the traditional KNN algorithm, the similarity between the test sample and all training samples is calculated first and the k neighbors with the greatest similarity are taken for classification. In our proposed method, the GA chooses k samples in each iteration and the classification accuracy is calculated as the fitness. The highest accuracy is recorded each time. Thus, the method does not require calculating the similarities between all samples, and there is no need to consider the weight of the category. Genetic Algorithms (GAs) are randomized search and optimization techniques guided by the principles of evolution and natural genetics, with a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes and provide near-optimal solutions for the objective or fitness function of an optimization problem (Asfaw and Saiedi, 2011; Mahi and Izabatene, 2011; Mosavi, 2011; Matondang and Jambak, 2010; Nazif and Lee, 2010; Alfred, 2010; Sarabian and Lee, 2010; Yedjour et al., 2010).
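
The traditional KNN steps listed earlier (similarity, top-k selection, similarity-weighted vote) can be sketched as follows. The sketch assumes cosine similarity for SIM(X, d_i) and uses hypothetical function names and toy data; it illustrates the baseline classifier, not the GKNN variant.

```python
from math import sqrt


def cosine_similarity(x, d):
    """SIM(X, d_i): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, d))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def knn_classify(x, samples, labels, k):
    """Assign x to the category with the largest summed similarity among
    its k most similar training samples."""
    scored = sorted(
        ((cosine_similarity(x, d), y) for d, y in zip(samples, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = {}
    for sim, y in scored:
        # y(d_i, C_j) is 1 only for the sample's own category, so each
        # neighbor adds its similarity to exactly one category score.
        scores[y] = scores.get(y, 0.0) + sim
    return max(scores, key=scores.get)


samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
print(knn_classify([1.0, 0.2], samples, labels, k=3))  # a
```

Computing the similarity to every training sample for every test sample is the cost that GKNN tries to avoid by letting the GA propose the k neighbors directly.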
In GA, the parameters of the search space are encoded in the form of strings (called chromosomes). A collection of such strings is called a population. Initially, a random population is created, which represents different points in the search space. An objective and fitness function is associated with each string that represents the degree of goodness of the string. Based on the principle of survival of the fittest, a few of the strings are selected and each is assigned a number of copies that go into the mating pool. Biologically inspired operators like cross-over and mutation are applied on these strings to yield a new generation of strings. The process of selection, crossover and mutation continues for a fixed number of generations or till a termination condition is satisfied.
GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning and job shop scheduling.

String representation:
Here the chromosomes are encoded with real numbers; the genes in each chromosome represent samples in the training set. Each chromosome has k genes, and each gene has 5 digits giving the vector index. For example, if k = 5, a sample chromosome may look as follows:

00100 10010 00256 01875 00098

Here, 00098 represents the 98th instance, and the fourth gene, 01875, refers to the 1875th instance in the training sample. Once the initial population is generated, the genetic operators can be applied. With these k neighbors, the distance to each sample in the testing set is calculated and the resulting accuracy is stored as the fitness value of the chromosome.
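
The 5-digit gene encoding can be illustrated with a small encode/decode pair. The helper names and the space-separated string layout are assumptions for this sketch.

```python
def encode(indices):
    """Write each training-sample index as a zero-padded 5-digit gene."""
    return " ".join(f"{index:05d}" for index in indices)


def decode(chromosome):
    """Recover the training-sample indices from a chromosome string."""
    return [int(gene) for gene in chromosome.split()]


chromosome = "00100 10010 00256 01875 00098"
print(decode(chromosome))  # [100, 10010, 256, 1875, 98]
```

Decoding gives the k = 5 training instances proposed as neighbors; a practical implementation would also need to check that every decoded index is within the size of the training set.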

Reproduction (selection):
The selection process selects chromosomes from the mating pool directed by the survival of the fittest concept of natural genetic systems. In the proportional selection strategy adopted in this article, a chromosome is assigned a number of copies, which is proportional to its fitness in the population, that go into the mating pool for further genetic operations. Roulette wheel selection is one common technique that implements the proportional selection strategy.
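
Roulette wheel selection can be sketched as below. The function name and the use of a seeded `random.Random` instance are choices made for this illustration; fitness values are assumed to be non-negative with a positive sum.

```python
import random


def roulette_select(population, fitnesses, n, rng):
    """Pick n chromosomes, each with probability proportional to its fitness."""
    total = sum(fitnesses)
    mating_pool = []
    for _ in range(n):
        spin = rng.uniform(0, total)
        running = 0.0
        for chromosome, fit in zip(population, fitnesses):
            running += fit
            if spin < running:
                mating_pool.append(chromosome)
                break
        else:
            # Guard against floating-point rounding at the upper end.
            mating_pool.append(population[-1])
    return mating_pool


rng = random.Random(42)
pool = roulette_select(["c1", "c2", "c3"], [0.0, 1.0, 0.0], n=4, rng=rng)
print(pool)  # ['c2', 'c2', 'c2', 'c2']
```

In the example only "c2" has non-zero fitness, so it fills the whole mating pool; with mixed fitness values the expected number of copies of each chromosome is proportional to its share of the total fitness.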

Crossover:
Crossover is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes. In this study, single-point crossover with a fixed crossover probability p_c is used. For chromosomes of length l, a random integer, called the crossover point, is generated in the range [1, l-1]. The portions of the chromosomes lying to the right of the crossover point are exchanged to produce two offspring.
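
A minimal sketch of single-point crossover follows, treating a chromosome as a list of genes. The function name and the list representation are assumptions for illustration.

```python
import random


def single_point_crossover(parent1, parent2, p_c, rng):
    """With probability p_c, swap the tails of two equal-length parents."""
    length = len(parent1)
    if length < 2 or rng.random() >= p_c:
        return list(parent1), list(parent2)  # no crossover: copy the parents
    point = rng.randint(1, length - 1)  # crossover point in [1, l-1]
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


rng = random.Random(7)
p1 = ["00100", "10010", "00256", "01875", "00098"]
p2 = ["00011", "00022", "00033", "00044", "00055"]
c1, c2 = single_point_crossover(p1, p2, p_c=1.0, rng=rng)
```

Because the crossover point lies in [1, l-1], each child always keeps at least its first gene from one parent and its last gene from the other.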

Mutation:
Each chromosome undergoes mutation with a fixed probability p_m. For a binary representation of chromosomes, a bit position (or gene) is mutated by simply flipping its value. Since we are considering real numbers in this study, a random position in the chromosome is chosen and replaced by a random digit between 0 and 9.
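
The digit-replacement mutation can be sketched on the space-separated chromosome string. The function name is hypothetical, and the sketch assumes that only digit positions (not the separators between genes) are eligible for mutation.

```python
import random


def mutate(chromosome, p_m, rng):
    """With probability p_m, replace one random digit by a random digit 0-9."""
    if rng.random() >= p_m:
        return chromosome
    digit_positions = [i for i, ch in enumerate(chromosome) if ch.isdigit()]
    position = rng.choice(digit_positions)
    new_digit = str(rng.randint(0, 9))
    return chromosome[:position] + new_digit + chromosome[position + 1:]


rng = random.Random(5)
original = "00100 10010 00256 01875 00098"
mutated = mutate(original, p_m=1.0, rng=rng)
```

Since the new digit is drawn independently, it may equal the old one, and a mutated gene may point outside the training set; a practical implementation would need to repair or reject such genes.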
After the genetic operators are applied, the local maximum fitness value is calculated and compared with the global maximum. If the local maximum is greater than the global maximum, the global maximum is set to the local maximum and the next iteration continues with the new population. The cluster points are repositioned to correspond to the chromosome having the global maximum. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations. The following section shows that our refinement algorithm improves the cluster quality. The algorithm is given as.

RESULTS AND DISCUSSION
The performance of the reduct approaches discussed in this study has been tested with 4 different medical datasets, downloaded from the UCI machine learning repository. Table 1 shows the details of the datasets used in this study.
The advantage of our proposed approach is that it does not check all the possible values for all the instances. A value may be accepted at the first attempt; once the instance is correctly clustered, there is no need to check the remaining possible values. Thus the runtime can be considerably reduced.
Once the values are predicted for the missing attributes, the reduced feature set is obtained from a novel method based on rough set theory hybridized with Bee Colony Optimization (BCO), as discussed in our earlier work. Table 2 shows the reduced feature sets. Then the Genetic KNN (GKNN) classifier, which combines the Genetic Algorithm (GA) with the k-Nearest Neighbour (KNN) algorithm, is employed to analyze the classification performance. Table 3 shows the comparison of the classification accuracy of our proposed approach with the existing methods. The results indicate that the k-means clustering approach predicts the missing attribute values better than the other approaches compared.

CONCLUSION
Missing attribute values are very common in real-world datasets. Several methods have been proposed to predict these missing attribute values, but none can be said to predict better than all the others. In this study, we have proposed a novel approach for predicting missing attribute values using simple k-means clustering. The missing attributes are assigned one possible value at a time and the dataset is clustered using k-means to check whether the instance is clustered in the correct class; if so, the assigned value is made permanent. Otherwise, the clustering is performed with the next possible value. If no possible value puts the instance in the correct cluster, the best-fit value is assigned to the missing attribute based on the entropy measurement. This approach was implemented for a number of medical datasets with missing attribute values. After prediction, the reduced feature set is constructed using rough set theory hybridized with BCO and the classification performance is studied with the Genetic KNN classifier. The results show that k-means clustering predicts the missing attribute values better than the other approaches compared.