Discretization of Numerical Data for Relational Data with One-to-Many Relations



INTRODUCTION
Most multi-relational data mining deals with nominal or symbolic values, often in the context of structural or graph-based mining (e.g., ILP) [3]. Much less attention has been given to the discretization of continuous attributes in a relational database, where the one-to-many association between records has to be taken into account. Continuous attributes are seldom used in multi-relational data mining due to the difficulties in handling them, particularly when a one-to-many association exists in a relational database.
Handling numerical data stored in a relational database differs from handling numerical data stored in a single table, due to the multiple occurrences of an individual record in the non-target table and the non-determinate relations between tables.
Firstly, most pre-processing steps, such as discretization and aggregation, that process attributes stored in a relational database need to use the structure (schema) of the database and to find out how attributes stored in the non-target and target tables are related to each other. One may perform an aggregation operation on attributes that have numerical multi-set values and then discretize the aggregated value. However, this is not an easy task, as the non-target table may contain both categorical and numerical attributes.
Next, the task of discretizing continuous attributes is more complex when the occurrences of multiple instances in the non-target table are taken into consideration, since most traditional discretization methods deal with a single flat table and quite often ignore the one-to-many relationships problem.
And finally, using a class-based discretization method, such as entropy-based discretization [17], is not a straightforward task in a relational database, as it needs to be done in a single table. Most traditional data mining methods deal only with a single table, where all attributes are available, and discretize columns that contain aggregated continuous numbers into nominal values. In a relational database, multiple records with non-aggregated numerical attributes are stored in the non-target table, separately from the target table, and these records are usually associated with a single individual stored in the target table. As a result, discretizing continuous attributes in the non-target table based on the class information requires the user to consider the structure of the relational database. Thus, numbers in relational databases are often discretized, after considering the schema of the database, in order to reduce the continuous domains to more manageable symbolic domains of low cardinality, and the loss of precision is assumed to be acceptable.

Data transformation using Dynamic Aggregation of Relational Attributes (DARA):
The DARA algorithm is designed to transform the data representation of a relational database into a vector space model, such that records stored in the non-target table can be summarized to characterize the related records stored in the target table. In a relational database, a single record, R_i, stored in the target table can be associated with other records stored in the non-target table, as shown in Fig. 1. Let R denote a set of m records stored in the target table and let S denote a set of n records (T_1, T_2, T_3, ..., T_n) stored in the non-target table. Let S_i be a subset of S, S_i ⊆ S, associated through a foreign key with a single record R_a stored in the target table, where R_a ∈ R. Thus, the association of these records can be described as R_a ⇐ S_i. In this case, we have a single record stored in the target table that is associated with multiple records stored in the non-target table. Records stored in the non-target table that correspond to a particular record stored in the target table can be represented as vectors of patterns. As a result, based on the vector space model [4], a unique record stored in the non-target table can be represented as a vector of patterns. In other words, a particular record stored in the target table that is related to several records stored in the non-target table can be represented as a bag of patterns, i.e., by the patterns it contains and their frequency, regardless of their order. The bag of patterns is defined as follows: Definition: In a bag-of-patterns representation, each target record is represented by the set of patterns derived from its associated non-target records, together with their frequencies.
This definition follows the notion of an individual-centered representation defined by Lachiche and Flach [9], where the data is described as a collection of individuals and the induced rules generalize over the individuals, mapping them to a class. For instance, individual-centered domains include classification problems in molecular biology where the individuals are molecules.
In our approach, an individual is represented as a bag of patterns. In the DARA algorithm, these patterns are encoded into binary numbers. The process of encoding these patterns into binary numbers depends on the number of attributes that exist in the non-target table. For example, there are two different cases when encoding patterns for the data stored in the non-target table. In the first case (Case I), a non-target table may have a single attribute. In this case, the DARA algorithm transforms the representation of the data stored in a relational database without constructing any new feature to build the (n×p) TF-IDF weighted frequency matrix [4], as only one attribute exists in the non-target table. In the other case (Case II), a non-target table may have multiple attributes. In this case, DARA may construct new features, which results in a richer representation of each target record. The method used to encode the patterns derived from these attributes has some influence on the final results of the modeling task [11]. For each encoded pattern, the counter for the corresponding pattern in the bag is incremented, or the pattern is added to the bag of patterns if it is not already in the bag. The resulting bag of patterns, shown in Fig. 2, can be used to describe the characteristics of an individual record. In Fig. 2, the first digit "2" preceding the binary numbers indicates the index of the attribute to which the binary numbers belong. Since only one attribute exists in the dataset, all encoded patterns produced belong to attribute index "2".
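As a sketch of Case I, the bag-of-patterns construction for a single-attribute non-target table can be outlined in Python. The attribute-index prefix and the fixed-width binary encoding shown here are illustrative assumptions, not necessarily the exact scheme used by DARA:

```python
from collections import Counter

def encode_pattern(attr_index, value, domain):
    """Encode a nominal value as a fixed-width binary string, prefixed
    by its attribute index (hypothetical encoding for illustration)."""
    width = max(1, (len(domain) - 1).bit_length())
    return f"{attr_index}:{domain.index(value):0{width}b}"

def bag_of_patterns(records, domain, attr_index=2):
    """Summarize all non-target records of one individual as a bag
    mapping each encoded pattern to its frequency (Case I)."""
    bag = Counter()
    for value in records:
        bag[encode_pattern(attr_index, value, domain)] += 1
    return bag

# Three non-target records for one individual, over a 3-value domain.
bag = bag_of_patterns(["red", "blue", "red"], ["red", "green", "blue"])
```

The resulting bag can then be turned into one row of the (n×p) TF-IDF weighted frequency matrix, one column per distinct pattern.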

Case II: A non-target table with multiple attributes:
Case II assumes that more than one attribute describes the contents of the non-target table associated with the target table. All continuous values of the attributes are discretized, and the number of bins is taken as the cardinality of the attribute domain. After encoding the patterns as binary numbers, the algorithm determines a subset of the attributes to be used to construct a new feature.
Here is an example of a simple algorithm to construct features, without using feature scoring, to generate the patterns that represent the input for the DARA algorithm. For each record stored in the non-target table, we concatenate p of the columns' values, where p is less than or equal to the total number of attributes. For example, let F = (F_1, F_2, F_3, ..., F_k) denote the k field columns or attributes in the non-target table. Let dom(F_i) = (F_i,1, F_i,2, F_i,3, ..., F_i,n) denote the domain of attribute F_i, with n different values. So, one may have an instance of a record stored in the non-target table with the values (F_1,a, F_2,b, F_3,c, ...). Table 1 shows the list of patterns produced with different values of p. It is not natural to have concatenated features like F_1,aF_2,b but not F_1,aF_3,c when p = 2, since the attributes do not have a natural order. However, a genetic algorithm can be applied to solve this problem [10].
For each record, a bag of patterns is maintained to keep track of the patterns encountered and their frequencies. For each new pattern encoded, if the pattern already exists in the bag, the counter for the corresponding pattern is increased; otherwise, the pattern is added to the bag and its counter is set to 1. The resulting bag of patterns can be used to describe the characteristics of the record associated with it.
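A minimal sketch of this feature-construction step, enumerating every p-subset of columns rather than using the genetic-algorithm search mentioned above (the function names are hypothetical):

```python
from collections import Counter
from itertools import combinations

def construct_patterns(record, p):
    """Concatenate the values of every p-subset of columns into a
    single pattern; since the attributes have no natural order, all
    subsets are enumerated here instead of an arbitrary prefix."""
    return ["".join(record[i] for i in cols)
            for cols in combinations(range(len(record)), p)]

def bag_for_individual(records, p):
    """Bag of constructed patterns over all non-target records that
    belong to one target record."""
    bag = Counter()
    for rec in records:
        for pat in construct_patterns(rec, p):
            bag[pat] += 1  # increment, or insert with count 1
    return bag

# Two non-target records with k = 3 attributes, concatenating p = 2.
bag = bag_for_individual([("a", "b", "c"), ("a", "x", "c")], p=2)
```

For large k, enumerating all p-subsets is exponential, which is why the paper points to a genetic algorithm for selecting attribute subsets [10].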
In short, the encoding process described here transforms data stored in the non-target table that has many-to-one relations with the target table, to the representation of data in a vector-space model [4] . With this representation, the data can be conveniently clustered by using the hierarchical or partitioning clustering technique, as a means of summarizing them.

Types of discretization:
The motivation for the discretization of continuous features is the need to obtain higher accuracy rates when handling data with high-cardinality attributes using the DARA algorithm, although this operation may affect the speed of any learning procedure that subsequently uses the discretized data.
A few common methods are used to discretize continuous attributes, including the equal-width, equal-height, equal-weight and entropy-based discretization methods. A new method of discretization, called entropy-instance-based discretization, will also be introduced later. In the DARA algorithm, all attributes with continuous values are discretized before they are transformed into the vector space data representation.
Discretization methods can be categorized along three axes [6]: (a) supervised versus unsupervised, (b) global versus local and (c) static versus dynamic. Supervised methods make use of the class label when partitioning the continuous features. On the other hand, unsupervised discretization methods do not require the class information to discretize continuous attributes. Next, the distinction between global and local methods is based on the stage at which the discretization takes place. Global methods discretize features prior to induction. In contrast, local methods discretize features during the induction process.
Given k as the number of intervals or bins, some discretization methods discretize features independently of the other features; this is called static discretization. On the other hand, dynamic discretization methods search the space of possible k values for all features at the same time, which allows inter-dependencies in feature discretization to be captured. In this study, the global discretization method is used to discretize continuous features. In addition, since there is no significant improvement in employing dynamic discretization over static methods [13], we employ the static method when discretizing the continuous features in this study.

Unsupervised discretization methods:
Equal-width discretization method: The simplest discretization method is called equal-width interval discretization, and it has often been applied as a means of producing nominal values from continuous ones. This approach divides the range of observed values for a feature into k equally sized bins, where k is a parameter provided by the user. The process involves sorting the observed values of a continuous feature and finding the minimum, V_min, and maximum, V_max, values. The interval size is computed as:

interval = (V_max - V_min) / k (Eq. 1)

and the boundaries can then be constructed as:

boundary_i = V_min + i × interval (Eq. 2)

where i = 1, ..., k-1. This type of discretization does not depend on the multi-relational structure of the data. However, this method is sensitive to outliers that may drastically skew the range [6].
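Eq. 1 and 2 can be sketched directly in Python (a toy illustration, not the DARA implementation):

```python
def equal_width_boundaries(values, k):
    """Equal-width binning: k-1 interior boundaries spaced by
    (V_max - V_min) / k, as in Eq. 1 and 2."""
    v_min, v_max = min(values), max(values)
    interval = (v_max - v_min) / k  # Eq. 1
    return [v_min + i * interval for i in range(1, k)]  # Eq. 2

# Range [0, 10] split into k = 4 equal-width bins.
b = equal_width_boundaries([0.0, 2.0, 5.0, 10.0], k=4)
```

Note how a single outlier (say, 1000.0 added to the list) would stretch the range and push almost all values into the first bin, which is the sensitivity mentioned above.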
Equal-height discretization method: Another simple discretization method, called equal-height interval binning, discretizes the data so that each bin has approximately the same number of samples. This method involves sorting the observed values together with the record IDs. If |R| refers to the number of records and V[1..|R|] is the array that stores the sorted values, then the boundaries can be constructed as:

boundary_i = V[(i × |R|) / k] (Eq. 3)

where i = 1, ..., k-1. The result is a collection of k bins of roughly equal size. This algorithm is class-blind and does not take into consideration the structure of the database, especially the one-to-many association problem. Since unsupervised methods do not make use of the class information in finding the interval boundaries, classification information can be lost as a result of placing values that are strongly associated with different classes in the same interval [6].
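The corresponding equal-height sketch, again a toy illustration under the assumption that boundaries are read off the sorted array at positions that are multiples of |R|/k:

```python
def equal_height_boundaries(values, k):
    """Equal-height (equal-frequency) binning: sort the values and
    place boundaries so each bin holds about |R|/k samples (Eq. 3)."""
    v = sorted(values)
    n = len(v)
    return [v[(i * n) // k] for i in range(1, k)]

# |R| = 8 values split into k = 4 bins of two samples each.
b = equal_height_boundaries([7, 1, 5, 3, 9, 11, 2, 8], k=4)
```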
Equal-weight discretization method: Another unsupervised discretization method, called equal-weight interval binning, was introduced by Knobbe and Ho [1]. The equal-weight discretization method considers not only the distribution of the numeric values present, but also the groups they appear in. This method involves an idea proposed by Van Laer and De Raedt [16]. It is observed that larger groups have a bigger influence on the choice of boundaries because they have more contributing numeric values. In equal-weight interval binning, numeric values are weighted with the inverse of the size of the group they belong to, and this weight is defined as:

w(v) = 1 / |group_v| (Eq. 4)

where v is the value being considered and |group_v| is the size of the group that v belongs to. Instead of producing bins of equal size, the boundaries are computed to obtain bins of equal weight. The algorithm starts by computing the size of each group; it then moves through the sorted array of values, keeping a running sum of weights, w_t. Whenever w_t reaches a target boundary, (number of groups)/bins, the current numeric value is added as one of the boundaries, and the process is repeated k-1 times (k is the number of bins).
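A minimal sketch of this procedure, assuming the input is a list of (value, group) pairs and that a boundary is placed each time the running weight reaches the next multiple of (number of groups)/k:

```python
from collections import Counter

def equal_weight_boundaries(pairs, k):
    """Equal-weight binning: each (value, group) pair carries weight
    1/|group| (Eq. 4); a boundary is placed whenever the running
    weight sum reaches the next multiple of (number of groups)/k."""
    group_size = Counter(g for _, g in pairs)
    target = len(group_size) / k
    boundaries, running = [], 0.0
    for value, group in sorted(pairs):
        running += 1.0 / group_size[group]  # inverse-size weight
        if len(boundaries) < k - 1 and running >= target * (len(boundaries) + 1):
            boundaries.append(value)
    return boundaries

# Four groups of two values each; with k = 4, each bin gets weight 1.0.
b = equal_weight_boundaries(
    [(1, "A"), (2, "A"), (3, "B"), (4, "B"),
     (5, "C"), (6, "C"), (7, "D"), (8, "D")], k=4)
```

Because every group contributes a total weight of exactly 1 regardless of its size, a large group with many numeric values no longer dominates the boundary placement.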

Supervised discretization methods:
Entropy-based discretization method: One of the supervised discretization methods, introduced by Fayyad and Irani, is entropy-based discretization [17]. A lot of significant research in entropy-based discretization has been carried out, and an early comparison of entropy-based methods for the discretization of continuous features and multi-interval discretization methods can be found in the work conducted by Kohavi and Sahami [13]. Algorithms such as C4.5 try to find a binary cut for each attribute and use a minimal entropy heuristic for the discretization of continuous attributes. The algorithm uses the class information entropy to select binary boundaries for discretization. In entropy-based discretization, given a set of instances S, a feature A and a partition boundary T, the class entropy of a subset S is:

Ent(S) = -Σ_{i=1..c} p(C_i, S) log2 p(C_i, S) (Eq. 5)

where p(C_i, S) is the probability of observing the ith class randomly in the subset S, and the class information entropy of the partition induced by T is:

E(A, T; S) = (|S_1|/|S|) Ent(S_1) + (|S_2|/|S|) Ent(S_2) (Eq. 6)

This method can be applied recursively to both partitions induced by T until some stopping condition is achieved, thus creating multiple intervals of feature A. So, for k bins, the class information entropy for multi-interval entropy-based discretization is:

E(A; S) = Σ_{j=1..k} (|S_j|/|S|) Ent(S_j) (Eq. 7)

The stopping condition proposed by Fayyad and Irani [17] is based on the Minimum Description Length (MDL) principle [2]. The stopping condition prescribes accepting a partition induced by T if and only if the cost of encoding the partition and the classes of the instances in the intervals induced by T is less than the cost of encoding the classes of the instances before the split, i.e., the partition is accepted if and only if:

Gain(A, T; S) > log2(N - 1)/N + Δ(A, T; S)/N (Eq. 8)

where N is the number of instances in S,

Gain(A, T; S) = Ent(S) - E(A, T; S) (Eq. 9)

Δ(A, T; S) = log2(3^c - 2) - [c Ent(S) - c_1 Ent(S_1) - c_2 Ent(S_2)] (Eq. 10)

and in Eq. 10, c, c_1 and c_2 are the number of distinct classes present in S, S_1 and S_2, respectively.
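The binary-cut selection at the heart of this method can be sketched as follows: scan the N-1 candidate boundaries of a sorted attribute and keep the one that minimizes the class information entropy of the induced partition (a toy version without the recursive splitting or the MDL stopping test):

```python
import math

def class_entropy(labels):
    """Ent(S) = -sum_i p(C_i, S) * log2 p(C_i, S)."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((m / n) * math.log2(m / n) for m in counts.values())

def best_binary_cut(values, labels):
    """Return the boundary T minimizing
    E(A, T; S) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2),
    taking T as the midpoint between adjacent sorted values."""
    data = sorted(zip(values, labels))
    n = len(data)
    best_t, best_e = None, float("inf")
    for i in range(1, n):  # the N-1 candidate boundaries
        left = [c for _, c in data[:i]]
        right = [c for _, c in data[i:]]
        e = (i / n) * class_entropy(left) + ((n - i) / n) * class_entropy(right)
        if e < best_e:
            best_t, best_e = (data[i - 1][0] + data[i][0]) / 2, e
    return best_t, best_e

# Two clearly separated classes: the best cut sits between 2 and 8.
t, e = best_binary_cut([1, 2, 8, 9], ["a", "a", "b", "b"])
```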

Entropy-instance-based discretization method:
This study introduces a new method of discretizing continuous attributes that takes into account the one-to-many association between records stored in the target and non-target tables. In this study, the entropy-based multi-interval discretization method introduced by Fayyad and Irani [17] is modified. In the proposed entropy-instance-based discretization method, besides the class information entropy, another measure that uses individual information entropy is added to select multi-interval boundaries for the continuous attributes. Given n individuals taken from the target table, the individual information entropy of a subset S is:

Ind(S) = -Σ_{i=1..n} p(I_i, S) log2 p(I_i, S) (Eq. 11)

where p(I_i, S) is the probability that a random record associated with individual i from this table is in the subset S. This measure is needed because, in a multi-relational environment in which an entity may have a one-to-many relationship with another entity, an object stored in the target table may have more than one occurrence of its instances stored in the non-target table. For this reason, the total individual information entropy over all partitions is defined as:

Ind(A; S) = Σ_{j=1..k} (|S_j|/|S|) Ind(S_j) (Eq. 12)

In other words, entropy-instance-based interval binning considers the distribution of the numeric values present and the groups they appear in, and is also based on all occurrences of each individual record. The individual information entropy (Eq. 12) is added to the existing entropy-based discretization formula in order to obtain better partitions in a multi-relational setting.
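Under the stated assumption that the individual information entropy is simply added to the class information entropy, a toy scoring function for a candidate partition can be sketched as follows; the name `eib_score` and the equal weighting of the two terms are assumptions for illustration:

```python
import math

def entropy(items):
    """Generic Shannon entropy of a multiset; used both for class
    labels (Eq. 5) and for individual IDs (Eq. 11)."""
    n = len(items)
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1
    return -sum((m / n) * math.log2(m / n) for m in counts.values())

def eib_score(partitions):
    """Sketch of the Entropy-Instance-Based criterion: the weighted
    class entropy (Eq. 7) plus the weighted individual entropy
    (Eq. 12) over all partitions. `partitions` is a list of lists of
    (class, individual_id) pairs; lower scores indicate better cuts."""
    n = sum(len(p) for p in partitions)
    class_term = sum(len(p) / n * entropy([c for c, _ in p]) for p in partitions)
    indiv_term = sum(len(p) / n * entropy([i for _, i in p]) for p in partitions)
    return class_term + indiv_term

# A perfect split: each partition is pure in both class and individual.
score = eib_score([[("a", 1), ("a", 1)], [("b", 2), ("b", 2)]])
```

A cut that scatters one individual's records across partitions raises the second term even when the class term is unchanged, which is exactly the one-to-many effect the method is designed to penalize.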
Feature construction for data summarization: These experiments are designed to investigate:

• The effects of taking into account one-to-many relationships when discretizing continuous attributes in a multi-relational environment
• Whether the choice of clustering technique has any impact on the data summarization results

In this experimental study, the discretization methods described previously are implemented in the DARA algorithm [11,12], in conjunction with the C4.5 classifier (J48 in WEKA) [5] as an induction algorithm that is run on the data discretized and transformed by the DARA algorithm. Then, the effectiveness of each discretization method with respect to C4.5 [7] is evaluated. Two datasets are chosen: the well-known Mutagenesis dataset [3] and the Hepatitis dataset from the PKDD 2005 Discovery Challenge [15].
There are four different values used for the number of bins, b = 2, 4, 6, 8, to evaluate the different methods of discretization. For each dataset, the data summarization process is performed using both the Hierarchical (H) and Partitional (P) clustering techniques. After summarizing the datasets using the DARA algorithm, the effectiveness of each discretization method with respect to C4.5 [7] is evaluated using 10-fold cross-validation. Tables 2 and 3 provide a detailed overview of the accuracy estimates from the 10-fold cross-validation performance of C4.5 for different numbers of bins, b, tested on the Mutagenesis datasets (B1, B2, B3) and Hepatitis datasets (H1, H2, H3), for each method of discretization. In these experiments, all five methods of discretization are evaluated, namely Equal-Width (EWD), Equal-Height (EH), Equal-Weight (EWG), Entropy-Based (EB) and Entropy-Instance-Based (EIB).

RESULTS
Based on the experimental results, in most cases the entropy-instance-based discretization method produced better data summarization results, leading to better predictive accuracy for C4.5, compared to the other discretization methods. There is one exception, the Mutagenesis dataset B3, where the improvement produced by the entropy-instance-based discretization method is not as obvious. Tables 2 and 3 also show the behavior of each discretization method with different numbers of bins, b, on the Mutagenesis and Hepatitis datasets. When b is large (b = 8), the entropy-instance-based discretization method produced better data summarization results than the other discretization methods on the mutagenesis dataset. For the mutagenesis dataset, the optimal number of bins is relatively low: between 2 and 4. In contrast, for the hepatitis dataset, the entropy-instance-based discretization method produced better data summarization results for all values of b, compared to the other discretization methods. The optimal number of bins for the hepatitis dataset is less clear. Fig. 3 also shows that the Entropy-Instance-Based (EIB) discretization method produced a higher average performance accuracy (%), for both the hierarchical and partitional clustering techniques, compared to the Entropy-Based (EB), Equal-Height (EH), Equal-Weight (EWG) and Equal-Width (EWD) discretization methods, for datasets H1, H2, H3 and B3. However, for datasets B1 and B2, the equal-width discretization method produced results comparable to those of the entropy-instance-based discretization method.
The results of a paired t-test (p = 0.05), indicating the significant improvement of each discretization method over the other methods for the mutagenesis and hepatitis datasets, are also collected. Since there are three varieties of datasets for each of the hepatitis and mutagenesis databases, with four different values for the number of bins (b = 2, 4, 6, 8), there are 24 cases in which each discretization method is evaluated for each database. Tables 4 and 5 show the number of cases in which the discretization method in the row shows significant improvement over the discretization methods in the columns. For the hepatitis dataset, Table 4 shows that the entropy-instance-based discretization method has the highest number of cases in which it shows significant improvement over the rest of the discretization methods. For the mutagenesis dataset, Table 5 shows that both the equal-width and entropy-instance-based discretization methods have a higher number of cases in which they show significant improvement over the other discretization methods. Based on Tables 4 and 5, the percentages of significant improvement achieved by each discretization method are computed in Tables 6 and 7. For the hepatitis dataset, the entropy-instance-based discretization method has the highest average percentage of significant improvement. In contrast, for the mutagenesis dataset, both the equal-width interval discretization and the Entropy-Instance-Based (EIB) discretization methods show a high average percentage of ties (no significant improvement). However, both methods show a reasonably high average percentage of significant improvement over the other discretization methods, with EIB having the lowest percentage of losses.

DISCUSSION
In general, based on these experiments, we can conclude that the Entropy-Instance-Based (EIB) discretization method helps one to achieve higher accuracy. This should come as no surprise, as EIB is more precise in choosing the optimal numeric cut points. In other words, EIB splits the data better, based on both the class information entropy and the individual information entropy. As a result, each object in the target table can be described more accurately, since each object possesses more consistent patterns used for clustering.
It is also found that the partitional clustering technique often performs much better compared to the hierarchical clustering technique in summarizing data with multiple occurrences stored in the non-target relation. Fig. 3 shows the comparison of the average performance accuracy (%) for the hierarchical and partitional clustering techniques on both the mutagenesis and hepatitis datasets.
In clustering, the frequency of patterns is used to distinguish records of different classes. Most records contain only a subset of all the patterns used to cluster them, and any two records may share many of the same patterns. As a result, two records can often be nearest neighbors without belonging to the same class. When the nearest neighbors of a record are of different classes, a hierarchical clustering technique will often put records of different classes in the same cluster, even at the earliest stages of the clustering process. In cases where nearest neighbors are unreliable, a partitioning clustering technique (such as K-means) that relies on more global properties [14] is needed. Because computing the cosine similarity of a record to a cluster centroid is the same as computing the average similarity of the record to all of the cluster's records [8], the partitioning clustering technique implicitly makes use of such a global property. This explains why the partitioning clustering technique does better than the hierarchical clustering technique in this categorical domain, although this is not the case in some other domains.
One of the main problems with the Entropy-Based and Entropy-Instance-Based discretization criteria is that they are relatively expensive. For instance, even for 2 bins (k = 2), Eq. 7 and 12 must be evaluated N-1 times for each continuous attribute, where N is the number of attribute values. Therefore, one may use a genetic algorithm-based discretization [11] in order to obtain a multi-interval discretization for continuous attributes in a very large database using the Entropy-Based or Entropy-Instance-Based methods.

CONCLUSION
This study has revealed, through experiments, that the entropy-instance-based discretization method, implemented in the DARA algorithm, helps one to achieve higher accuracy. The entropy-instance-based discretization method is recommended for the discretization of attribute values in multi-relational datasets, in which the individual information entropy can be used to improve the discretization process, as has been shown here. However, when the dataset is too large, one may apply a genetic algorithm-based approach for the entropy-instance-based discretization method to find the best partitions. It is also found, from these experiments, that the partitional clustering technique produced better performance accuracy than the hierarchical clustering technique.