Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets



INTRODUCTION
Association rule mining finds interesting associative or correlative relationships among a large set of data items. The problem was originally formulated in the context of supermarket transaction data.
This market basket data consists of the transactions made by each customer, and each transaction contains the items bought by that customer. The goal is to see whether the occurrence of certain items in a transaction can be used to deduce the occurrence of other items, or in other words, to find associative relationships between items. If such interesting relationships are found, they can be put to various profitable uses such as shelf management, inventory management, etc. Thus association rules were born [1].
Let I = {I_1, I_2, ..., I_m} be a set of items. Let D be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called TID (transaction identifier).
Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. This is taken to be the probability P(A ∪ B).
The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is:

support(A => B) = P(A ∪ B)
confidence(A => B) = P(B|A)
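As a small worked example of these two measures, the following sketch (using a hypothetical four-transaction basket list) computes support and confidence directly from their probability definitions.

```python
# Minimal sketch: support and confidence of a rule A => B, computed from a
# small, hypothetical transaction database (items and counts are made up).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(A, B, db):
    """P(B|A): support of A union B divided by support of A."""
    return support(A | B, db) / support(A, db)

A, B = {"bread"}, {"butter"}
print(support(A | B, transactions))    # support(A => B) = P(A ∪ B) = 0.5
print(confidence(A, B, transactions))  # confidence(A => B) = P(B|A) ≈ 0.67
```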
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. An itemset that contains k items is a k-itemset; the set {bread, butter} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known as the frequency, or support count. An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in D. If an itemset satisfies minimum support, it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by L_k.

Association rule mining is a two-step process: find all frequent itemsets, then generate strong association rules from the frequent itemsets. In this study, we concentrate on the most time-consuming step, which is the discovery of frequent itemsets. The first algorithm that handled the problem of frequent itemset generation was the Apriori algorithm [2]. This algorithm used a very fundamental property of itemset support: an itemset of size k can meet the minimum level of support only if all of its subsets also meet the minimum level of support. This property is used to systematically prune the search space of desired itemsets by increasing the length of the itemsets being discovered. In iteration k, all candidate k-itemsets are formed such that all of their (k-1)-subsets are frequent. The numbers of occurrences of these candidates are then counted in the transaction database, with efficient data structures used to perform fast counting. Since its conception, many other algorithms [3][4][5][6][7][8][9][10] have emerged that improve upon the runtime, I/O and scalability performance of the Apriori algorithm by various efficient means of pruning the itemset search space and counting the candidate occurrences in large databases.
We assume that the database is a transactional database with high data skewness. The database consists of a huge number of transaction records, each with a transaction identifier (TID) and a set of data items. Data mining in such databases requires substantial processing power, and a parallel system is a possible solution. This observation motivates us to study efficient parallel algorithms for mining association rules in large databases. The database is partitioned 'horizontally' (i.e., grouped by transactions), and each partition is generated by using stratified sampling to select a sample of transactions. These partitions are allocated to the processors of sites in a distributed system that communicates via a fast network. It is well known that the major cost of mining association rules is the computation of the set of large itemsets (i.e., frequently occurring sets of items) in the database. An itemset (a set of items) is large if the percentage of transactions that contain all of its items is greater than a given threshold.

Sequential Mining of Association Rules:
Apriori Algorithm: The Apriori algorithm consists of a number of passes. Initially, F_1 contains all the items (i.e., itemsets of size one) that satisfy the minimum support requirement. During pass k, the algorithm finds the set of frequent itemsets F_k of size k that satisfy the minimum support requirement. The algorithm terminates when F_k is empty. In each pass, the algorithm first generates C_k, the candidate itemsets of size k. The function apriori_gen(F_{k-1}) constructs C_k by extending frequent itemsets of size k-1; this ensures that all the size-(k-1) subsets of a new candidate itemset are in F_{k-1}. Once the candidate itemsets are found, their frequencies are computed by counting how many transactions contain them.
Finally, F_k is generated by pruning C_k to eliminate itemsets with frequencies smaller than the minimum support. The union of the frequent itemsets, ∪ F_k, is the set of frequent itemsets from which we generate association rules.
Computing the counts of the candidate itemsets is the most computationally expensive step of the algorithm.
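The pass structure just described can be summarized in the short Python sketch below. It is a minimal illustration, not the paper's implementation: apriori_gen is written as a simple join-and-prune over frozensets, and plain per-transaction counting stands in for the efficient counting structures mentioned above.

```python
from itertools import combinations

def apriori_gen(F_prev, k):
    """Join frequent (k-1)-itemsets to form candidate k-itemsets, pruning any
    candidate that has an infrequent (k-1)-subset."""
    F_prev = set(F_prev)
    candidates = set()
    for a in F_prev:
        for b in F_prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in F_prev for sub in combinations(union, k - 1)
            ):
                candidates.add(frozenset(union))
    return candidates

def apriori(transactions, min_sup_count):
    """Return the union of all frequent itemsets (min_sup_count is absolute)."""
    transactions = [set(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    F = {c for c in items
         if sum(1 for t in transactions if c <= t) >= min_sup_count}
    all_frequent, k = set(F), 2
    while F:                               # terminate when F_k is empty
        C = apriori_gen(F, k)              # candidate k-itemsets
        counts = {c: sum(1 for t in transactions if c <= t) for c in C}
        F = {c for c, n in counts.items() if n >= min_sup_count}
        all_frequent |= F
        k += 1
    return all_frequent
```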

Parallel and Distributed Mining:
The Count Distribution (CD) algorithm is a simple data-parallel algorithm. The database D is partitioned horizontally into D_1, D_2, ..., D_n and distributed across n processors P_i (1 ≤ i ≤ n). It runs the sequential Apriori algorithm on each partition. The CD algorithm's main advantage is that it does not exchange data tuples between processors; it exchanges only counts. In the first database scan, each processor generates its local candidate itemsets depending on the items present in its local partition. The algorithm obtains global support counts by exchanging local support counts with all other processors. The communication overhead is O(|C| · n) at each phase, where |C| and n are the number of candidate itemsets and the number of processors, respectively.
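The count-exchange step of CD can be sketched as follows. This is only an illustration: the n processors are simulated here as in-memory partitions, and summing the local count dictionaries stands in for the actual all-to-all exchange of count vectors over the network; candidate generation itself is unchanged from the sequential Apriori above.

```python
def count_distribution_pass(partitions, candidates, min_sup_count):
    """One CD pass: every (simulated) processor counts the shared candidate
    set in its own partition; only the count vectors are then combined."""
    # Local counting phase: one count vector per processor, no tuples moved.
    local_counts = [
        {c: sum(1 for t in Di if c <= t) for c in candidates}
        for Di in partitions
    ]
    # Exchange phase: summing the local vectors yields the global counts
    # (an all-reduce over the network in a real shared-nothing system).
    global_counts = {c: sum(lc[c] for lc in local_counts) for c in candidates}
    return {c for c, n in global_counts.items() if n >= min_sup_count}
```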
Researchers proposed the FDM (Fast Distributed Mining) algorithm to mine association rules from distributed datasets partitioned among different sites [8]. At each site, FDM finds the local support counts and prunes locally infrequent itemsets. After completing local pruning, each site broadcasts messages containing all the remaining candidate sets to all other sites to collect their support counts. It then decides whether locally large itemsets are globally large and generates the candidate itemsets from those globally large itemsets.
FDM's main advantage over CD is that it reduces the communication overhead to O(|C_p| · n), where |C_p| and n are the number of large itemsets and the number of sites, respectively. FDM generates fewer candidate itemsets than CD when the number of disjoint candidate itemsets among the various sites is large. However, this can only be achieved when different sites have non-homogeneous data sets. FDM's message optimization technique requires some functions to determine the polling site, which can cause extra computational cost when each site has numerous locally frequent itemsets.
All of the parallel approaches optimize messages to reduce communication costs, but none of the parallel algorithms has considered the problem of a partitioned database with high data skewness. A partitioned database with high data skewness increases computational cost and reduces the workload balance of the processors; in such a situation a parallel algorithm behaves much like a sequential one. Hence, without addressing the problems caused by data skewness, we cannot achieve the advantages of parallelizing a mining algorithm.
The proposed algorithm, WBDM (Workload Balanced Distributed Mining), deals with the problems of data skewness and workload balancing by using a stratified sampling method to partition the database.

Data Skewness and Workload Balance:
A partitioned database has high data skewness if most globally large itemsets are locally large only at a very few partitions. Skewness is low if most globally large itemsets are locally large evenly across the processors. When the clustering of different large itemsets is distributed evenly across the processors, each processor has a similar number of locally large itemsets; this case is characterized as high workload balance. When the clustering of different large itemsets is concentrated on a few processors, some processors have many more locally large itemsets than the others; this is a case of low workload balance. When the clustering of different large itemsets is not distributed evenly across the processors, the pruning effects are reduced significantly and the work of computing the large itemsets is concentrated on a few processors, which is a very troublesome issue in parallel computation.
Table 1 shows an example of high data skewness and low workload balance. The global support threshold is 15 and the local support threshold at each processor is 5. In this case, distributed pruning will generate seven size-2 candidates, namely AB, AC, AD, BC, BD, CD and EF, while CD will have 15 candidates. Thus distributed pruning turns out to be very effective, but most globally large itemsets are locally large only at processor 1; hence the workload balance is low.
Table 2 shows an example of low data skewness and high workload balance. The support counts of items A, B, C, D, E and F are almost equally distributed over the three processors, hence the data skewness is low. On the other hand, the workload balance is high, because the number of locally large itemsets in each processor is almost the same. In this case, both CD and distributed pruning will generate the same 15 candidate sets; however, global pruning can prune away the three candidates AC, AE and CE. Hence FPM still has a 20% improvement over CD in the case of low data skewness and high workload balance.

Stratified Sampling Based Partitions:
With stratified random sampling, the whole database is divided into a number of parts, or 'strata', according to some characteristic. Simple random samples are then selected from each stratum. The same proportion is selected within each stratum, making the sample a proportionate stratified random sample. Stratified sampling can be used as a data partitioning technique: it allows a highly skewed data set to be partitioned into homogeneous portions. Let DB be a database with D transactions. Assume that there are N processors P_1, P_2, ..., P_N in a distributed environment. The database is divided into N strata DB_1, DB_2, ..., DB_N, each with D/N transactions.
Simple random samples S_i,j (j = 1..N), each with D/N^2 transactions, are selected from each stratum DB_i (i = 1..N). Thus N partitions DS_j (j = 1..N), each with homogeneous data of size D_j (= D/N), can be generated as

DS_j = S_1,j ∪ S_2,j ∪ ... ∪ S_N,j, such that S_i,j ∩ S_i,l = ∅ for j ≠ l (and hence DS_j ∩ DS_l = ∅ for j ≠ l).

In this technique the database of size D is divided into N mutually disjoint parts called strata, each of size D/N; a stratified sampling partition is then generated by obtaining a simple random sample of size (D/N)/N (= D/N^2) from each stratum, and the N samples of size D/N^2 together make a partition of size D/N (= (D/N^2) × N) with homogeneous data. This helps to ensure representative partitions, especially when the data are skewed.
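A minimal sketch of this partitioning step is given below. The stratification characteristic is left as a caller-supplied strata_key function, which is our assumption since the text does not fix a particular stratification attribute; each partition then receives one random, disjoint slice of every stratum.

```python
import random

def stratified_partitions(db, N, strata_key, seed=0):
    """Divide db into strata by strata_key, draw N disjoint random samples
    from every stratum, and assemble N homogeneous partitions, each holding
    one sample from every stratum (roughly D/N transactions overall)."""
    rng = random.Random(seed)
    # 1. Stratify the database according to some characteristic.
    strata = {}
    for t in db:
        strata.setdefault(strata_key(t), []).append(t)
    # 2. Split a shuffled copy of each stratum into N disjoint samples and
    #    give sample j of every stratum to partition j.
    partitions = [[] for _ in range(N)]
    for stratum in strata.values():
        shuffled = list(stratum)
        rng.shuffle(shuffled)
        size = len(shuffled) // N
        for j in range(N):
            partitions[j].extend(shuffled[j * size:(j + 1) * size])
        # Hand out any leftover transactions round-robin so none are dropped.
        for j, t in enumerate(shuffled[size * N:]):
            partitions[j % N].append(t)
    return partitions
```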

Distributed Approach for Generating Frequent Itemsets:
Let the size of each partition DS_i be D_i (= D/N) for i = 1..N. Let X.sup and X.sup_i be the support counts of an itemset X in DB and DS_i, respectively. X.sup is called the global support count of X, and X.sup_i is called the local support count of X at processor P_i. For a given minimum support threshold s, X is globally large if X.sup ≥ s × D, and X is locally large at P_i if X.sup_i ≥ s × D_i.

Notations:
D: Number of transactions in DB
s: Support threshold (min_sup)
L(k): Globally large k-itemsets
C(k): Candidate sets generated from L(k)
X.sup: Global support count of X
L_i(k): Locally large k-itemsets at S_i
X.sup_i: Local support count of X at S_i
CG(k): Candidate sets generated from L(k-1)
T_i(j): Data structure that maintains the itemsets and their support counts at site S_i in the j-th iteration
D_i: Number of transactions in DS_i

There is an important relationship between large itemsets and the partitions allocated in the distributed system: every globally large itemset must be locally large at some partition DS_i. If an itemset X is both globally large and locally large at a partition DS_i, X is called gl_large at DS_i. Notice that at each partition DS_i, if a candidate set X ∈ CG(k) is not locally large at DS_i, there is no need for DS_i to find its global support count to determine whether it is globally large. This is because, in this case, either X is small (not globally large), or it is locally large at some other partition DS_j, and hence only the partitions at which X is locally large need to be responsible for finding the global support count of X. In the proposed approach, since each processor has homogeneous data, the processors generate approximately equal numbers of locally large itemsets simultaneously.

At iteration k, each processor P_i performs the following steps:

    for_all X ∈ CG(k) do
        if X.sup_i ≥ s × D/N then
            for j = 1 to N do
                if polling_site(X) = P_j then
                    insert <X, X.sup_i> into LL_i,j(k);
    for j = 1, ..., N do
        send LL_i,j(k) to processor P_j;
    for j = 1, ..., N do {
        receive LL_j,i(k);
        for_all X ∈ LL_j,i(k) do {
            if X ∉ LP_i(k) then
                insert X into LP_i(k);
            update X.large_processors;
        }
    }
    for_all X ∈ LP_i(k) do
        send_polling_request(X);
    reply_polling_request(T_i(k));
    for_all X ∈ LP_i(k) do {
        receive X.sup_j from the processors P_j, where P_j ∉ X.large_processors;
        X.sup = X.sup_1 + X.sup_2 + ... + X.sup_N;
        if X.sup ≥ s × D then
            insert X into G_i(k);
    }
    broadcast G_i(k);
    receive G_j(k) from all processors P_j (j ≠ i);

* Polling Site: As a polling site, S_i receives the polling requests and the corresponding local support counts; following that, it computes the global support counts of all these candidate sets and finds out the globally large itemsets among them. These globally large k-itemsets are stored in the set G_i(k). Finally, S_i broadcasts the set G_i(k) to all the other sites.

Workload Balanced Distributed Mining (WBDM) Algorithm:
* Home Site: Receives Large Itemsets: As a "home" site, S_i receives the sets of globally large k-itemsets G_i(k) from all the polling sites. By taking the union of the G_i(k) (i = 1, ..., N), S_i finds the set L_k of all the size-k large itemsets. Further, S_i extracts from L_k the set GL_i(k) of gl-large itemsets for each site by using the site list in X.large_sites. The set GL_i(k) is used for candidate set generation in the next iteration.
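A compact single-process simulation of this polling-site / home-site exchange is sketched below. It is only an illustration of the scheme described above: the hash-based polling_site assignment is our assumption (the text does not fix a polling-site function), the inter-site messages are collapsed into in-memory dictionaries, and the per-site structures LL, LP and G become plain Python containers.

```python
def wbdm_exchange(partitions, candidates, s):
    """Simulate one iteration of the locally-large / polling-site exchange:
    every site keeps only its locally large candidates, routes them to a
    polling site, each polling site gathers the remaining counts and decides
    global largeness, and the home sites union the broadcast results."""
    N, D = len(partitions), sum(len(p) for p in partitions)
    # X.sup_i: local support counts of every candidate at every site.
    local = [{c: sum(1 for t in Di if c <= t) for c in candidates}
             for Di in partitions]
    polling_site = lambda c: hash(c) % N        # assumed assignment function
    # Each site sends its locally large candidates to its polling site.
    LP = [{} for _ in range(N)]                 # candidate -> sites where it is locally large
    for i, Di in enumerate(partitions):
        for c, cnt in local[i].items():
            if cnt >= s * len(Di):              # locally large at site i
                LP[polling_site(c)].setdefault(c, set()).add(i)
    # Polling sites collect the missing counts and keep the globally large sets.
    G = [set() for _ in range(N)]
    for p in range(N):
        for c in LP[p]:
            X_sup = sum(local[i][c] for i in range(N))  # polling, simulated
            if X_sup >= s * D:
                G[p].add(c)
    # Home sites: the union of the broadcast G_p(k) is the set L_k.
    return set().union(*G)
```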
Comparison: Suppose a sequential approach takes time T to count the support of the candidate sets in some iteration, and let processor P_i (i = 1..N) take time T_i to calculate the support counts in its allocated partition. In the proposed distributed approach each processor works on a homogeneous partition (with an approximately equal number of locally large itemsets), hence each processor P_i completes its counting in about T/N time, for i = 1..N. Since no other distributed approach has considered a homogeneous partitioning technique for a database with high data skewness, most of the globally large itemsets are locally large only at a few processors. Thus the time required for this processing in other distributed approaches, such as Count Distribution (CD) and Fast Distributed Mining (FDM), equals Max(T_i, i = 1..N), which will be greater than T/N.
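For instance, suppose T = 100 time units and N = 4. With homogeneous partitions each processor finishes its counting in roughly T/N = 25 units, so the pass completes in about 25 units. With a skewed partitioning in which, say, T_1 = 70 and T_2 = T_3 = T_4 = 10, the pass completes only after Max(T_i) = 70 units, close to the sequential cost.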

CONCLUSION
We considered the problem of mining frequent itemsets in a shared-nothing multiprocessor environment in which the data has been partitioned across the nodes by using stratified random sampling. An advantage of sampling for data partitioning is that the cost of obtaining a sample is proportional to the size of the sample, S, rather than the size of the dataset, D; other data partitioning techniques can require at least one complete pass through D. The algorithm also attempts to minimize communication by allocating homogeneous partitions to each processor.
The algorithm is more efficient for mining frequent itemsets in databases that are very large and have high data skewness. A parallel algorithm working on a database with high data skewness cannot achieve the advantages of parallel processing, because most globally large itemsets are clustered on a few processors. The stratified random sampling used as the partitioning approach balances the workload on each processor in a distributed environment.