An Efficient Algorithm for Mining Maximal Frequent Item Sets

Problem Statement: The mining of frequent patterns is a fundamental problem in data mining applications, and the algorithms used to generate these frequent patterns must perform efficiently. The objective was to propose an effective algorithm that generates frequent patterns in less time. Approach: We proposed an algorithm based on a hashing technique that combines a vertical tidset representation of the database with effective pruning mechanisms. It removes all non-maximal frequent itemsets to obtain the exact set of Maximal Frequent Itemsets (MFI) directly, and it works efficiently when the number of itemsets and tidsets is large. Results: The performance of our algorithm was compared with the recently developed MAFIA algorithm and the results show that our algorithm performs better. Conclusions: The proposed algorithm performs effectively and generates frequent patterns faster.


INTRODUCTION
Frequent pattern mining plays a major role in many data mining applications such as mining association rules and correlations. Frequent patterns are patterns that occur frequently in a given data set. If a set of items appears frequently together in a transaction data set, it is referred to as a frequent itemset. If a set of items occurs frequently in a sequential manner, it is referred to as a frequent sequential pattern. If a substructure such as a subtree or subgraph occurs frequently, it is called a frequent structured pattern. These frequent items can be represented in the form of association rules. The association rule problem is a very important problem in the data mining field, with numerous practical applications including consumer market-basket analysis, inferring patterns from web page access logs and network intrusion detection. The association rule model was introduced by Agrawal [1]. Support and confidence are the two measures of rule interestingness; they reflect the usefulness and certainty of discovered rules, respectively. An association rule X => Y (support = a%, confidence = b%) means that a% of all transactions contain X and Y together, and b% of the customers who purchased X also bought Y, where X and Y are itemsets. Association rules must satisfy both a minimum support threshold and a minimum confidence threshold; such rules are called strong [2]. The two measures are defined as:

Support(X => Y) = P(X ∪ Y)
Confidence(X => Y) = P(Y|X) = Support(X ∪ Y) / Support(X) [29]

Frequent pattern mining can be classified based on the completeness of the patterns to be mined, the levels of abstraction involved in the rule set, the number of data dimensions involved in the rule, the types of values handled in the rule, the kinds of rules to be mined and the kinds of patterns to be mined. Algorithms for frequent itemset mining fall into three categories: Apriori-like algorithms, frequent pattern growth based algorithms such as FP-growth, and algorithms that use the vertical data format. The process of finding association rules has two separate phases [13]. In the first phase, we find the set of Frequent Itemsets (FI) in the database; in the second phase, we use the set FI to generate interesting patterns. In practice, the first phase is the most time consuming.
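To make the two measures concrete, the following minimal Python sketch computes support and confidence over a toy transaction database; the data and function names are purely illustrative and not taken from the paper.

```python
# Toy horizontal transaction database (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Confidence(X => Y) = Support(X u Y) / Support(X)."""
    return support(x | y, db) / support(x, db)

x, y = {"bread"}, {"milk"}
print(support(x | y, transactions))    # 0.5   -> support of bread => milk
print(confidence(x, y, transactions))  # ~0.67 -> confidence of bread => milk
```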
When very long patterns (patterns containing many items) are present in the data, it is often impractical to generate the entire set of frequent itemsets or closed itemsets [27]. There is much research on methods for generating all frequent itemsets efficiently [3,7,8,9,15,18,31] or just the set of maximal frequent itemsets [6,10,12,14,27]. Most of these algorithms use a breadth-first approach, i.e., finding all k-itemsets before considering (k+1)-itemsets. However, with dense datasets such as telecommunications and census data, where there are many long frequent patterns, the performance of these algorithms degrades dramatically. This degradation has two causes. First, these algorithms perform as many passes over the database as the length of the longest frequent pattern. Second, a frequent pattern of length l implies the presence of 2^l - 2 additional frequent patterns as its subsets, each of which is explicitly examined, so when l is large the frequent itemset mining methods become CPU bound rather than I/O bound; in other words, it is practically infeasible to mine the set of all frequent patterns for anything but small l. There are two current solutions to the long pattern mining problem. The first is to mine only the maximal frequent itemsets [19], which are typically orders of magnitude fewer than all frequent patterns. While mining maximal sets helps in understanding the long patterns in dense domains, it leads to a loss of information; since subset frequency is not available, maximal sets are not suitable for generating rules. The second is to mine only the frequent closed sets [19][20][21]. Closed sets are lossless in the sense that they uniquely determine the set of all frequent itemsets and their exact frequencies. At the same time, closed sets can themselves be orders of magnitude smaller than all frequent sets, especially on dense databases.
The large itemset problem is reasonably well solved for the case of very sparse sales transaction data, where the pattern lengths are short [6,13]. An interesting analysis of the impact of different kinds of data on access costs has been provided in [18]. An Apriori-style algorithm with improved counting techniques using column-wise data access for databases with a larger number of items has also been discussed in the same study. We maintain that when the actual frequent patterns are wide, even the CPU costs of any algorithm based on the Apriori framework are compromised by the investigation of all 2^k subsets of frequent k-patterns. In such cases, the frequent itemset generation algorithms become CPU-bound. GenMax utilizes a backtracking search for efficiently enumerating all maximal patterns. GenMax uses a number of optimizations to quickly prune away a large portion of the subset search space. It uses a novel progressive focusing technique to eliminate non-maximal itemsets, and uses diffset propagation for fast frequency checking.
Where Mafia is good for mining a superset of all maximal patterns, GenMax is the method of choice for enumerating the exact set of maximal patterns; we further observe that there is a type of data on which MaxMiner delivers the best performance. We denote by Fk the set of frequent k-itemsets, and by FI the set of all frequent itemsets. A frequent itemset is called maximal if it is not a subset of any other frequent itemset. The set of all maximal frequent itemsets is denoted MFI. Given a user-specified min-sup value, our goal is to efficiently enumerate all patterns in MFI. Backtracking algorithms are useful for many combinatorial problems. There are two main ingredients in developing an efficient MFI algorithm: the first is the set of techniques used to remove entire branches of the search space, and the second is the representation used to perform fast frequency computations. We describe below how GenMax extends the basic backtracking routine for FI, and then the progressive focusing and diffset propagation techniques it uses for fast maximality and frequency checking.
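The basic backtracking scheme can be summarized as follows. This is a simplified sketch in the spirit of GenMax's routine, not the authors' exact algorithm; the function names and the simple subset-based subsumption check (GenMax uses progressive focusing and diffsets instead) are our own assumptions.

```python
# Sketch of backtracking MFI enumeration over a vertical database.
# `tail` holds (item, tidset) pairs whose tidsets are already
# intersected with the tidset of the current prefix.
def mine_mfi(prefix, tail, min_sup, mfi):
    extended = False
    for idx, (item, tids) in enumerate(tail):
        if len(tids) >= min_sup:               # prefix + item is frequent
            extended = True
            # Propagate the new prefix's tidset into the remaining tail.
            new_tail = [(j, tj & tids) for j, tj in tail[idx + 1:]]
            mine_mfi(prefix | {item}, new_tail, min_sup, mfi)
    # No frequent extension: prefix is maximal unless an earlier branch
    # already produced a superset (naive stand-in for progressive focusing).
    if not extended and prefix and not any(prefix <= m for m in mfi):
        mfi.append(prefix)

def find_mfi(tidsets, min_sup):
    tail = [(i, set(t)) for i, t in sorted(tidsets.items())]  # fixed order
    mfi = []
    mine_mfi(set(), tail, min_sup, mfi)
    return mfi
```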
Some algorithms in the literature, such as MaxMiner, avoid this exhaustive subset examination by implementing lookaheads [27], in which supersets of frequent patterns are used to prune off potential candidates in the search. Other innovative ideas for handling these problems are discussed in [23]. Recently, the merits of a depth-first approach have been recognized [6].
The database representation is also an important factor in the efficiency of generating and counting itemsets. Generating the itemset Z = X ∪ Y refers to creating t(Z) = t(X) ∩ t(Y), and counting is the process of determining support(Z) in T. Most previous algorithms use a horizontal row layout, with the database organized as a set of rows, each row representing a transaction. The alternative vertical column layout associates with each item X the set of transaction identifiers (tids) t(X). The vertical representation allows simple and efficient support counting.

Basic properties of itemset-tidset pairs: We use the concept of a closure operation [24,25] to check whether a given itemset X is closed. We define the closure of an itemset X, denoted c(X), as the smallest closed set that contains X. Recall that i(Y) is the set of items common to all the tids in the tidset Y, while t(X) is the set of tids common to all the items in X. To find the closure of an itemset X, we first compute the image of X in the transaction space to get t(X). We next map t(X) to its image in the itemset space using the mapping i to get i(t(X)). It is well known that the resulting itemset must be closed [25], i.e., c(X) = i ∘ t(X) = i(t(X)). It follows that an itemset X is closed if and only if X = c(X). For example, the itemset ACW is closed since c(ACW) = i(t(ACW)) = i(1345) = ACW. The support of an itemset X is also equal to the support of its closure, i.e., σ(X) = σ(c(X)) [26,27].
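The closure computation is easy to express on the vertical layout. The sketch below uses the standard example database from the CHARM/GenMax literature (items A, C, D, T, W over tids 1-6), chosen to be consistent with the tidsets quoted in the text (t(ACW) = 1345); the full database itself is our reconstruction and is not given in this paper.

```python
# Vertical layout: item -> tidset (standard CHARM example database).
vertical = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

def t(itemset):
    """t(X): tids common to all items in X."""
    return set.intersection(*(vertical[x] for x in itemset))

def i(tids):
    """i(Y): items common to all tids in Y."""
    return {x for x, xt in vertical.items() if tids <= xt}

def closure(itemset):
    """c(X) = i(t(X))."""
    return i(t(itemset))

print(t({"A", "C", "W"}))        # {1, 3, 4, 5}
print(closure({"A", "C", "W"}))  # {'A', 'C', 'W'} -> ACW is closed
```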
For any two nodes in the IT-tree, Xi × t(Xi) and Xj × t(Xj), if Xi ⊆ Xj then t(Xj) ⊆ t(Xi). For example, for ACW ⊆ ACTW, t(ACTW) = 135 ⊆ 1345 = t(ACW). Let us define f : P(I) → N to be a one-to-one mapping from itemsets to integers. For any two itemsets Xi and Xj, we say Xi ≤f Xj iff f(Xi) ≤ f(Xj). The function f defines a total order over the set of all itemsets. For example, if f denotes the lexicographic ordering, then AC ≤f AD, but if f sorts itemsets in increasing order of their support, then AD ≤f AC if σ(AD) ≤ σ(AC). There are four basic properties of IT-pairs that CHARM leverages for fast exploration of closed sets. Assume that we are currently processing a node P × t(P), where [P] = {l1, l2, ..., ln} is the prefix class. Let Xi denote the itemset Pli; then each member of [P] is an IT-pair Xi × t(Xi). In a thorough experimental evaluation, we first quantify the effect of each individual component on the performance of the algorithm. We then compare the performance of MAFIA against Depth Project, the most efficient previously known algorithm for finding maximal frequent itemsets [6]. Our results using some of the standard machine learning benchmark datasets indicate that MAFIA outperforms Depth Project by a factor of three to five on average. The main aim in developing this algorithm is to achieve CPU efficiency.

MATERIALS AND METHODS
The problem of mining frequent itemsets has been a topic of intensive research [14,26]. Since the number of such sets is huge, it is common and more efficient to restrict the search to closed itemsets [26], where a set is closed if all its supersets have strictly lower frequency in the database. The collection of frequent closed sets contains the same information as the overall collection of frequent itemsets, but is much smaller. There is also a growing interest in mining structured data, such as graphs and, more generally, multi-relational databases, and the notion of closed sets has also been imported into this richer setup. Another variation exists between mining in a single interpretation (graph) and across multiple interpretations. Finally, some authors restrict the implication relation used in defining closures to range-restricted clauses only. In addition to these differences, the notion of a closed set can be coupled with a closure operator that takes a set and calculates its closure, and there is more than one way to define such closures. The literature gives the impression that these different choices are unimportant and that algorithmic issues can be studied independently of the semantics. Our investigation shows that this impression is false and that the semantics do matter.
Methods for finding the maximal elements include All-MFS [28] , which works by iteratively attempting to extend a working pattern until failure. A randomized version of the algorithm that uses vertical bit-vectors was studied, but it does not guarantee every maximal pattern will be returned.
Max Miner [27] is another algorithm for finding the maximal elements. It uses efficient pruning techniques to quickly narrow the search: it employs a breadth-first traversal of the search space and reduces database scanning through a lookahead pruning strategy. Depth Project [30] finds long itemsets using a depth-first search of a lexicographic tree of itemsets, and uses a counting method based on transaction projections along its branches. It returns a superset of the MFI and requires post-pruning to eliminate non-maximal patterns. FP-growth [31] uses the novel Frequent Pattern tree (FP-tree) structure, which is a compressed representation of all the transactions in the database.
Mafia [29] is the most recent method for mining the MFI. Mafia uses three pruning strategies to remove non-maximal sets: the first is the look-ahead pruning first used in Max Miner; the second is to check whether a new set is subsumed by an existing maximal set; the third is to check whether t(X) ⊆ t(Y), in which case X is considered together with Y for extension.
The most important category of approaches in multi-relational classification is ILP. Besides ILP, probabilistic approaches are also popular for multi-relational classification and modeling. The most important one is probabilistic relational models (PRMs) [22,17], an extension of Bayesian networks for handling relational data. PRMs can integrate the advantages of both logical and probabilistic approaches for knowledge representation and reasoning. In [16], an approach is proposed that integrates ILP and statistical modeling for document classification and retrieval.
Given this conceptual framework, we can describe the most recent approaches to the maximal frequent itemset problem. As a baseline, Apriori traverses the lattice in a pure breadth-first manner, discovering all frequent nodes at level k before moving to level (k+1); Apriori finds support information by explicitly generating and counting each node [13]. Max Miner performs a breadth-first traversal of the search space as well, but also performs lookaheads to prune out branches of the tree. The lookaheads involve superset pruning, using Apriori in reverse (all subsets of a frequent itemset are also frequent). In general, lookaheads work better with a depth-first approach, but Max Miner uses a breadth-first approach to limit the number of passes over the database. Depth Project performs a mixed depth-first traversal of the tree, along with variations of superset pruning [6]. Instead of a pure depth-first traversal, Depth Project uses dynamic reordering of children nodes. With dynamic reordering, the size of the search space can be greatly reduced by trimming infrequent items out of each node's tail. Also proposed in Depth Project are an improved counting method and a projection mechanism to reduce the size of the database. The other notable maximal pattern methods are based on graph-theoretic approaches. MaxClique and MaxEclat [10] both attempt to divide the subset lattice into smaller pieces ("cliques") and proceed to mine these in a bottom-up Apriori fashion with a vertical data representation. The VIPER algorithm has shown that a method based on a vertical layout can sometimes outperform even the optimal method using a horizontal layout [11]. Other vertical mining methods for finding FI are presented by Holsheimer [4] and Savasere et al. [5]. The benefits of using the vertical tid-list were also explored by Ganti et al. [3].
Proposed Work: In general, the transactional database may be structured in two different ways: horizontal data format and vertical data format. Here, we use the vertical data format for storing the transactions in the database. In the vertical data format, the data is represented in item-tidset form, where item is the name of the item and tidset is the set of identifiers of the transactions containing the item. We use a hash data structure to represent this format: one hash is maintained for the itemsets and another for the tidsets, with pointers maintained as links between them, as shown in Fig. 1.
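The following sketch shows how such a vertical layout can be built with an ordinary hash map. The horizontal transactions are an assumption chosen to be consistent with the second-level tidsets in Table 2 below; they are not the authors' actual dataset.

```python
from collections import defaultdict

# Assumed horizontal database, consistent with Table 2 (a transaction T2
# containing none of I1-I3 would not affect these tidsets and is omitted).
transactions = {
    "T1": {"I1", "I2"}, "T3": {"I2", "I3"}, "T4": {"I1", "I2"},
    "T5": {"I1", "I3"}, "T6": {"I2", "I3"}, "T7": {"I1", "I3"},
    "T8": {"I1", "I2", "I3"}, "T9": {"I1", "I2", "I3"},
}

# One hash from each item to its tidset: the vertical data format.
tidsets = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tidsets[item].add(tid)

print(sorted(tidsets["I1"]))               # ['T1', 'T4', 'T5', 'T7', 'T8', 'T9']
# Support is simply the size of a tidset intersection:
print(len(tidsets["I1"] & tidsets["I2"]))  # 4 -> {T1, T4, T8, T9}
```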
The candidate itemsets generated at the second level of the running example, together with their tidsets, are shown in Fig. 2 and Table 2.
If we observe, the third-level itemset {I1, I2, I3} has only two transactions {T8, T9} in its tidset, so it is not even a frequent itemset. Therefore, the frequent itemsets and maximal frequent itemsets obtained at the second level are the final result. It can be observed that the number of levels increases when the support threshold is low; if we increase the support, the number of levels decreases and so does the time to find the MFI.

Table 2: D in the second level
Itemset     Tidset
I1, I2      T1, T4, T8, T9
I1, I3      T5, T7, T8, T9
I2, I3      T3, T6, T8, T9

The process continues as long as intersections can be taken. In this procedure we need not calculate the support of each itemset separately: it is given by the number of transactions in its tidset. Also, pruning can be done while finding the MFI itself, not after finding FI completely. The proposed algorithm is given below. The proposed algorithm performs better because the MFI is computed directly, before FI is computed completely. At each level, after the computation of FI, we also compute the MFI, so the time taken to compute the MFI is negligible; it also means that no separate pruning step is required. A hash data structure is maintained to store the database, which makes several tasks easy to perform. Since we follow the vertical data format, support also need not be calculated separately; it is given directly by the number of transactions in the tidlist of each frequent itemset.
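Since the original pseudocode does not survive in this copy, the following is only a sketch of the level-wise procedure described above, under our own naming (hbmfi): tidsets of frequent k-itemsets are intersected to form the (k+1)-level, support is read off as the tidset size, and an itemset joins the MFI as soon as it has no frequent extension.

```python
from itertools import combinations

def hbmfi(tidsets, min_sup):
    # Level 1: frequent single items as frozensets.
    level = {frozenset([i]): t for i, t in tidsets.items() if len(t) >= min_sup}
    mfi = []
    while level:
        next_level, extended = {}, set()
        for (x, tx), (y, ty) in combinations(level.items(), 2):
            cand, tids = x | y, tx & ty
            # Join only k-itemsets sharing k-1 items; keep frequent results.
            if len(cand) == len(x) + 1 and len(tids) >= min_sup:
                next_level[cand] = tids
                extended.update((x, y))
        # With level-wise generation, any x lacking a frequent extension
        # is maximal; the subset check is kept as a defensive guard.
        for x in level:
            if x not in extended and not any(x <= m for m in mfi):
                mfi.append(x)
        level = next_level
    return mfi

# Running example (tidsets consistent with Table 2), minimum support 4:
print(hbmfi({"I1": {1, 4, 5, 7, 8, 9}, "I2": {1, 3, 4, 6, 8, 9},
             "I3": {3, 5, 6, 7, 8, 9}}, min_sup=4))
# -> the three 2-itemsets {I1, I2}, {I1, I3}, {I2, I3}
```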

RESULTS

Figures 3-5 illustrate the results of comparing HBMFI to our implementation of the MAFIA method, the state-of-the-art method for finding maximal frequent itemsets. Support is taken as the X-axis and the time taken to find the MFI as the Y-axis. The comparison of the MAFIA and HBMFI algorithms on different datasets is shown in Fig. 3, Fig. 4 and Fig. 5.
The percentage improvement of the proposed algorithm HBMFI is smaller on the Chess dataset than on the other datasets: the extremely low number of transactions and the small number of frequent items at low supports mute the factors that HBMFI relies on to improve over MAFIA. Both algorithms scale linearly with the database size, but HBMFI is about 2 to 3 times faster than MAFIA. Thus HBMFI performs better with a large number of transactions and long itemsets.

DISCUSSION
The testing of this algorithm has been carried out on real datasets containing a large number of transactions and long itemsets, such as Chess, Mushroom and Connect-4. At the lowest supports tested, the longest frequent patterns contain over 20 items, making any algorithm that examines all possible subsets of these patterns infeasible. This makes the task of finding the MFI computationally intensive despite the small size of the databases.
For Connect-4, the increased efficiency of itemset generation and support counting in MAFIA and HBMFI explains the improvement. Connect-4 contains an order of magnitude more transactions than the other two datasets, amplifying the advantage in generation and counting.
For Mushroom, the improvement is best explained by the fact that the MFI is computed at each level and found directly, without waiting for FI to be computed completely. This leads to a much greater reduction in the overall search space than for the other datasets, since the reduction is greatest at the highest levels.

CONCLUSION
We presented HBMFI, an algorithm for finding maximal frequent itemsets. Our experimental results demonstrate that HBMFI consistently outperforms MAFIA by a factor of 2-3 on average. The vertical data format representation of the database, the easy manipulation of the hash data structure and the direct computation of the MFI are the added advantages of this algorithm.