Mining of Datasets with an Enhanced Apriori Algorithm

Problem statement: Classical association rule mining mostly finds intra-transaction associations, i.e., associations among items within the same transaction, where a transaction could be the items bought by the same customer on the same day. The goal of inter-transaction association rules is to represent the associations between events found in different transactions. Approach: In this study, we break the barrier of transactions and extend the scope of mining association rules from traditional single-dimensional, intra-transaction associations to N-dimensional, inter-transaction associations. With the introduction of dimensional attributes, we lose the luxury of the simple representational form of the classical association rules. Mining inter-transaction associations poses more challenges for efficient processing than mining intra-transaction associations because the number of potential association rules becomes extremely large once the boundary of transactions is broken. Results: Tests were conducted using data sets collected from different Stock Exchanges (SE). Experimental results on real-life and synthetic data sets are reported, and they show the effectiveness of our work in generating rules and in finding an acceptable set of rules under varying conditions. Conclusion/Recommendations: This study introduces the notion of the N-dimensional inter-transaction association rule, defines its measures, support and confidence, and develops an efficient algorithm called Modified Apriori.


INTRODUCTION
Among all the data mining problems, discovering association rules from large databases is probably the most significant contribution from the database community to the field (Agrawal et al., 1993; Agrawal and Srikant, 1994; Dong and Han, 2007; Feng et al., 2002; Han and Fu, 1995; Kamber et al., 1997; Shankar et al., 2009). The most often cited application of association rules is market basket analysis using transaction databases from supermarkets and department stores. We can discover rules like: R1: 80% of customers who bought diapers also bought beer (diaper ⇒ beer (20%, 80%)), where 80% is the confidence level of the rule and 20% is the support level of the rule, indicating how frequently the rule holds.
Association rules for prediction: The same concept can be applied to other applications as well. For example, to predict stock market price movement (Tung et al., 2003), we can construct a transaction database in which each record (transaction) represents one trading day and contains a list of winners (stocks whose closing price is x% more than the previous day's closing price, where x% is the trading overhead). Thus we can find rules like: R2: When the prices of IBM and SUN go up, 80% of the time the price of Microsoft goes up (on the same day).
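The construction of such a transaction database can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_winner_transactions`, the price series and the 2% threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_winner_transactions(
    closes: Dict[str, List[float]], x_percent: float
) -> List[Tuple[int, List[str]]]:
    """Return one (day_index, winners) record per trading day.

    A stock is a winner on a day if its close exceeds the previous
    day's close by more than x_percent. Day 0 has no previous close,
    so records start at day 1.
    """
    n_days = min(len(series) for series in closes.values())
    records = []
    for day in range(1, n_days):
        winners = sorted(
            stock
            for stock, series in closes.items()
            if series[day] > series[day - 1] * (1 + x_percent / 100.0)
        )
        records.append((day, winners))
    return records


# Hypothetical closing prices, for illustration only.
closes = {
    "IBM": [100.0, 104.0, 104.5, 108.0],
    "SUN": [50.0, 52.0, 52.1, 54.0],
    "MSFT": [80.0, 80.2, 83.0, 83.1],
}
records = build_winner_transactions(closes, x_percent=2.0)
for record in records:
    print(record)
```

With this toy data, IBM and SUN are winners on days 1 and 3, while MSFT is the only winner on day 2. A same-day rule such as R2 cannot relate these records, because the related movements fall in different transactions.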
While rule R2 reflects some relationship among the prices, its role in price prediction is limited. Traders are likely to be more interested in rules of the following kind: R3: If the prices of IBM and SUN go up, Microsoft's will most likely (80% of the time) go up the next day.
Unfortunately, current association rule miners cannot discover rules of this kind.
The fundamental difference: There is a fundamental difference between rule R3 and the other rules. The classical association rules express the associations among items purchased by one customer or share price movement within a day, i.e., associations among items within the same transaction record. We call them intra-transaction association rules. Sequential pattern discovery is also intra-transaction mining in nature because each sequence is treated as one transaction and the mining process is to find similarities among the sequences. On the other hand, rule R3 expresses the association among items from different transaction records. We call it inter-transaction association.

N-dimensional inter-transaction association rules:
In the stock movement prediction application, the association is along one dimension, the trading days. The concept can be extended further. If a database contains records about the time and location of buildings and facilities in a new city under development, we may be able to find rules that relate events across both time and location. Based on what has been described above, we propose N-dimensional inter-transaction association rules, with the classical association rules as a special case.
The transaction database: Definition 1: Let $E = \{e_1, e_2, \ldots, e_u\}$ be a set of literals, called events. Let $D_1, D_2, \ldots, D_n$ be a set of attributes. A transaction database is a database containing records of the form $(d_1, d_2, \ldots, d_n, E_i)$ such that $\forall k\,(1 \le k \le n)$, $d_k \in \mathrm{Dom}(D_k)$, where $\mathrm{Dom}(D_k)$ is the domain of attribute $D_k$ and $E_i \subseteq E$. A transaction database with n attributes is called an n-dimensional transaction database.
The attributes in an n-dimensional transaction database are called dimensional attributes. They describe the properties associated with the events, such as time and place. A wide range of application databases can be viewed as n-dimensional transaction databases. The stock price movement database is a 1-dimensional transaction database. The urban development example can use a 2-dimensional transaction database where the two dimensional attributes are month and block number and the event list includes the buildings or facilities completed during the month at a particular block. In the current study, we assume that the domain of a dimensional attribute can be divided into equal-length intervals. For example, time can be divided into days, weeks, months, etc. and distance into meters or miles. The intervals can be represented by the integers 0, 1, 2, … without loss of generality. If we divide the space into n-dimensional cells, each of which is identified by its associated n-ary tuple $(d_1, d_2, \ldots, d_n)$, each transaction in the database represents a non-empty cell with some points (events) inside it.
Definition 2: Let $T_i = (d_{i1}, d_{i2}, \ldots, d_{in}, E_i)$ be a record in the transaction database. $(d_{i1}, d_{i2}, \ldots, d_{in})$ is the address of each event $e_i \in E_i$. An event associated with its address is called an event instance, denoted by $\bar{e}_i = e_i(d_{i1}, d_{i2}, \ldots, d_{in})$. Figure 1 depicts a 2-dimensional transaction database. The dimensional attribute values of $D_1$ and $D_2$ have been mapped to integers, and there are four types of events: a, b, c and d. The database contains the transactions $T_1(1,1,a,b,c)$, $T_2(2,1,b)$, …, $T_{24}(5,5,c)$.

N-dimension inter-transaction association rules:
The objective of inter-transaction association rules is to represent the associations between various events found in different transactions. With the introduction of dimensional attributes, we lose the luxury of simple representational form of the classical association rules. Some definitions are needed before we formally define such rules.
Definition 4: Given a set of transactions $T = \{T_1, T_2, \ldots, T_s\}$, where transaction $T_j$ is of the form $(d_{j1}, d_{j2}, \ldots, d_{jn}, E_j)$ $(1 \le j \le s)$, the n-ary tuple $(d_{01}, d_{02}, \ldots, d_{0n})$ with $d_{0k} = \min_{1 \le j \le s}(d_{jk})$ $(1 \le k \le n)$ is called the base address of transaction set T, denoted by T-BASE(T). The relative addresses of all member transactions in T form the address of transaction set T, denoted by T-ADDR(T).
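As an illustration of the base address and the address of a transaction set, the sketch below computes both for a list of transaction addresses. It assumes that a relative address is the component-wise difference between a transaction's address and the base address, which this section does not state explicitly; the function names `t_base` and `t_addr` and the sample addresses are hypothetical.

```python
from typing import Iterable, Set, Tuple

Address = Tuple[int, ...]


def t_base(addresses: Iterable[Address]) -> Address:
    """Base address: the per-dimension minimum over all addresses."""
    addresses = list(addresses)
    return tuple(min(coords) for coords in zip(*addresses))


def t_addr(addresses: Iterable[Address]) -> Set[Address]:
    """Relative addresses of all member transactions.

    Assumption: relative address = address minus base address,
    taken component by component.
    """
    addresses = list(addresses)
    base = t_base(addresses)
    return {tuple(d - b for d, b in zip(addr, base)) for addr in addresses}


# Hypothetical addresses of three transactions in a 2-dimensional database.
members = [(1, 1), (1, 2), (2, 2)]
print(t_base(members))          # (1, 1)
print(sorted(t_addr(members)))  # [(0, 0), (0, 1), (1, 1)]
```

Because only relative addresses are kept, transaction sets located in different parts of the space can share the same T-ADDR, which is what allows them to support the same rule.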
Definition 5: Given a set of transactions $T = \{T_1, T_2, \ldots, T_s\}$, where $T_j$ is of the form $(d_{j1}, d_{j2}, \ldots, d_{jn}, E_j)$ $(1 \le j \le s)$, and a set of event instances $\bar{E}_T$, T is said to contain $\bar{E}_T$ if:
(1) for every $\bar{e}_i \in \bar{E}_T$, there exists a transaction $T_j \in T$ such that $e_i \in E_j$ and the relative address of $\bar{e}_i$ in E-ADDR($\bar{E}_T$) is the same as the relative address of $T_j$ in T-ADDR(T);
(2) $|\text{E-ADDR}(\bar{E}_T)| = |\text{T-ADDR}(T)|$.
In the definition, the first condition guarantees that each event appears in the event list of some record in the transaction database. The second condition requires that the transaction set be a minimum set. In our example, transaction set $\{T_1, T_6, T_7\}$ contains the event instance set {a(0,0), c(0,1), d(1,1)}. $\{T_{11}, T_{16}, T_{17}\}$ and $\{T_8, T_{13}, T_{14}\}$ contain the same set of event instances. Now we are ready to define n-dimensional inter-transaction association rules.
Definition 6: An inter-transaction association rule is an implication of the form $X \Rightarrow Y$, where (1) X and Y are sets of event instances in the form of e … For the database shown in Figure 1, one such association rule is a(0,0), c(0,1) ⇒ d(1,1).
Since inter-transaction association rules involve more than one transaction, the definitions of support and confidence, which are widely used as the objective interestingness measures for intra-transaction association rules, need to be modified. The reason is that the number of transactions in the database can no longer be used as the measure. To address the problem, we introduce the following notion.
Definition 7: Let $T_{xy}$ be the set of transaction sets containing the event instance set $X \cup Y$, $T'_{xy}$ be the set of transaction sets that possibly contain $X \cup Y$ and $T_x$ be the set of transaction sets containing X. The support and confidence of an inter-transaction association rule $X \Rightarrow Y$ are defined as:
$$\mathrm{support} = \frac{|T_{xy}|}{|T'_{xy}|}, \qquad \mathrm{confidence} = \frac{|T_{xy}|}{|T_x|}$$
As an example, we compute the support and confidence of the association rule a(0,0), c(0,1) ⇒ d(1,1) in the database shown in Figure 1. Here, X = {a(0,0), c(0,1)} and Y = {d(1,1)}. There are three transaction sets that contain the event instance set $X \cup Y$: $T_{xy} = \{\{T_1, T_6, T_7\}, \{T_8, T_{13}, T_{14}\}, \{T_{11}, T_{16}, T_{17}\}\}$, so $|T_{xy}| = 3$. The transaction database contains 24 records. The number of transaction sets that possibly contain $X \cup Y$ is $|T'_{xy}| = 13$. Note that the database does not contain any transaction with address (4,4), which reduces the number of transaction sets that possibly contain the event instance set. In addition to the transaction sets in $T_{xy}$, $\{T_{18}, T_{22}, T_{23}\}$ is a transaction set that possibly contains $X \cup Y$ and surely contains X: $T_x = \{\{T_1, T_6, T_7\}, \{T_8, T_{13}, T_{14}\}, \{T_{11}, T_{16}, T_{17}\}, \{T_{18}, T_{22}, T_{23}\}\}$, so $|T_x| = 4$. Therefore, the support and confidence of the above rule are 3/13 and 3/4, respectively. Note that we do not count the events a and c in transactions $T_{19}$ and $T_{24}$ when computing the confidence, as no transaction set that possibly contains $X \cup Y$ can be formed with $T_{19}$ and $T_{24}$.

Mining 1-dimensional inter-transaction association rules:
Mining n-dimensional inter-transaction rules is a computation-intensive problem (Lee et al., 2006). Compared with classical association rules, the search space is much bigger, as the number of possible rules increases dramatically with both the number of transactions and the number of dimensions. To investigate the feasibility of mining inter-transaction rules, we implemented two algorithms that extend the Apriori algorithm to mine 1-dimensional inter-transaction association rules and applied them to the problem of stock price movement prediction. To limit the search space, we used an additional mining parameter, MAXINTERVAL, to define a sliding window. Only the associations among events that co-occur within the window are of interest. In general, the mining process for n-dimensional inter-transaction rules can be divided into three phases: data preparation, frequent-item set discovery and candidate generation.

Data preparation:
The transaction database is prepared for mining from operational databases. The major task in this phase is to organize the transactions based on intervals of the dimensional attribute(s). For example, to find the long-term movement regularities of stock prices across different weeks (months), we need to transform daily price movements into weekly (monthly) groups. After such a transformation, each record in the database contains an interval value and a list of items.
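A minimal sketch of this grouping step is given below. The text does not say how the events inside one interval are combined, so taking their union is an assumption made here for illustration; the function name `group_by_interval` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_by_interval(
    daily_records: Iterable[Tuple[int, List[str]]], days_per_interval: int
) -> List[Tuple[int, List[str]]]:
    """Map each day index to an interval index and merge the item lists.

    Assumption: items of all days falling in the same interval are
    combined by set union.
    """
    grouped: Dict[int, Set[str]] = defaultdict(set)
    for day, items in daily_records:
        grouped[day // days_per_interval].update(items)
    return [(interval, sorted(items)) for interval, items in sorted(grouped.items())]


# Hypothetical daily records: (day index, items observed that day).
daily = [(0, ["a"]), (1, ["b"]), (4, ["a", "c"]), (5, ["d"]), (11, ["a"])]
weekly = group_by_interval(daily, days_per_interval=5)
print(weekly)  # [(0, ['a', 'b', 'c']), (1, ['d']), (2, ['a'])]
```

Each output record carries an integer interval value and a list of items, which is the record shape the mining phase expects.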

Frequent-Item set discovery:
In this phase, we find the set of all frequent item sets. A k-item set is of the form $\{i_1(d_{i_1}), i_2(d_{i_2}), \ldots, i_k(d_{i_k})\}$, where each event $i_j$, $1 \le j \le k$, is tagged with a non-negative value $d_{i_j}$ indicating its relative address with respect to the base address of the set. For example, the 3-itemset {a(0), b(1), c(3)} contains three event instances expressed in relative addresses along the dimension. That is, taking a transaction containing event a as the base transaction, b(1) is an event b contained in a transaction 1 unit away from the base transaction and c(3) is an event c in a transaction 3 units away from the base transaction. This is quite different from the classical definition of an item set $\{i_1, i_2, \ldots, i_k\}$, in which all items lie within the same transaction.
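To make the relative-address reading concrete, the sketch below checks whether a 1-dimensional item set occurs at a given base transaction and counts such occurrences. It is an illustration under assumptions: the database is modelled as a hypothetical mapping from interval value to event set, and the count is a raw occurrence count rather than the normalised support defined earlier.

```python
from typing import Dict, Iterable, Set, Tuple

Item = Tuple[str, int]  # (event, interval relative to the base transaction)


def contains_at(db: Dict[int, Set[str]], base: int, itemset: Iterable[Item]) -> bool:
    """True if every event occurs in the transaction `offset` units after `base`."""
    return all(event in db.get(base + offset, set()) for event, offset in itemset)


def count_occurrences(db: Dict[int, Set[str]], itemset: Iterable[Item]) -> int:
    """Number of base transactions at which the item set occurs."""
    itemset = list(itemset)
    return sum(1 for base in db if contains_at(db, base, itemset))


# Hypothetical 1-dimensional database: interval value -> events.
db = {0: {"a"}, 1: {"b"}, 2: {"a"}, 3: {"b", "c"}, 5: {"c"}}
itemset = [("a", 0), ("b", 1), ("c", 3)]
print(contains_at(db, 0, itemset))     # True
print(count_occurrences(db, itemset))  # 2
```

In this toy database the item set occurs with base transactions 0 and 2; interval 4 has no record, which does not matter here because no item of the set is required at that position.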
To find the frequent item sets, two algorithms, E-Apriori and M-Apriori, were implemented as extensions of Apriori-based algorithms (Agrawal and Srikant, 1994; Han et al., 2000; Chu et al., 2009; Srikant and Agrawal, 1995). Let $L_k$ represent the set of frequent k-item sets and $C_k$ the set of candidate k-item sets. Both algorithms make multiple passes over the database. Each pass consists of two phases. First, the set of all frequent (k-1)-item sets $L_{k-1}$, found in the (k-1)th pass, is used to generate the candidate item set $C_k$. The candidate generation procedure ensures that $C_k$ is a superset of the set of all frequent k-item sets. The algorithms then scan the database. For each list of consecutive transactions, they determine which candidates in $C_k$ are contained and increment their counts. At the end of the pass, $C_k$ is examined to check which of the candidates are actually frequent, yielding $L_k$. The algorithms terminate when $L_k$ becomes empty.
As previously reported (Han et al., 2000; Feng et al., 2002), the processing cost of the first two iterations (i.e., obtaining $L_1$ and $L_2$) dominates the total mining cost. The reason is that, for a given minimum support, we usually have a very large $L_1$, which in turn results in a huge number of itemsets in $C_2$ to process. For inter-transaction association rules, this situation becomes much more serious, as many additional 2-itemsets like {a(0), a(1)} may be added to $C_2$, leading to a very large $|C_2|$. In order to construct a significantly smaller $C_2$, M-Apriori adopts a hashing technique similar to that of (Park et al., 1995; Han et al., 2000) to filter out unnecessary candidate 2-itemsets. While the support of the candidates in $C_1$ is counted by scanning the database, M-Apriori accumulates information about candidate 2-itemsets in advance, in such a way that all possible 2-itemsets are hashed to a hash table. Each bucket in the hash table holds a number representing how many itemsets have been hashed to this bucket thus far. The resulting hash table can be used to greatly reduce the number of 2-itemsets in $C_2$. In the following, we describe how E-Apriori and M-Apriori generate candidates and count their supports.
Pass 2: For any two 1-itemsets of $L_1$, a(0) and … Of all 2-itemsets in $C_2$, the minimal interval value is always 0.
Pass k > 2: Given $L_{k-1}$, the set of all frequent (k-1)-item sets, the candidate generation procedure returns a superset of the set of all frequent k-item sets. This procedure has two parts. In the join phase, we join $L_{k-1}$ with $L_{k-1}$:
    insert into C_k
    select p.item_1(d_item1), p.item_2(d_item2), …, p.item_k-1(d_itemk-1), q.item_k-1(d_itemk-1)
    from L_k-1 p, L_k-1 q
    where p.item_1(d_item1) = q.item_1(d_item1), …, p.item_k-2(d_itemk-2) = q.item_k-2(d_itemk-2), p.item_k-1(d_itemk-1) <= q.item_k-1(d_itemk-1)
Counting support of candidates: To make support counting efficient, a candidate set $C_k$ of k-itemsets is divided into k groups, with each group $G_o$ $(1 \le o \le k)$ containing the itemsets that have o items whose interval is 0. For example, a 3-item set: … (0), q(0), r(0)}}. Each group is stored in a modified hash-tree. Only those items with interval 0 participate in the construction of this hash-tree, e.g., in group 2, only {a(0), b(0)} and {c(0), d(0)} enter the hash-tree. The construction process is similar to that of Apriori (Agrawal and Srikant, 1994; Rahman and Balasubramanie, 2009). The remaining items, e.g., h(3) and d(2), are simply attached to the corresponding itemsets, e.g., {a(0), b(0)} and {c(0), d(0)} respectively, in the leaves of the tree. Upon reading one transaction of the database, every hash-tree is tested. If an itemset is contained, its attached items whose intervals are larger than 0 are checked against the successive transactions. In the above example, if {a(0), b(0)} exists in the current transaction $t_c$, then transaction $t_{c+3}$ is scanned to see whether it contains item h. If so, the support of the 3-itemset {a(0), b(0), h(3)} is increased by 1.
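The join phase can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: items are written as hypothetical `(interval, event)` pairs and each item set is a tuple sorted in that order, so that every item set starts with an interval-0 item; the last-item comparison is taken as strict, so that an item set is never joined with itself; and the second part of the procedure is not shown.

```python
from typing import Set, Tuple

Item = Tuple[int, str]        # (interval, event), e.g. (1, "b") for b(1)
Itemset = Tuple[Item, ...]    # items sorted by (interval, event)


def join_step(frequent_prev: Set[Itemset]) -> Set[Itemset]:
    """Join L_{k-1} with itself to build candidate k-item sets.

    Two (k-1)-item sets are joined when they agree on their first
    k-2 items and the last item of p sorts before the last item of q.
    """
    candidates: Set[Itemset] = set()
    for p in frequent_prev:
        for q in frequent_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    return candidates


# Hypothetical frequent 2-item sets: {a(0), b(1)}, {a(0), c(3)}, {b(0), c(2)}.
l2 = {
    ((0, "a"), (1, "b")),
    ((0, "a"), (3, "c")),
    ((0, "b"), (2, "c")),
}
print(join_step(l2))  # {((0, 'a'), (1, 'b'), (3, 'c'))}
```

Only the first two item sets share the prefix a(0), so the single candidate produced is {a(0), b(1), c(3)}. Sorting by interval first keeps the interval-0 item in the shared prefix, so a joined candidate still has a minimal interval of 0.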
E-Apriori and M-Apriori share the same procedures, except that in pass 1 M-Apriori hashes all 2-itemsets of the form $\{i_1(0), i_2(d_{i_2})\}$ $(d_{i_2} \neq 0)$ contained in the current series of transactions into the corresponding buckets of a hash table and, in pass 2, prunes from $C_2$ the unnecessary 2-itemsets whose corresponding bucket values in the hash table are less than the support threshold.
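The hash-based filtering of 2-itemsets can be sketched as follows. The sketch rests on several assumptions: the database is a hypothetical list of event sets indexed by interval, the hash function (CRC32 of the item set's text form) is chosen only to be deterministic, and the threshold is an absolute occurrence count. None of these details is specified in the text.

```python
import zlib
from typing import List, Set, Tuple

Item = Tuple[str, int]      # (event, interval)
Pair = Tuple[Item, Item]    # {i1(0), i2(d)} with d != 0


def bucket_of(itemset: Pair, n_buckets: int) -> int:
    # Assumption: any deterministic hash works; CRC32 is used for illustration.
    return zlib.crc32(repr(itemset).encode("utf-8")) % n_buckets


def hash_two_itemsets(db: List[Set[str]], max_interval: int, n_buckets: int) -> List[int]:
    """Pass 1: hash every {i1(0), i2(d)}, 1 <= d <= max_interval, into a bucket."""
    buckets = [0] * n_buckets
    for base in range(len(db)):
        for d in range(1, max_interval + 1):
            if base + d >= len(db):
                break
            for e1 in db[base]:
                for e2 in db[base + d]:
                    buckets[bucket_of(((e1, 0), (e2, d)), n_buckets)] += 1
    return buckets


def prune_c2(candidates: List[Pair], buckets: List[int], min_count: int) -> List[Pair]:
    """Pass 2: keep only candidates whose bucket count reaches the threshold."""
    n = len(buckets)
    return [c for c in candidates if buckets[bucket_of(c, n)] >= min_count]


# Hypothetical database: position in the list is the interval value.
db = [{"a"}, {"b"}, {"a"}, {"b"}, {"c"}]
buckets = hash_two_itemsets(db, max_interval=1, n_buckets=8)
candidates = [(("a", 0), ("b", 1)), (("b", 0), ("a", 1)), (("b", 0), ("c", 1))]
kept = prune_c2(candidates, buckets, min_count=2)
print(kept)
```

A bucket count is an upper bound on the count of every item set hashed to it, so a candidate that really reaches the threshold, here {a(0), b(1)} with two occurrences, is never pruned. Collisions can let some infrequent candidates through, and these are removed later by the exact support count.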

MATERIALS AND METHODS
To assess the performance of the proposed algorithms, some preliminary experiments were conducted using synthetic data. Table 1 lists one set of results obtained using a transaction database of 10,000 records, with each record containing 5 items on average. The total number of items is 500. The maximum interval is set to 3. In the table, T denotes ave_tran_size, N item_num, D tran_num and R max_interval. The results indicate that, with the given setting, the execution time is acceptable, especially if the M-Apriori algorithm is used. It is also found that, although the execution time of the first pass of M-Apriori is slightly longer than that of E-Apriori due to the extra overhead of building the hash table, M-Apriori takes significantly less time than E-Apriori in pass 2, because a smaller $|C_2|$ takes much less time to test against each transaction of the database.

RESULTS AND DISCUSSION
Some tests were conducted using a data set collected from the Singapore Stock Exchange (SES). The available stock price data was used to generate two data sets, WINNER and LOSER. A stock is a winner if its closing price of the day is 3% more than the previous day's closing price. A stock is a loser otherwise. The WINNER (LOSER) data set contains the date and the winners (losers) of that day. Each data set contains 250 records corresponding to the 250 trading days in 2006. Since the major trend for SES in 2006 was downward, there are only a few winners every day but a large number of losers. From the LOSER set, one example rule found is {UOL(0), SIA(1)} ⇒ DBS(2). That is, if UOL goes down and SIA goes down the following day, DBS will go down on the second day with confidence of more than 99%. Since the WINNER data set is small, we do not have rules with large support. However, after lowering the support threshold, we can find rules such as {HAISUNWT(0), KIMENGWT(0)} ⇒ HAISUNWT(1). The following table shows the performance of the proposed algorithm.
The necessity of having N-dimensional inter-transaction association rules is clear. Based on our study, the definition of such rules is lengthy.

CONCLUSION
We believe that the proposed n-dimensional inter-transaction association rules provide a uniform treatment of several association-related data mining problems. Furthermore, it seems highly promising to apply such notions in text mining, spatial data mining and multimedia data mining.