FREQUENT CORRELATED PERIODIC PATTERN MINING FOR LARGE VOLUME SET USING TIME SERIES DATA

Frequent pattern mining has been widely used to discover associations and correlations in real data sets. However, discovering interesting correlation relationships among a huge number of co-occurrence patterns is complicated, because a majority of them are superfluous or uninformative. Mining correlations from such a large pile of mostly useless information is nevertheless extraordinarily useful in real-time applications. In this study, we propose a technique that uses an FP-tree to mine frequent correlated periodic patterns from a transactional database. The analysis of the time correlation measure tends to improve performance on real-time data sets, and the results demonstrate the efficiency of the algorithm when the data sets are shifted across domains with respect to time series, correlation and noise-resilient ratio. This work addresses the time correlation factor obtained from the previously evaluated time series sequences of the FP-tree.


INTRODUCTION
The concept of frequent pattern mining is used extensively in the field of data mining. Association rule mining (Han and Pei, 2000), sequential pattern mining (Pei and Han, 2002) and graph pattern mining (Yan and Han, 2002) are a few of the common approaches. The real complication arises with real data sets: the challenge that has caught researchers' attention is to gather similar, useful patterns from a large volume of information (Hasan et al., 2007; Chen et al., 2008).
Large volumes of data are gathered with similar behavior at identical time intervals, and such series are difficult to interpret prior to analysis (Elfeky et al., 2005a). The aim is to identify repeating patterns that provide important observations and updated information about time series data (Weigend and Gershenfeld, 1994; Versaci, 2014) and that assist in decision making based on the results (Rasheed et al., 2011). A time series (Sheng et al., 2005a) is said to have three types of periodic pattern: (1) symbol periodicity, (2) sequence periodicity or partial periodic pattern and (3) segment or full-cycle periodicity (Rasheed et al., 2011). For example, consider a time series containing the hourly number of transactions in a retail store, where ranges of transaction counts are mapped to symbols (a process referred to as discretization): a: {0} transactions, b: {1-300} transactions, c: {301-600} transactions, d: {601-1200} transactions, e: {>1200} transactions. Based on this mapping, the time series T' = 0, 212, 535, 0, 398, 178, 0, 78, 0, 0, 102, 423 is discretized into T = abcacbabaabc. If at least one symbol is repeated periodically in a time series T, this is referred to as symbol periodicity. For example, in T = a bc a cb a ba a bc, symbol 'a' is periodic with periodicity p = 3, starting at position zero. A sequence periodic or partial periodic pattern consists of more than one symbol that may be periodic in a time series. For example, in T = ab cacb ab aabc, the sequence 'ab' recurs at positions 0 and 6, i.e., with p = 6 starting at position zero. A repetition of a pattern or segment across the whole time series is called segment or full-cycle periodicity. For example, T = abdc abdc abdc has segment periodicity of p = 4 starting at position zero. Many existing algorithms (Elfeky et al., 2005a; Han et al., 1998; Indyk et al., 2000) detect periods that span the entire time series. Some algorithms detect all three types of periodicity mentioned above, along with noise within a subsection of the time series, separately for each pattern (Rasheed et al., 2011).

Science Publications JCS
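The discretization and periodicity examples above can be checked with a short sketch. It is only an illustration under stated assumptions: the function names `discretize` and `periodic_hits` are hypothetical, and a pattern is counted as periodic when it is found at every position start, start + p, start + 2p, and so on. Applying the stated ranges to T' gives abcacbabaabc, because 535 falls in the 301-600 range (symbol c), and the segment abdc has length 4, so it is checked with p = 4.

```python
from bisect import bisect_left

# Upper bounds of the ranges for symbols a..d; anything above 1200 maps to 'e'.
UPPER_BOUNDS = [0, 300, 600, 1200]
SYMBOLS = "abcde"

def discretize(values):
    """Map each hourly transaction count to the symbol of its range."""
    return "".join(SYMBOLS[bisect_left(UPPER_BOUNDS, v)] for v in values)

def periodic_hits(series, pattern, period, start=0):
    """Return (hits, slots): how many of the positions start, start+period, ...
    actually hold `pattern`, and how many such positions exist."""
    positions = range(start, len(series) - len(pattern) + 1, period)
    hits = sum(series.startswith(pattern, pos) for pos in positions)
    return hits, len(positions)

counts = [0, 212, 535, 0, 398, 178, 0, 78, 0, 0, 102, 423]
T = discretize(counts)
print(T)                                         # abcacbabaabc
print(periodic_hits(T, "a", 3))                  # (4, 4): symbol periodicity
print(periodic_hits("abdcabdcabdc", "abdc", 4))  # (3, 3): segment periodicity
```

The ratio hits / slots is one simple way to express how strongly a candidate period is supported; whether the cited algorithms use exactly this ratio is not stated in the text.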
Although the traditional association periodic pattern mining problem is well defined and has been thoroughly studied in the last decade (Elfeky et al., 2005a; Rasheed et al., 2011), there is currently no canonical way to measure the degree of correlation between periodic patterns (Huang and Chang, 2005). We believe that there should intuitively be more than one way to define this new type of pattern, especially across different scenarios. Although the answer to whether a periodic pattern is correlated is not absolute, we at least expect it to match common knowledge. An appropriate measure of correlation should allow a long periodic pattern to correlate with its sub-patterns.
In this study, Frequent Correlated Periodic Pattern (FCPP) mining is applied to time series data. The process starts with a TRIE data structure, referred to as a consensus tree, that enables parallel pattern search along the tree search path. It is followed by establishing the positions of the periods and finally yields the time series and time correlated results.
This study addresses the following:
• The novelty of pattern mining using Frequent Correlated Periodic Patterns (FCPP), which addresses the issues of the frequent pattern tree path with respect to the time series and time correlated approach
• To assess the efficiency of the algorithm, periodicity detection is compared with CONV (Elfeky et al., 2005a), WARP (Elfeky et al., 2005b), ParPer and finally STNR (Rasheed et al., 2011). The results obtained show scalable performance in a single stretch
• To mine the item pairs of a particular node that represent a periodic pattern and determine the correlated relationship among the item pairs. We select measures appropriate to our mining task
• To demonstrate the outstanding performance of our algorithm based on correlated relationships, in terms of both efficiency and effectiveness, on datasets
The literature review is elaborated in section 2 and section 3 together with the initial groundwork; section 4 covers correlated time series data and its approach, section 5 the algorithm and section 6 its pros and cons.

RELATED WORKS
Time series query approaches and classification based on given query sequences are addressed in (Fu et al., 2005; Vlachos et al., 2002; Zhu and Shasha, 2003). In contrast, several existing algorithms require the user to provide a specific period (Han et al., 1998; Ma and Hellerstein, 2001; Yang et al., 2002; Berberidis et al., 2002; Chen et al., 2006) in order to obtain a result; the time series trend setup is discussed in (Udechukwu et al., 2004) and period ranges are addressed in (Elfeky et al., 2005a; Indyk et al., 2000; Rasheed et al., 2011). Noise in time series data is considered in (Elfeky et al., 2005a), although that approach fails to suppress it. WARP (Elfeky et al., 2005b) was introduced to detect segment periodicity and its sequence. Sheng et al. (2005b) propose an algorithm to detect periodicity in time series. A combination of these algorithms (Elfeky et al., 2005a) retrieves time series data along with periodicity based on its range. Subsections of a time series are addressed in (Rasheed et al., 2011) using the STNR algorithm. The algorithm proposed in (Mueen et al., 2010) handles time series followed by a time correlated approach. Extending the time series and correlation check of STNR to the entire pattern has also been proposed.
The study of correlation pattern mining has focused on two important aspects. The first is the significance of the patterns; more specifically, it is relevant to provide significance measures for the correlation of attribute sets and for the correlation patterns. The second is the computational cost of the proposed task. The FP-growth mining algorithm (Han and Kamber, 2006) offers better performance in mining null transactions for subsequent scanning of conditional databases. Omiecinski (2003) and Kim et al. (2004) find correlated patterns that satisfy a given minimum all-confidence. Liu et al. (1999) prune by discovering the time correlation using a contingency table. Independent and correlated patterns, used to extract exactly the time correlated data from time series data, are handled in (Zhou, 2008; Zhou et al., 2006). Mining periodicity, in comparison with transaction data, requires a unique identity. Nevertheless, these works operate by scanning every item pair in a particular node of the consensus tree.

For Mining Periodic Pattern
Suppose $\Sigma$ is a finite symbol set and $|\Sigma|$ its cardinality. Our previous work reflects the following (Pujeri and Karthik, 2012). For DNA, $|\Sigma|$ is 4; for proteins, $|\Sigma|$ is 20 and the symbols are the 20 amino acids. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a set of input time series sequences over a finite symbol set $\Sigma$ with $|\Sigma| = R$, such that $|s_i| = L$ for $1 \le i \le N$, and let $d$ and $q$ be integers such that $0 \le d \le L$ and $1 \le q \le N$. Here the parameters $N$ and $L$ are the number and the length of the given input sequences. A string $a$ is called a pattern (center string) if each of at least $q$ input sequences contains a substring in the $d$-neighborhood of $a$. The task is to find all center strings $t \in \Sigma^l$ of any length $l$, $0 \le d < l \le L$, such that at least $q$ sequences possess an $x$-mutated copy ($x \le d$) of $t$. In real-time settings, we have to investigate time series to identify repeating patterns along with their outliers. The proposed work concentrates on the data obtained from time series patterns, in order to explore further patterns along with correlated approaches and their outliers. These outliers in turn help to detect other patterns along with their sequence and periodicity. The output of this approach is the TRIE data structure (referred to as the consensus tree) that helps to explore the patterns obtained by the proposed approach.
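The center string condition can be illustrated with a brute-force check. This is a minimal sketch, assuming that a mutation is a single-symbol substitution, so that the d-neighborhood is measured by Hamming distance; the function names and the four input sequences are hypothetical and are not taken from the experiments.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))

def has_mutated_copy(sequence, t, d):
    """True if some length-|t| substring of `sequence` is within d mutations of t."""
    l = len(t)
    return any(hamming(sequence[k:k + l], t) <= d
               for k in range(len(sequence) - l + 1))

def is_center_string(sequences, t, d, q):
    """True if at least q input sequences contain a copy of t with <= d mutations."""
    support = sum(has_mutated_copy(s, t, d) for s in sequences)
    return support >= q

# Hypothetical input with N = 4 sequences of length L = 5.
S = ["abcab", "abdab", "bbcab", "ddddd"]
print(is_center_string(S, "abc", d=1, q=3))  # True: three sequences qualify
print(is_center_string(S, "abc", d=0, q=2))  # False: only one exact copy
```

The consensus tree avoids this exhaustive scan by extending candidate strings symbol by symbol and carrying the mutation count of each occurrence along with it.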

The level of confidence for acquiring further patterns is established in two ways: first, from the pattern position, its sequence and its periodicity (as mentioned earlier), which yield the initial pattern sequence to start with; and second, from the level of confidence obtained while exploring patterns. The level of the patterns and their periodicity form the initial point of access for exploring patterns of the same kind, as discussed in (Rasheed et al., 2011).

For Correlated Patterns
A correlation analysis takes the time series data as input and detects the time interval of the series.
Periodic mining assists in gathering similar patterns and provides a way to recognize them. To make the comparison, the signal is taken at different phases and compared with a shifted copy of itself (autocorrelation). The repeating periodic signals are then captured and analyzed.
If a pattern has exactly two items A and B, their correlation confidence ρ(AB) is given by Equation 1. The two items set the minimum and maximum of the pattern together with the pattern confidence level (threshold), and the correlation term indicates either that the two items occur in combination (dependent) or that they occur separately (independent). For a pattern with more than two items, X = {i_1, i_2, …, i_n}, the correlation confidence ρ(X) is given by Equation 2. In both Equations 1 and 2, ρ has two bounds, i.e., -1 ≤ ρ ≤ 1. Let δ be a given minimum correlated confidence. If pattern X has two items A and B and |ρ(AB)| > δ, then A and B are called correlated with each other; otherwise A and B are called independent (Karim et al., 2013). If pattern X has more than two items, we define a correlated pattern and an independent pattern as follows:

Definition 1
Correlated pattern: a pattern X is called a correlated pattern if there exists a dependent pattern Y such that Y ⊆ X and |ρ(AB)| > δ for the items A, B of Y, where δ is a predefined threshold on ρ.

Definition 2
Independent pattern: a pattern X is called an independent pattern if no such correlated pattern exists among its subsets.
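Let T = {i_1, i_2, …, i_m} be a set of m distinct literals called items and let D be the set of variable-length transactions over T. The interestingness measure all-confidence, denoted by α, of a pattern X can be defined in its standard form (Omiecinski, 2003) as follows (Equation 3):

$$\alpha(X) = \frac{\mathrm{sup}(X)}{\max\{\mathrm{sup}(i_j) \mid i_j \in X\}}$$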

Definition 3
Associated pattern: a pattern X is called an associated pattern if its confidence level (all-confidence) is greater than or equal to the given minimum value. Such patterns are the focus of association.

Definition 4
Associated-correlated pattern: an associated pattern in which there also exists an association between two of its subsets A and B.
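The two measures used in these definitions can be computed on a toy database as follows. The sketch rests on assumptions that should be read as such: the formula for the correlation confidence ρ is not reproduced in the text, so the code assumes the form used in the cited correlation-mining work (Zhou et al., 2006), namely the joint probability minus the product of the item probabilities, divided by their sum, which is consistent with the stated bounds -1 ≤ ρ ≤ 1. The all-confidence is taken as the support of the pattern divided by the largest single-item support. The transaction set D and all function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def correlation_confidence(transactions, items):
    """Assumed form: (P(X) - prod P(i)) / (P(X) + prod P(i)), always in [-1, 1]."""
    joint = support(transactions, items)
    independent = 1.0
    for item in items:
        independent *= support(transactions, [item])
    if joint + independent == 0:
        return 0.0
    return (joint - independent) / (joint + independent)

def all_confidence(transactions, items):
    """Support of the pattern divided by the largest single-item support in it."""
    largest = max(support(transactions, [i]) for i in items)
    return support(transactions, items) / largest if largest else 0.0

# Hypothetical transaction database with four transactions.
D = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
print(round(correlation_confidence(D, ["a", "b"]), 3))  # 0.143 (positively correlated)
print(round(correlation_confidence(D, ["a", "c"]), 3))  # -0.2 (negatively correlated)
print(all_confidence(D, ["a", "b"]))                    # 1.0
```

With a minimum correlated confidence of, say, δ = 0.1, both pairs would count as correlated under the assumed formula, since the test is on the absolute value of ρ.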

FP Tree Construct
The mapping of sub patterns along with the time series data and its path is denoted by 't' for the pattern time series and 's' for its sequence. It starts from the root node 'n' and is mapped to the sub pattern string by entries (j,k,e), with pointers starting from the kth position of sequence j, and provides the results in terms of its pattern. The tree is fully balanced for nodes that have balanced pointers connecting their descendants. The backward closure property (Karthik and Pujeri, 2013) enables pre-pruning in connection with the constraint levels for every pointer linked with the sub pattern. Frequent Correlated Periodic Pattern (FCPP) mining is categorized by constraints that are anti-monotone, monotonic and succinct, which is also addressed in our previous work (Karthik and Pujeri, 2013). Nodes with confidence value conf(b) = (N - sup(b))/(N - q) < 1 are pruned; this is used as an anti-monotonic constraint. A node in the consensus tree branches out only if its support value is ≤ q, which is used as a monotonic constraint (Lee and Raedt, 2004). Each pointer in a consensus node has to satisfy the degree of mutation e ≤ d, otherwise it is also pruned; this holds for all positions in the consensus node, like a succinct constraint (Lee and Raedt, 2004). Pointers with mutation level e > d do not participate in the production of pointers in the next consensus node.

1. For each string j of the given input sequences N do
2. For each symbol k of input string j of length L do
3. If the kth symbol of the ith sequence is b_1 ∈ Σ do
4. Put (j,k) in new node b_1 S; find whether the (j,k) substring is in all b_1 S for b_1 ≠ b_2 and j in b_1 T, for each b_1 ∈ Σ, if and only if sup(b_i) > threshold.
5. For each ith sequence from 1 to [...] do
6. For each entry (j,k,e) in each node, if e < d then for all b_{i+1} ≠ b_{i+1}
15. Put (j,k,e+1) in [...] if and only if conf [...]
End Begin 4;
17. For each node [...]

Periodicity Detection Algorithm
The consensus tree provides sufficient data for identifying the periodicity of a time series database. Linear distance is used to estimate the distance between two sub patterns, which yields a distance vector that is represented in matrix form. Fig. 1 shows the resulting distance vector with the starting and ending positions and also estimates the possible repetitions of a sub pattern occurrence with respect to the consensus tree structure. It also maximizes the occurrence frequency based on the frequency count. The starting position and ending position of the sub patterns, along with their occurrence frequency, are identified by the FCPP algorithm. As a result, the algorithm takes three parameters into consideration: starting position, ending position and frequency.

Algorithm 2. Difference Matrix (Diffmatrix) Algorithm
Input: Starting position pointer for time series data with its position
Output: Difference vector of A
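Since the steps of the algorithm are not listed, the following is only a sketch of how a difference matrix over occurrence positions could be built and used. It assumes that the input is the list of starting positions of one sub pattern, that the matrix holds all pairwise position differences, and that candidate periods are ranked by how often a gap between consecutive occurrences repeats. The names `diff_matrix` and `candidate_periods` are hypothetical.

```python
from collections import Counter

def diff_matrix(positions):
    """Pairwise differences between sorted start positions (upper triangle only)."""
    positions = sorted(positions)
    n = len(positions)
    return [[positions[j] - positions[i] if j > i else 0 for j in range(n)]
            for i in range(n)]

def candidate_periods(positions):
    """Count how often each gap between consecutive occurrences appears."""
    positions = sorted(positions)
    return Counter(b - a for a, b in zip(positions, positions[1:]))

# Start positions of symbol 'a' in the discretized series abcacbabaabc.
starts = [0, 3, 6, 8, 9]
for row in diff_matrix(starts):
    print(row)
print(candidate_periods(starts).most_common(1))  # [(3, 2)]
```

In this example the gap 3 occurs most often, which agrees with the symbol periodicity p = 3 of 'a' discussed in the introduction; the occurrence at position 8 acts as noise.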

Finding Correlation in Periodic Patterns
We use the Discrete Fourier Transform (DFT) to identify correlated item pairs in a consensus node. The DFT of an item $x = x_0, x_1, \ldots, x_{m-1}$ is a sequence $X = X_0, X_1, \ldots, X_{m-1} = \mathrm{DFT}(x)$ of complex numbers given by

$$X_f = \frac{1}{\sqrt{m}} \sum_{k=0}^{m-1} x_k \, e^{-2\pi j f k / m}, \qquad f = 0, 1, \ldots, m-1,$$

where $j = \sqrt{-1}$. We also define the normalization of $x$ as $x' = x'_0, x'_1, \ldots, x'_{m-1}$ such that $x'_k = (x_k - \mu_x)/\sigma_x$, where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the values $x_0, x_1, \ldots, x_{m-1}$. The correlation coefficient of two items $x$ and $y$ can be reduced to the Euclidean distance between their normalized series:

$$\mathrm{corr}(x, y) = 1 - \frac{1}{2m}\, d^2(x', y'),$$

where $d(x', y')$ is the Euclidean distance between $x'$ and $y'$. By reducing the correlation coefficient to Euclidean distance, we can apply the technique of (Zhu and Shasha, 2002) to report the item pairs in a consensus node whose correlation is higher than a specific threshold. Item pairs for which $d_k(X, Y) > \sqrt{2m(1 - T)}$ can be ignored, where $X$ and $Y$ are the DFTs of the normalized series and $d_k$ is the Euclidean distance over their first $k$ coefficients, since such pairs cannot have correlation above a given threshold $T$. By ignoring such pairs, we get a set of likely correlated signal pairs. Conceptually, the algorithm produces a matrix like the one shown in Fig. 2, where all pairs with correlation above the threshold and some pairs with correlation below the threshold are marked as 1 and all other pairs are marked as 0. We call this a pruning matrix P and use it in subsequent steps.
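The reduction of correlation to Euclidean distance, and the pruning test on a few DFT coefficients, can be sketched in plain Python. The sketch assumes an orthonormal DFT (scaled by one over the square root of m), normalization by the population standard deviation, and non-constant series; under these assumptions the distance over the first k coefficients never exceeds the full distance, so a pair can be discarded safely when that partial distance is already larger than the square root of 2m(1 - T). All function names are hypothetical.

```python
import cmath
import math

def normalize(x):
    """Subtract the mean and divide by the population standard deviation."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / m)
    return [(v - mu) / sigma for v in x]

def dft(x):
    """Orthonormal DFT: X_f = (1/sqrt(m)) * sum_k x_k * exp(-2*pi*j*f*k/m)."""
    m = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / m) for k in range(m))
            / math.sqrt(m) for f in range(m)]

def distance(u, v):
    """Euclidean distance between two equal-length real or complex sequences."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

def correlation(x, y):
    """Correlation coefficient obtained from the distance of the normalized series."""
    return 1 - distance(normalize(x), normalize(y)) ** 2 / (2 * len(x))

def can_prune(x, y, k, threshold):
    """True if the first k DFT coefficients already rule out corr >= threshold."""
    X, Y = dft(normalize(x)), dft(normalize(y))
    return distance(X[:k], Y[:k]) > math.sqrt(2 * len(x) * (1 - threshold))

print(round(correlation([1, 2, 3, 4], [1, 3, 2, 4]), 6))  # 0.8
print(can_prune([1, 2, 3, 4], [4, 3, 2, 1], k=2, threshold=0.9))  # True
print(can_prune([1, 2, 3, 4], [2, 4, 6, 8], k=2, threshold=0.9))  # False
```

A pair that is not pruned is only a candidate: its exact correlation still has to be computed, which is why the pruning matrix may contain some pairs below the threshold.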
In our technique, the pattern occurrences of the items in a node are partitioned based on the capacity of the cache. If the cache cannot hold all instances of the node, the instances need to be partitioned. Computing the correlation between signals in different batches, however, incurs additional cost. Hence we chose an existing algorithm, the F-M partitioning algorithm, to partition the instances of a node into parts of equal size.
Consider a discretized sequence {a b a b a} for an interval range of 2. The 2×5 matrix M produced for the input sequence is

$$M = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{bmatrix}$$

In this matrix, the first row represents symbol 'a' and the second row represents symbol 'b'. Applying autocorrelation to each of the rows separately produces the correlated output R. In R, every non-zero element represents the total number of occurrences of the symbol starting from that position. In particular, the first element represents the total number of occurrences of the symbol: in this example, the 3 in the first row is the total number of occurrences of symbol 'a' and the 2 in the second row is the total number of occurrences of symbol 'b'. The index positions of the non-zero elements are derived from the matrix, and from those index positions the perfect and imperfect periodic rates are computed. In this example, symbols a and b occur with a perfect periodic rate of 1. Every non-zero element of a row is then autocorrelated with the adjacent element of every other row until a zero value or the end of the series is reached. The formula used is as follows:
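As an illustration of the indicator matrix and the row-wise autocorrelation described in this paragraph (not of the cross-row formula, which is not available here), the following sketch builds M for the sequence a b a b a and autocorrelates each row. It assumes linear, non-circular autocorrelation, which reproduces the leading values 3 and 2 mentioned in the text; the function names are hypothetical.

```python
def indicator_matrix(sequence, alphabet):
    """One row per symbol: 1 where the symbol occurs in the sequence, else 0."""
    return [[1 if s == symbol else 0 for s in sequence] for symbol in alphabet]

def autocorrelate(row):
    """Linear autocorrelation: entry `lag` counts pairs of 1s that are `lag` apart."""
    n = len(row)
    return [sum(row[i] * row[i + lag] for i in range(n - lag)) for lag in range(n)]

M = indicator_matrix("ababa", "ab")
R = [autocorrelate(row) for row in M]
print(M)  # [[1, 0, 1, 0, 1], [0, 1, 0, 1, 0]]
print(R)  # [[3, 0, 2, 0, 1], [2, 0, 1, 0, 0]]
```

The non-zero entries of each row of R sit at lags 0, 2 and 4 (or 0 and 2), so both symbols repeat every two positions without exception in this short sequence.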

Experimental Evaluations
We tested our algorithm, building on our previous work that gathers information using a time series approach (Pujeri and Karthik, 2012), on a number of data sets. For real data experiments, we used supermarket data containing sanitized records of timed sales transactions for Wal-Mart stores over a period of 15 months. Synthetic data taken from the Machine Learning Repository (Blake and Merz, 1998) were also used. We tested FCPP on both synthetic and real data. The algorithm finds all periodic patterns (100%) along with their correlation coefficients. This is an important feature of using the FP-tree, which guarantees that all repeating patterns are identified.
To test the accuracy, we ran the algorithm for various period sizes, distributions and time series lengths. The synthetic data obtained from the Machine Learning Repository (Blake and Merz, 1998) were generated in the same way as in (Elfeky et al., 2005a). Figure 5a shows the behavior of the algorithm against the number of time points in the time series. Figure 5b shows that the algorithm scales linearly with symbol sets |Σ| of different sizes. FCPP checks the periodicity for all periods within the synthetic data in the absence of noise.
For real data experiments, we used the Wal-Mart data, which contains hourly records of all transactions performed at a supermarket. The data cover around 15 months, with an expected period value of 24. We ran the FCPP algorithm with periodicity threshold values ranging from 0.8 down to 0.4 and observed the number of periods captured by the algorithm, the StPos and EndPos of the sequence, the confidence value and the pattern, as shown in Table 1. The expected period 24 is captured at the threshold value 0.8. The periodic patterns obtained are fewer in number but accurate, useful and meaningful. Table 1 demonstrates how periodic patterns are obtained without redundant periods. The FCPP algorithm does not produce duplicated periods, owing to the presence of the super-pattern (Karthik and Pujeri, 2013), which holds the information of the patterns gathered using the Diff-matrix.
The result of the time correlation, based on the Wal-Mart data analysis, is shown in Table 1. As the data grow enormously, the efficiency of time series analysis using frequent pattern mining deviates, as shown in Fig. 3 and 4, based on time factors. Duplicates also arise as the data and its volume change. To address this issue, the time correlation factor is introduced using the Pearson correlation coefficient, which is defined as (Equation 4):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

To identify the exact factors and pairs used in the correlation factor, x denotes the data received from Table 1 and y denotes its sequence (Equation 5). The normalization is estimated from the scores of x and y, which results in the correlation value as per Equation 5.
The correlation sequence is estimated so as to avoid repeated and duplicated indices of the factor x; y is subsequently taken as the sequence factor for further data analysis.
The results are given in Table 1 in the time correlated column. This factor affects the threshold value based on the time series and time correlated values.
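The Pearson coefficient used for the time correlation factor can be computed directly or from normalized scores. The sketch below shows both forms and is illustrative only: it assumes that the normalization mentioned above refers to standard (z) scores with the population standard deviation, and the two short series are hypothetical values, not entries of Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def z_scores(values):
    """Standard scores using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def pearson_from_scores(x, y):
    """Same coefficient, written as the mean product of the z-scores."""
    return sum(a * b for a, b in zip(z_scores(x), z_scores(y))) / len(x)

x = [1, 2, 3, 4, 5]   # hypothetical factor values
y = [2, 4, 5, 4, 5]   # hypothetical sequence values
print(round(pearson(x, y), 4))              # 0.7746
print(round(pearson_from_scores(x, y), 4))  # 0.7746
```

Both forms require non-constant series, since a zero standard deviation would make the coefficient undefined.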
The frequent constraint algorithm does not allow duplicate entries, based on the data registered in the Diff-Matrix. The impact of the FCPP and ParPer algorithms over time series is shown in Fig. 3 and 4. Figure 5 shows the performance analysis of time series behavior over a period of time. The performance of FCPP and ParPer was compared for data sizes between 1 and 10 lakh (100,000 to 1,000,000). FCPP shows better performance than WARP and STNR and shows less impact than CONV in terms of runtime. FCPP maximizes its performance over time as the data and period size grow steadily, whereas this combination affects the performance of ParPer and WARP (Elfeky et al., 2005b) in terms of data size.

CONCLUSION
In this work, the proposed frequent pattern growth algorithm mines large databases and thereby addresses the issues of the time series and time correlated approach. The threshold achieved with the time correlated approach proved to be more effective than the value achieved using time series alone. The work also addresses the need to combine the time series and time correlated approaches to provide a better threshold value with an efficient, feasible result on large datasets.