A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW)

Abstract: In this research, we propose a fast pattern matching algorithm, the Two Sliding Windows (TSW) algorithm. The algorithm uses two sliding windows, each of size equal to the pattern length. Both windows slide in parallel over the text until the first occurrence of the pattern is found or until both windows reach the middle of the text. The experimental results show that the TSW algorithm is superior to the other algorithms, especially when the pattern occurs at the end of the text.


INTRODUCTION
Pattern matching is a pivotal theme in computer research because of its relevance to various applications such as web search engines, computational biology, virus scan software, network security and text processing [1][2][3][4].
Pattern matching focuses on finding the occurrences of a particular pattern P of length m in a text T of length n. Both the pattern and the text are built over a finite alphabet Σ of size σ.
Generally, pattern matching algorithms make use of a single window whose size is equal to the pattern length [5]. The searching process starts by aligning the pattern with the left end of the text, and then the corresponding characters of the pattern and the text are compared. Character comparisons continue until a whole match is found or a mismatch occurs; in either case the window is shifted to the right by a certain distance [6][7][8][9][10][11][12]. The shift value, the direction of the sliding window and the order in which comparisons are made vary among pattern matching algorithms.
Some pattern matching algorithms concentrate on the pattern itself [5]. Other algorithms compare the corresponding characters of the pattern and the text from left to right [6]. Others perform character comparisons from right to left [8,11]. The performance of the algorithms can be enhanced when comparisons are done in a specific order [9,13]. In some algorithms, such as the Brute Force and Horspool algorithms, the order of comparisons is irrelevant [7].
In this study, we propose a new pattern matching algorithm, the Two Sliding Windows (TSW) algorithm. The algorithm concentrates on both the pattern and the text. It makes use of two windows, each of size equal to the size of the pattern. The first window is aligned with the left end of the text, while the second window is aligned with the right end of the text. Both windows slide at the same time (in parallel) over the text in the searching phase to locate the pattern. The windows slide towards each other until the first occurrence of the pattern from either side of the text is found or they reach the middle of the text. If required, all the occurrences of the pattern in the text can be found.
Related works: Several pattern matching algorithms have been developed with a view to enhancing the searching process by minimizing the number of comparisons performed [14][15][16]. To reduce the number of comparisons, the matching process is usually divided into two phases: the pre-processing phase and the searching phase. The pre-processing phase determines the distance (shift value) that the pattern window will move. The searching phase uses this shift value while searching for the pattern in the text with as few character comparisons as possible.
In the Brute Force (BF) algorithm, no pre-processing phase is performed. It compares the pattern with the text from left to right and, after each attempt, shifts the pattern by exactly one position to the right. The time complexity of the searching phase is O(mn) in the worst case, and the expected number of text character comparisons is 2n.
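The BF procedure described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments; the function name `brute_force_search` and the comparison counter are assumptions introduced here, and indexes are 0-based.

```python
def brute_force_search(text, pattern):
    """Return (position of first occurrence or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every window position from left to right, shifting by exactly one.
    for start in range(n - m + 1):
        j = 0
        # Compare the pattern with the window from left to right.
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:  # all m characters matched
            return start, comparisons
    return -1, comparisons


position, comparisons = brute_force_search("abcabd", "abd")
print(position, comparisons)
```

Counting comparisons alongside the match position mirrors the two quantities (comparisons and attempts) that the evaluation later reports.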
Many algorithms, such as Boyer-Moore (BM) [11,17] and Knuth-Morris-Pratt (KMP) [6,18], reduce the number of comparisons by moving the pattern more than one position at a time. The KMP algorithm compares the pattern with the text from left to right. If a mismatch occurs, it uses the failure function f(j), which indicates the proper shift of the pattern [6,18]. The failure function f(j) is defined as the length of the longest prefix of P that is also a proper suffix of P[1..j]. Thus, KMP reduces the number of times it compares each character in P with a character in the text T. KMP performs at most 2n text character comparisons, and the complexity of the pre-processing phase is O(m). KMP achieves a running time of O(n+m), which is optimal in the worst case [6].
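As an illustration of the failure function, the following Python sketch computes f for every prefix of a pattern. It follows the standard textbook construction; the function name `failure_function` is an assumption introduced here, and the list is 0-based, so entry j corresponds to the prefix of length j+1.

```python
def failure_function(pattern):
    """f[j] = length of the longest prefix of pattern that is also
    a proper suffix of pattern[0..j] (0-based indexing)."""
    m = len(pattern)
    f = [0] * m
    k = 0  # length of the currently matched prefix
    for j in range(1, m):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[j] != pattern[k]:
            k = f[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        f[j] = k
    return f


print(failure_function("aabaaab"))
```

Each character of the pattern is visited once and k can only decrease as often as it has increased, which is where the O(m) pre-processing bound comes from.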
The BM algorithm improves the performance by preprocessing the pattern using two shift functions: the bad-character shift and the good-suffix shift. During the searching phase, the pattern is aligned with the text and scanned from right to left. If a mismatch occurs, the BM algorithm shifts the pattern by the maximum of the two shift values. The worst case time complexity when searching all occurrences of the pattern is O(mn), and the best case is O(n/m) [11,17].
A simplification of the BM algorithm is the Horspool algorithm [7]. It does not use the good-suffix function; instead, it uses the bad-character shift of the rightmost character of the window. Its pre-processing time complexity is O(m+σ) and its searching time complexity is O(mn) [7].
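A minimal Python sketch of the Horspool idea is given below, assuming the usual formulation in which the shift depends only on the text character aligned with the last pattern position. The function name `horspool_search` and the dictionary-based shift table are assumptions introduced for illustration.

```python
def horspool_search(text, pattern):
    """Return the 0-based position of the first occurrence of pattern, or -1."""
    n, m = len(text), len(pattern)
    # Bad-character table: distance from the rightmost occurrence of each
    # character (excluding the last pattern position) to the pattern end.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift according to the text character under the last window position;
        # characters not in the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

Storing the table as a dictionary keeps only the characters that occur in the pattern, with the default shift m standing in for every other character of the alphabet.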
The Berry-Ravindran (BR) algorithm calculates the shift value based on the bad character shift for the two consecutive text characters immediately to the right of the window. This reduces the number of comparisons in the searching phase. The pre-processing and searching time complexities of the BR algorithm are O(σ²) and O(nm) respectively [7]. In this research, the proposed algorithm makes use of the pre-processing phase of the BR algorithm.
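The BR shift for a pair of text characters (a, b) located immediately to the right of the window can be sketched as below. This follows the commonly published form of the BR bad character table and is only an illustration under that assumption; the shift functions actually used by TSW may differ in detail. The function name `br_shift` is hypothetical and indexes are 0-based.

```python
def br_shift(pattern, a, b):
    """Shift for the two text characters a, b that follow the window,
    in the commonly published Berry-Ravindran formulation (assumed here)."""
    m = len(pattern)
    # a aligns with the last pattern character after a shift of one.
    if a == pattern[-1]:
        return 1
    # Rightmost occurrence of the pair ab inside the pattern gives m - i.
    for i in range(m - 2, -1, -1):
        if pattern[i] == a and pattern[i + 1] == b:
            return m - i
    # b aligns with the first pattern character.
    if b == pattern[0]:
        return m + 1
    # Neither character can take part in a match: skip past both.
    return m + 2


# For an 8-character pattern the shift ranges from 1 up to m + 2 = 10.
print(br_shift("GAATCAAT", "A", "C"))
```

The cases are tested in order of increasing shift, so the function returns the minimum of all applicable shifts and no occurrence of the pattern is skipped.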
The Two Sliding Windows (TSW) algorithm: The Two Sliding Windows (TSW) algorithm scans the text from both sides simultaneously. It uses two sliding windows, each of size m, the same as the pattern length. The text is divided into two parts, the left and the right, each of size ⌈n/2⌉. The left part is scanned from left to right using the left window, and the right part is scanned from right to left using the right window. Both windows slide in parallel, which makes the TSW algorithm suitable for parallel processor architectures. The TSW algorithm stops when one of the two sliding windows finds the pattern or when the pattern is not found within the text at all. The TSW algorithm finds either the first occurrence of the pattern in the text through the left window or the last occurrence of the pattern through the right window. If necessary, the algorithm can easily be modified to find all the occurrences of the pattern. If the pattern lies exactly in the middle of the text, TSW also finds it.
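The two-window scanning scheme can be illustrated with the deliberately simplified Python sketch below. It is not the TSW algorithm itself: each window is shifted by one position per attempt instead of using pre-computed shift values, and the two windows are advanced alternately in a single loop rather than on parallel processors. The function name `two_window_search` is an assumption introduced here.

```python
def two_window_search(text, pattern):
    """Simplified two-window scan: returns (0-based position, window name)
    for the first hit from either side, or (-1, None) if there is none."""
    n, m = len(text), len(pattern)
    left, right = 0, n - m  # start positions of the left and right windows
    # The windows move towards each other until they cross in the middle.
    while left <= right:
        if text[left:left + m] == pattern:
            return left, "left"
        if text[right:right + m] == pattern:
            return right, "right"
        left += 1   # simplified shift of one to the right
        right -= 1  # simplified shift of one to the left
    return -1, None


print(two_window_search("aaaaaaaaGAT", "GAT"))
```

Even in this simplified form, a pattern at the end of the text is reported by the right window in the first attempt, which is the situation where a two-sided scan benefits most.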
The TSW algorithm utilizes the idea of the BR bad character shift function [8] to get better shift values during the searching phase. The BR algorithm provides a maximum shift value in most cases without losing any characters. The main differences between the TSW algorithm and the BR algorithm are:
• TSW uses two sliding windows rather than one sliding window to scan all text characters as in the BR algorithm.
• TSW uses two one-dimensional arrays, each of size (m-1), to store the calculated shift values for the two sliding windows. The shift values are calculated only for the pattern characters, whereas the original BR algorithm uses a two-dimensional array to store the shift values for all characters of the alphabet [8]. Using one-dimensional arrays reduces the search processing time and, at the same time, the memory needed to store the shift values.
Pre-processing phase: The pre-processing phase generates two one-dimensional arrays, nextl and nextr. The values of the nextl array are calculated according to the Berry-Ravindran (BR) bad character algorithm. nextl contains the shift values needed to search the text from the left side. To calculate the shift values, the algorithm considers two consecutive text characters a and b which are aligned immediately after the sliding window. Initially, the indexes of the two consecutive characters in the text string from the left are (m+1) and (m+2) for a and b respectively, as in Eq. 1.
On the other hand, the values of the nextr array are calculated according to our proposed shift function. nextr contains the shift values needed to search the text from the right side. Initially, the indexes of the two consecutive characters in the text string from the right are determined as in Eq. 2. The two arrays remain unchanged during the searching process. Figure 1 shows the steps of the pre-processing algorithm.
Searching phase: In this phase, the text string is scanned from two directions, from left to right and from right to left. In mismatch cases, during the searching process from the left, the left window is shifted to the right, while during the searching process from the right, the right window is shifted to the left. Both windows are shifted until the pattern is found or the windows reach the middle of the text. Figure 2 explains the steps of the TSW algorithm.
Step 1: Compare the characters of the two sliding windows with the corresponding text characters from both sides. If a mismatch occurs during the comparison, the algorithm goes to Step 2; otherwise the comparison continues until a complete match is found, in which case the algorithm stops and displays the corresponding position of the pattern in the text string. If we search for all the pattern occurrences in the text string, the algorithm continues to Step 2.
Step 2: In this step, we use the shift values from the next arrays, depending on the two text characters placed immediately after the pattern window. The two characters are placed to the right of the left window and to the left of the right window. The corresponding windows are shifted to the correct positions based on the shift values: the left window is shifted to the right and the right window is shifted to the left.
If the first occurrence of the pattern exists in the middle of the text, the TSW algorithm in Fig. 2 continues comparing pattern characters with text characters through the inner loops before it terminates the searching process through the outer loop.
Working example: In this section, we present an example to clarify the TSW algorithm. Part of the nucleotide sequence of a gene (only 47 nucleotides) from Chromosome I (CHR-I) is used to test the algorithm [10]; this sequence is taken from gene index 32854-32901 [10]. The plant genome (Arabidopsis thaliana) consists of 27,242 gene sequences distributed over five chromosomes (CHR-I to CHR-V).
Pre-processing phase: Initially, shiftl = shiftr = m+2 = 10.
The shift values are stored in two arrays nextl and nextr as shown in Fig. 3a and 3b respectively.
To build the two next arrays (nextl and nextr), we take each pair of consecutive characters of the pattern and give it an index starting from 0. For example, for the pattern GAATCAAT, the consecutive pairs GA, AA, AT, TC, CA, AA and AT are given the indexes 0, 1, 2, 3, 4, 5 and 6 respectively.
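The indexing of consecutive character pairs can be reproduced with a few lines of Python. The sketch below only builds the pair-to-index lookup; it does not compute the shift values stored in nextl and nextr. The function name `pair_index_table` is an assumption introduced for illustration.

```python
def pair_index_table(pattern):
    """Map each pair of consecutive pattern characters to the list of
    0-based indexes at which it occurs (m - 1 pairs in total)."""
    table = {}
    for i in range(len(pattern) - 1):
        table.setdefault(pattern[i:i + 2], []).append(i)
    return table


table = pair_index_table("GAATCAAT")
print(table)
# Pairs that repeat in the pattern, such as AA and AT, map to two indexes.
```

A pair that is absent from the table, such as AC, does not occur in the pattern at all.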
The shift values for the nextl array are calculated according to Eq. 1 while the shift values for the nextr array are calculated according to Eq. 2.
Searching phase: The searching process for the pattern p is illustrated through the working example as shown in Fig. 4.

First attempt: In the first attempt (Fig. 4a), we align the first sliding window with the text from the left. A mismatch occurs between text character (A) and pattern character (G); therefore, we take the two consecutive characters from the text at indexes 8 and 9, which are (T and C) respectively. To determine the amount of shift (shiftl), we do the following two steps:
• We find the index of TC in the pattern, which is 3.
• Since we search from the left side, we use the nextl array: shiftl = nextl[3] = 5.
Therefore, the window is shifted 5 steps to the right.
Second attempt: In the second attempt (Fig. 4b), we align the second sliding window with the text from the right. A mismatch occurs between text character (A) and pattern character (T); therefore, we take the two consecutive characters from the text at indexes 37 and 38, which are (A and A) respectively. To determine the amount of shift (shiftr), we do the following two steps:
• We find the index of AA in the pattern; AA has two indexes, 1 and 5.
• Since we search from the right side, we use the nextr array for the two indexes: nextr[1] = 3 and nextr[5] = 7. We then choose the minimum value to determine shiftr, so shiftr = nextr[1] = 3.
Therefore, the window is shifted 3 steps to the left.
Third attempt: In the third attempt (Fig. 4c), a mismatch occurs from the left between text character (A) and pattern character (G); therefore, we take the two consecutive characters from the text at indexes 13 and 14, which are (A and C) respectively. Since AC is not found in the pattern, the window is shifted 10 steps to the right.
Fourth attempt: In the fourth attempt (Fig. 4d), a mismatch occurs from the right between text character (A) and pattern character (T); therefore, we take the two consecutive characters from the text at indexes 34 and 35, which are (A and T) respectively. To determine the amount of shift (shiftr), we do the following two steps:
• We find the index of AT in the pattern; AT has two indexes, 2 and 6.
• Since we search from the right side, we use the nextr array for the two indexes.
After the shift, a comparison between the pattern and the text characters leads to a complete match at index 32. In this case, the occurrence of the pattern is found using the right window.

Fig. 4: Working Example
Analysis: Proposition 1: The space complexity is O(2(m-1)), where m is the pattern length.

Lemma 1: The worst case time complexity is O((n/2 - m + 1)m).
Proof: The worst case occurs when at each attempt, all the compared characters of both the pattern and the text are matched except the last character and at the same time the shift value is equal to 1. If the pattern is aligned from the left then shift by one occurs when the first character of the two consecutive characters is matched with the last pattern character, while if the pattern is aligned from the right then shift by one occurs when the second character of the two consecutive characters is matched with the first pattern character.

Lemma 2: The best case time complexity is O(m).
Proof: The best case occurs when the pattern is found at the first index or at the last index (n-m).

Lemma 3: The average case time complexity is O(n/(2(m+2))).
Proof: The average case occurs when the two consecutive characters of the text directly following the sliding window are not found in the pattern. In this case, the shift value is (m+2), and hence the time complexity is O(n/(2(m+2))).
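As a worked illustration of this bound, consider each window covering half of the text (n/2 characters) and advancing m+2 positions per attempt. The values n = 1000 and m = 8 below are illustrative assumptions, not figures from the experiments, and rounding up to a whole number of attempts is also an assumption.

```latex
\[
\text{attempts per window}
\approx \left\lceil \frac{n/2}{m+2} \right\rceil
= \left\lceil \frac{n}{2(m+2)} \right\rceil
= \left\lceil \frac{1000}{2(8+2)} \right\rceil
= \left\lceil \frac{1000}{20} \right\rceil
= 50
\]
```

Under the same assumptions, a single window covering the whole text with the same shift would need about 1000/10 = 100 attempts.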
In Table 1, the length of the pattern is given in column one, while the second column gives the number of words selected for each pattern length from Book1. As shown in Table 1, the BR algorithm searches the text only from the left side, so its average numbers of comparisons and attempts are higher than those of our algorithm. Table 2 shows the results of comparing the TSW algorithm with the other algorithms. The TSW algorithm has the lowest average number of comparisons and attempts of all the algorithms. This can be justified by two advantages of the TSW algorithm. First, it searches the text from both sides simultaneously, while all the other algorithms search the text from one side. Second, the BR shift function shifts the pattern by a value that ranges from 1 up to m+2 positions from both sides when a mismatch occurs. This has a positive effect on the number of comparisons and attempts in most cases.
On the other hand, BF and KMP have the largest number of comparisons. The performance of the TSW algorithm can be observed in Table 5, where patterns of different lengths are selected from the end of Book1; the TSW algorithm finds them with minimum effort using the right-to-left window. Tables 6-8 show the average number of comparisons and attempts needed to search for the first, middle and last appearance of 100 words selected from Book1. The results of taking 100 words are similar to those of taking a single word with different lengths. As shown in Table 8, the TSW algorithm performs best when we search for words selected from the end of Book1. In case of a complete mismatch, as in Table 9, the average number of comparisons and attempts of the TSW algorithm is the minimum, because the shift value in most cases reaches m+2.
Table 7: The average number of attempts and comparisons performed to search for (100) patterns selected from the middle of the text Book1.

CONCLUSION
In this research, we presented a fast pattern matching algorithm, the Two Sliding Windows (TSW) algorithm, which makes use of two sliding windows. It employs the main idea of BR by maximizing the shift value, but uses two sliding windows rather than one sliding window to scan all text characters as in the BR algorithm. TSW uses two one-dimensional arrays, each of size (m-1), to store the calculated shift values for the two sliding windows, while the original BR algorithm uses a two-dimensional array.
We evaluated the performance of TSW using a text string and various sets of patterns. In the pre-processing phase, we reduced the memory required by using one-dimensional arrays for the pattern characters only. Searching the text from both sides simultaneously gives the TSW algorithm an advantage over other algorithms in the number of comparisons and attempts, especially if the pattern searched for occurs at the end of the text. In future research, we intend to implement the TSW algorithm on real parallel processors to minimize the number of comparisons and attempts. We also intend to apply the idea of the two sliding windows to other algorithms such as KMP and BM.