Efficient Processing for Binary Submatrix Matching

Abstract: The heavy demand for large volumes of digital data has increased the interest in matrix-like representations. Matrices are well-organized data structures that are suitable for storing uniform data in order to simplify data access and manipulation. For several applications, it is critical to search efficiently for a specific pattern in matrix structures. A pattern can be represented as an n-dimensional matrix which can be searched for within other, larger n-dimensional matrices. This query will be referred to as matrix submatching. In this paper, we present and compare two algorithms for binary matrix submatching on the basis of time requirement. The first algorithm is a naive brute-force approach with O(n^2 m^2) time requirement. The second approach is based on a chain code transformation which reduces the sizes of the matrices, resulting in a lower time requirement.


INTRODUCTION
The importance of matrices comes from their wide range of applications in areas such as image processing, geographic information systems, speech recognition, document classification, and bioengineering [1,6,12]. Operations on matrices are at the heart of scientific computing, and efficient algorithms for working with matrices are therefore of considerable practical interest. Matrix operations such as multiplication have received much research attention [2,3,5]. In 1992, Shen and Hu studied a new kind of relationship between matrices, namely approximate submatrix matching (ASM): given two n×m matrices A and B, find a k×l submatrix in A and another k×l submatrix in B such that their difference is minimized under a certain measure function. They discussed the ASM problem under two typical measure functions, namely convolution and Euclidean distance [10]. In 2006, Koyutürk and Grama built a software system, called PROXIMUS, for error-bounded approximation of high-dimensional binary attributed datasets based on non-orthogonal decomposition of binary matrices. This tool can be used for analyzing data arising in a variety of domains ranging from commercial to scientific applications. Using a combination of innovative algorithms, novel data structures, and efficient implementation, PROXIMUS demonstrated good accuracy, performance, and scalability to large datasets. The technique was tested on diverse applications in association rule mining and DNA microarray analysis [8].
The matrix containment or submatching problem has received almost no attention in the literature. We believe that the matrix submatching problem is important and deserves attention from researchers because of the many applications that may require such functionality. This article focuses on defining and solving the exact binary submatching problem, and is intended to pave the way for future research on non-exact general matrix submatching. The following definition formally presents the MSM (matrix submatching) function, which accepts two matrices A and B and returns the set of (i, j) locations in matrix A where matrix B appears completely in A starting at row i and column j of matrix A. Matrix B may appear zero or more times in A.
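Read with the square-matrix setting used in the next section (A is n×n, B is m×m, m ≤ n) and 1-based indices, the description above can be written as the following set. This is a sketch consistent with the prose rather than a quotation of the original definition, and the auxiliary indices p and q are introduced here only for illustration.

```latex
\mathrm{MSM}(A, B) = \bigl\{\, (i, j) \;:\; 1 \le i, j \le n - m + 1,\;
  a(i + p - 1,\, j + q - 1) = b(p, q) \ \text{for all } 1 \le p, q \le m \,\bigr\}
```

The set is empty when B does not occur in A, which corresponds to the "zero or more times" case.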

BRUTE-FORCE METHOD
The conventional algorithmic solution for the search problem is to search sequentially for a particular pattern until either the pattern has been found or the search space has been exhausted without any match. This approach is typically referred to as brute-force search or exhaustive search [2,4,9]. Brute-force search is simple to implement, will always find a solution if one exists, and requires no problem-specific insight. Fig. 1 describes a brute-force algorithm for the matrix submatching problem. The algorithm expects two matrices A: n×n and B: m×m as input, where m ≤ n; A is the main matrix and B is the submatrix. The algorithm goes through the first n-m+1 rows of the main matrix, and for each row it scans the first n-m+1 columns in order to find the upper-left corners of potential matches. For each of these (n-m+1)^2 elements of the main matrix, the algorithm performs at least one comparison and at most m^2 comparisons with the elements of the submatrix. It follows that the brute-force algorithm requires at least (n-m+1)^2 comparisons (i.e. Ω(n^2) when m is small relative to n) and at most m^2 (n-m+1)^2 comparisons (i.e. O(n^2 m^2)).
Fig. 3 illustrates a trace of the brute-force algorithm for the main matrix A: 6×6 and the submatrix B: 2×2 presented in Fig. 2. The elements of the first five rows and the first five columns are inspected as potential upper-left corners of a match. For the various iterations, the shaded areas in the main matrix represent the elements which are compared with the corresponding elements of the submatrix. The total number of comparisons required to return MSM(A, B) = {A(1, 4), A(4, 5)} is 46.
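The scan described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Fig. 1: the function name `brute_force_msm`, the early exit on the first mismatch, and the small 4×4 example are all introduced here, and positions are reported 1-based to follow the paper's convention.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def brute_force_msm(a: Matrix, b: Matrix) -> Tuple[List[Tuple[int, int]], int]:
    """Return (matches, comparisons) for square matrices a (n x n) and b (m x m).

    Matches are 1-based (row, column) positions of the upper-left corner.
    """
    n, m = len(a), len(b)
    matches: List[Tuple[int, int]] = []
    comparisons = 0
    for i in range(n - m + 1):          # candidate upper-left rows
        for j in range(n - m + 1):      # candidate upper-left columns
            matched = True
            for r in range(m):
                for c in range(m):
                    comparisons += 1
                    if a[i + r][j + c] != b[r][c]:
                        matched = False
                        break           # stop at the first mismatch
                if not matched:
                    break
            if matched:
                matches.append((i + 1, j + 1))
    return matches, comparisons


# Hypothetical example (not the matrices of Fig. 2).
A = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
B = [
    [1, 1],
    [1, 0],
]
matches, comparisons = brute_force_msm(A, B)
print(matches, comparisons)
```

Because every candidate corner costs at least one and at most m^2 comparisons, the counter returned by this sketch always lies between (n-m+1)^2 and m^2 (n-m+1)^2.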

CHAIN CODE BASED METHOD
The matrix submatching or matrix containment problem implies searching for a pattern in the form of a matrix inside a larger matrix. The brute-force algorithm tends to work well for matrices about whose contents no assumptions are made. This section introduces another solution for the matrix submatching problem based on chain coding, which is a succinct way of representing a list of points [6]. Only a starting point is represented by its location, while the other points are represented by successive displacements from point to point along a certain path. For several applications of matrices, such as image processing, a matrix tends to have repeating adjacent values representing objects. Although the proposed solution works for general grey-level values of the elements in matrices, the algorithm will be discussed with respect to binary matrices. Each row A_i of a matrix with n columns is transformed into a vector A_i' = (a'(i, 1), a'(i, 2), ..., a'(i, k_i)) of run lengths as follows:
a'(i, 1) = r_1, the number of the first consecutive zero-valued elements in A_i starting with a(i, 1).
a'(i, 2) = r_2, the number of the next consecutive one-valued elements in A_i starting with a(i, r_1 + 1).
a'(i, 3) = r_3, the number of the next consecutive zero-valued elements in A_i starting with a(i, r_1 + r_2 + 1), that is, a(i, r_1 + r_2 + 1) = a(i, r_1 + r_2 + 2) = ... = a(i, r_1 + r_2 + r_3) = 0, where r_1 + r_2 + r_3 ≤ n. If r_1 + r_2 + r_3 = n, then k_i = 3 and the transformation of row i stops.
The remaining elements are defined in the same way until the run lengths sum to n.
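A minimal Python sketch of this row transformation is given below. The function names `chain_code_row` and `chain_code_matrix` are hypothetical, and the sketch assumes strictly binary input; it follows the convention that the first entry always counts leading zeros, so a row that starts with one yields a first entry of 0.

```python
from typing import List


def chain_code_row(row: List[int]) -> List[int]:
    """Run lengths of a binary row; the first entry always counts leading zeros."""
    runs: List[int] = []
    expected = 0   # value whose run is currently being counted
    count = 0
    for value in row:
        if value == expected:
            count += 1
        else:
            runs.append(count)   # close the current run (may be 0 for a leading 1)
            expected = value
            count = 1
    runs.append(count)           # close the final run
    return runs


def chain_code_matrix(matrix: List[List[int]]) -> List[List[int]]:
    """Transform every row independently; rows may end up with different lengths."""
    return [chain_code_row(row) for row in matrix]


print(chain_code_row([0, 0, 1, 1, 1, 0]))   # zeros, ones, zeros
print(chain_code_row([1, 1, 0]))            # starts with one, so the first entry is 0
```

The entries of each transformed row always sum to the original row length n, which gives a cheap sanity check on the transformation.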
The contents of the vector A_i' are determined as per Fig. 4. The first element of A_i' is the number of consecutive zeros starting with a(i, 1) of row A_i; if a(i, 1) is one instead of zero, the first element of A_i' is set to zero. The second element of A_i' is the number of the next successive ones, the third element is the number of the next successive zeros, and so on. All rows of the main matrix and of the submatrix are transformed in the same fashion. As Fig. 5 shows, the transformation phase reduces the size of the matrices to a degree that depends on the sequential repetition of values in the matrix. This reduction in size decreases the search time of the proposed algorithm compared with the brute-force algorithm, which works on the original matrices.
Table 1 states the variables used in the search phase. The chain code based search algorithm builds on the assumption that each vector of the transformed matrices starts with the count of zeros; accordingly, if the first value in a vector is zero, the corresponding row in the original matrix starts with one. Fig. 6 shows a flow chart for the first part of the algorithm, which finds the first point of match in TMM, the transformed main matrix; TSM denotes the transformed submatrix.
If TMM starts with a number of zeros or ones larger than that in TSM, we call the function Check leading zeros or ones(), explained in Fig. 7, to search for the submatrix match sequentially.
If the start point of a match is found, the function Return offset in original Main Matrix() is invoked. After finding the first point of match, we continue searching for other potential points of match as per Fig. 9. The search is terminated when one of the following two cases occurs:
• If the last element of TSM has been reached, then a submatrix match has been found.
• If the value in TMM is less than the corresponding value in TSM, then Flag is set to FALSE.
If the end of the current row in TSM has been reached, the function Get the new values(i, j, n, m) is called in order to update the values of the counters i, j, n, and m. Fig. 10 shows how the function works.
To update the value of counter j, the function Return offset in TMM(), shown in Fig. 11, is invoked. This function returns the exact column in the next row of TMM at which to start the search.
After finding the value of j (lines 1-5 of Fig. 11), which indicates the column in the next row in TMM, we continue searching while ensuring that m (the column counter in TSM) points to a valid position. If the row starts with 1, then the first column contains 0; in this case, we increment m to point to the next location (line 6). Then, we compare the location to which j is pointing with the corresponding location in TSM. Finally, we check whether the flag is true in order to register offset_row and offset_col in the result matrix as the first occurrence. Fig. 12 displays the function which validates location correctness. The whole search process stops once we reach the end of TMM.
Fig. 13 shows a trace of the chain code based algorithm for the main and sub matrices shown in Fig. 5. In the trace, the search in TMM starts from j = 3, a location that holds a count of consecutive zeros. While the brute-force algorithm requires 46 comparisons to complete the search of the indicated matrices, the chain-code based algorithm requires only 17 comparisons to find all occurrences. This is due to the transformation phase, which reduces the matrix sizes by almost 50%. A comprehensive experimental comparison between the two algorithms, in terms of the number of comparisons required to find all occurrences, is discussed in the following section.
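Both offset functions translate between a column of an original row and a position in its transformed vector. The sketch below shows one plausible reading of that translation using prefix sums of the run lengths. The flow charts of the figures are not reproduced here, so the helper names `offset_in_original` and `offset_in_transformed` and their exact behavior are assumptions made for illustration; indices are 1-based as in the text.

```python
from typing import List


def offset_in_original(runs: List[int], j: int) -> int:
    """1-based column of the original row at which run number j (1-based) starts."""
    if not 1 <= j <= len(runs):
        raise ValueError("run index lies outside the transformed row")
    return sum(runs[: j - 1]) + 1


def offset_in_transformed(runs: List[int], col: int) -> int:
    """1-based index of the run that covers the 1-based original column col."""
    if col < 1:
        raise ValueError("column lies outside the row")
    covered = 0
    for j, length in enumerate(runs, start=1):
        covered += length
        if col <= covered:
            return j
    raise ValueError("column lies outside the row")


runs = [2, 3, 1]   # transformed form of the row 0 0 1 1 1 0
print(offset_in_original(runs, 2), offset_in_transformed(runs, 4))
```

An empty leading run (a row that starts with one) is skipped naturally by `offset_in_transformed`, since it covers no columns.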
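To illustrate how matching can be carried out on the transformed matrices without expanding them, the sketch below compares run boundaries instead of individual elements. It is not the flow-chart algorithm of Figs. 6-12 and does not reproduce its comparison counts; it is a simplified alternative under stated assumptions. All function names are hypothetical, the inputs are assumed to be non-empty binary matrices, and positions are reported 1-based.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate
from typing import List, Tuple

Runs = List[int]


def chain_code_row(row: List[int]) -> Runs:
    """Run lengths of a binary row; the first entry always counts leading zeros."""
    runs: Runs = []
    expected, count = 0, 0
    for value in row:
        if value == expected:
            count += 1
        else:
            runs.append(count)
            expected, count = value, 1
    runs.append(count)
    return runs


def change_points(runs: Runs) -> List[int]:
    """0-based columns whose value differs from the column to their left."""
    ends = list(accumulate(runs))              # exclusive end of every run
    return [e for e in ends[:-1] if e > 0]     # skip the row end and an empty leading run


def row_matches(ends: List[int], changes: List[int], sub_first: int,
                sub_changes: List[int], j: int, width: int) -> bool:
    """Does the sub row occur in the main row at 0-based column j?"""
    # Run index parity gives the value: even runs hold zeros, odd runs hold ones.
    if bisect_right(ends, j) % 2 != sub_first:
        return False
    lo = bisect_right(changes, j)              # first change strictly after column j
    hi = bisect_left(changes, j + width)       # first change at or beyond the window end
    window = changes[lo:hi]
    return len(window) == len(sub_changes) and all(
        p - j == q for p, q in zip(window, sub_changes))


def chain_code_msm(tmm: List[Runs], tsm: List[Runs]) -> List[Tuple[int, int]]:
    """All 1-based (row, column) positions where the submatrix occurs."""
    n_cols, m_cols = sum(tmm[0]), sum(tsm[0])
    main_ends = [list(accumulate(r)) for r in tmm]
    main_changes = [change_points(r) for r in tmm]
    sub_first = [0 if r[0] > 0 else 1 for r in tsm]
    sub_changes = [change_points(r) for r in tsm]
    matches = []
    for i in range(len(tmm) - len(tsm) + 1):
        for j in range(n_cols - m_cols + 1):
            if all(row_matches(main_ends[i + r], main_changes[i + r],
                               sub_first[r], sub_changes[r], j, m_cols)
                   for r in range(len(tsm))):
                matches.append((i + 1, j + 1))
    return matches


# Hypothetical example (not the matrices of Fig. 5).
A = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]]
B = [[1, 1], [1, 0]]
tmm = [chain_code_row(r) for r in A]
tsm = [chain_code_row(r) for r in B]
print(chain_code_msm(tmm, tsm))
```

The design choice here is that a sub row matches at column j exactly when its first value agrees with the main row at j and the value changes inside the window fall at the same relative positions, so the work per row depends on the number of runs rather than on the number of elements.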

EXPERIMENTAL RESULTS
The brute-force and chain-code based algorithms are considered sequential search mechanisms for the matrix submatching problem.
In order to compare the performance of both algorithms experimentally, we used MATLAB to randomly generate a database of main matrices with sizes 50×50, 75×75, 100×100 and 200×200 and another database of submatrices with sizes 10×10, 15×15, 25×25, 30×30, 35×35, 40×40 and 45×45. The databases contain 1000 instances of each indicated size, and the average numbers of comparisons required by both algorithms to find the occurrences of the submatrices in the corresponding main matrices were computed. The outcome of the experiments is summarized in Fig. 14. Our experiments show that the chain code based algorithm requires about half the number of comparisons required by the brute-force approach. This is attributed to the compression in size achieved by the preprocessing phase of the chain-code approach. For several applications, it is typical that a database of matrices exists and a query is posed against the database to retrieve all matrices which contain an incoming submatrix [7,11]. In such cases, the preprocessing phase for the main matrices needs to be done only once.
Table 2 shows the average percentage of size reduction for randomly generated square binary matrices of various sizes. The maximum average percentage of size reduction is 50%.
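The size reduction can be estimated with a short simulation. The sketch below is not the authors' MATLAB experiment: it assumes independent, uniformly distributed bits generated with Python's `random` module, and the function names, trial count, and seed are arbitrary choices made for illustration. Under that assumption a row of length n has on average (n + 2) / 2 entries after transformation, so the expected reduction is 1/2 - 1/n, which stays just below 50%.

```python
import random
from typing import List


def chain_code_row(row: List[int]) -> List[int]:
    """Run lengths of a binary row; the first entry always counts leading zeros."""
    runs: List[int] = []
    expected, count = 0, 0
    for value in row:
        if value == expected:
            count += 1
        else:
            runs.append(count)
            expected, count = value, 1
    runs.append(count)
    return runs


def average_size_reduction(n: int, trials: int = 50, seed: int = 0) -> float:
    """Average fraction of entries saved by transforming random n x n binary matrices."""
    rng = random.Random(seed)
    original = transformed = 0
    for _ in range(trials):
        for _ in range(n):                       # one random row at a time
            row = [rng.randint(0, 1) for _ in range(n)]
            original += n
            transformed += len(chain_code_row(row))
    return 1 - transformed / original


for size in (50, 75, 100):
    print(size, round(100 * average_size_reduction(size), 1), "% reduction")
```

Matrices with longer runs than uniformly random data, such as images containing solid objects, would compress further, while matrices with alternating values would compress less or not at all.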

CONCLUSION
This article brings focus to the matrix submatching operation as an essential problem to be solved for many applications, including watermarking, geographic information systems and pattern recognition. Most of these applications start with a database of matrices and require the retrieval of those matrices which contain an incoming matrix. The chain code based approach presented in this paper consists of two phases, namely transformation and matching. The transformation phase reduces the sizes of all relevant matrices to nearly half of their original sizes, bringing about a clear saving in the number of comparisons when compared with the brute-force approach. Although this paper demonstrated the superiority of the chain-code approach for binary square matrices, the results hold true for general matrices.