A New Algorithm for Subset Matching Problem

The subset matching problem is to find all occurrences of a pattern string p of length m in a text string t of length n, where each pattern and text position is a set of characters drawn from some alphabet Σ. The pattern is said to occur at text position i if the set p[j] is a subset of the set t[i + j - 1], for all j (1 ≤ j ≤ m). This is a generalization of ordinary string matching and can be used for finding matching subtree patterns. In this research, we propose a new algorithm that needs O(n⋅m) time in the worst case, but whose average time complexity is O(n + m⋅n^(log 1.5)).


INTRODUCTION
The subset matching problem is a generalization of the ordinary string matching problem, in which both the pattern and the text are sequences of sets (of characters). Formally, the text t is a string of length n and the pattern p is a string of length m. Each text position t[i] and each pattern position p[j] is a set of characters (not a single character), taken from a certain alphabet Σ (see the definition given in [7]). Strings in which each location is a set of characters will be called set-strings to distinguish them from ordinary strings. Pattern p is said to match text t at position i if p[j] ⊆ t[i + j - 1] for all j (1 ≤ j ≤ m). As an example, consider the set-strings t and p shown in Figure 1. Figure 1(a) shows a matching case, in which we have p[j] ⊆ t[i + j - 1] for i = 1 and j = 1, 2, 3, while Figure 1(b) illustrates a non-matching case, since for i = 2 we have p[2] ⊄ t[i + 2 - 1].
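The definition above can be checked directly with a brute-force O(n⋅m) scan. The following is only a minimal sketch of that definition, not the algorithm proposed in this paper; the function name subset_match_naive and the set-strings t and p are hypothetical illustration data (they are not claimed to be the data of Figure 1).

```python
from typing import List, Set

def subset_match_naive(t: List[Set[int]], p: List[Set[int]]) -> List[int]:
    """Return all 1-based positions i with p[j] a subset of t[i + j - 1] for every j."""
    n, m = len(t), len(p)
    hits = []
    for i in range(n - m + 1):                       # 0-based start of the window
        # For Python sets, a <= b tests whether a is a subset of b.
        if all(p[j] <= t[i + j] for j in range(m)):
            hits.append(i + 1)                       # report 1-based, as in the text
    return hits

# Hypothetical set-strings over the alphabet {1, ..., 5}.
t = [{1, 3, 4}, {1, 3}, {2, 3}, {1, 3, 4}, {4}, {4}, {2}]
p = [{1, 3}, {1}, {2}]
print(subset_match_naive(t, p))  # [1]
```

This scan serves as a reference against which the transformed problem discussed later can be compared.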
This problem was defined in [5] and is of interest, as it was shown (also in [5] and its improved version [7]) that the well-known tree pattern matching problem can be linearly reduced to this problem. In addition, as shown in [8], this problem can also be used to solve general string matching and counting matching [10,11] and enables us to design efficient algorithms for several geometric pattern matching problems. Up to now, the best known method for solving subset matching is based on the construction of superimposed codes (bit strings [3,4]) for the characters in Σ and the convolution operation of vectors [1]. The superimposed codes are generated in such a way that no bit string (for a character) is contained in a boolean sum of k other bit strings, where k is the largest size of the sets in both t and p. As indicated in [6], such superimposed codes can be generated in O(n⋅log^2 m) time. In addition, by decomposing a subset matching problem into several smaller problems [5], the convolution operation can also be done in O(n⋅log^2 m) time by using Fourier transformation [1] (if the cardinality of Σ is bounded by a constant). Therefore, the algorithm discussed in [6] needs only O(n⋅log^2 m) time.
In this research, we explore a quite different way to solve this problem. The main idea of our algorithm is to transform a subset matching problem into another subset matching problem by constructing a trie over the text string. In the new subset matching problem, t is reduced to a different string t', in which each position is an integer (instead of a set of characters), and p is changed to another string p', in which each position remains a set (of integers). This transformation allows us to use existing string matching techniques to solve the problem. Concretely, we generate a suffix tree over t' and search the suffix tree against p' in a way similar to the traditional methods. The algorithm runs in O(n⋅m) time in the worst case and in O(n + m⋅n^(log 1.5)) time on average.

ALGORITHM DESCRIPTION
Assume that Σ = {1, ..., k}. We construct a 0-1 matrix T = (a_ij) for t = t_1 t_2 ... t_n such that a_ij = 1 if i ∈ t_j and a_ij = 0 if i ∉ t_j (see Figure 2 for illustration). In the same way, we construct another 0-1 matrix P = (b_ij) for p = p_1 p_2 ... p_m.
Then, each column in T (P) can be considered as a bit string representing a set in t (resp. p). (In the following discussion, we use b(t_i) (b(p_j)) to denote the bit string for t_i (resp. p_j).) In the next step, we construct a (compact) trie over all b(t_i)'s, denoted by trie(T), as illustrated in Figure 3(a). In this trie, the left outgoing edge of each node is labeled with a string beginning with 0 and the right outgoing edge with a string beginning with 1, and each path from the root to a leaf node represents a distinct bit string. In addition, each leaf node v in trie(T) is associated with the set of all those t_i's whose bit string is represented by the path from the root to v. Then, t can be transformed as follows:
- Number all the leaf nodes of the trie from left to right (see Figure 3).
- Replace each t_i in t with the number of the leaf node with which a set containing t_i is associated.
For example, the text string t shown in Figure 1(a) will be transformed into a string t' as shown in Figure 3(b), in which each position is an integer. For this example, t_1 and t_4 are replaced by 5, t_2 by 4, t_3 by 3, t_5 and t_6 by 1 and t_7 by 2.
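The text transformation can be sketched as follows. This sketch does not build an explicit trie: it relies on the observation that, for bit strings of equal length, the left-to-right order of the trie leaves (0-edge on the left, 1-edge on the right) coincides with lexicographic order, so sorting the distinct bit strings yields the same leaf numbers. The names bit_string and transform_text and the sample text are hypothetical; the data were chosen so that the result agrees with the leaf numbers quoted above, but they are not claimed to be the data of Figure 1.

```python
from typing import Dict, List, Set, Tuple

def bit_string(s: Set[int], k: int) -> str:
    """b(s): position i (1-based) is '1' iff character i belongs to s."""
    return "".join("1" if i in s else "0" for i in range(1, k + 1))

def transform_text(t: List[Set[int]], k: int) -> Tuple[List[int], Dict[str, int]]:
    """Return t' and the map from each distinct bit string to its leaf number."""
    codes = [bit_string(s, k) for s in t]
    # Lexicographic rank of a distinct code = left-to-right number of its leaf.
    leaf_no = {c: r for r, c in enumerate(sorted(set(codes)), start=1)}
    return [leaf_no[c] for c in codes], leaf_no

# Hypothetical text over the alphabet {1, ..., 5}.
t = [{1, 3, 4}, {1, 3}, {2, 3}, {1, 3, 4}, {4}, {4}, {2}]
t_prime, leaf_no = transform_text(t, 5)
print(t_prime)  # [5, 4, 3, 5, 1, 1, 2]
```

Sorting costs more than the trie construction for a fixed alphabet, so this is meant only to make the numbering rule concrete.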
In order to find all the sets in t that contain a certain p_j, we search trie(T) against b(p_j) as follows.
• Let v (in trie(T)) be the node encountered and b(p_j)[i] be the position to be checked. Denote the left and right outgoing edges of v by e_l and e_r, respectively. We do the following checks:
- If b(p_j)[i] = 1, we explore only the right outgoing edge e_r of v.
- If b(p_j)[i] = 0, we explore both e_l and e_r.
In fact, this definition just corresponds to the process of checking whether a set contains another as a subset. That is, if b(p_j)[i] = 1, we compare only the label of e_r, denoted by l(e_r), with the corresponding substring in b(p_j) according to the following criteria: if a bit in b(p_j) is 1, the corresponding bit in l(e_r) must also be 1; if a bit in b(p_j) is 0, it does not matter whether the corresponding bit in l(e_r) is 1 or 0. If they match, we move to the right child of v. If b(p_j)[i] = 0, we check both l(e_l) and l(e_r) in the same way. For example, to find all the t_i's in the text string t shown in Figure 1(a) that contain p_1 in p shown in the same figure, we search the trie against b(p_1) = 10100. For this, part of the trie will be traversed as illustrated by the heavy lines in Figure 4(a).
This shows that in t there are three sets t_1, t_2 and t_4 containing p_1. But in t', t_1 and t_4 are represented by 5 and t_2 is represented by 4. So we associate {4, 5} with p_1 and replace p_1 in p with {4, 5}. In this way, we transform p into another string p', in which each position is a set of integers representing all those sets in t that contain the set at the same position in p (see Figure 4(b) for illustration). Each set in p' can be represented by a bit string of length l, where l is the number of different sets (t_i's) in t. If i belongs to the set, the ith position is set to 1; otherwise, it is set to 0.
Formally, the above transformation defines two functions:
f_t: Set_t → I,
where Set_t is the set of all t_i's in t and I = {1, ..., l}, and f_t(t_i) = a if a is the number of the leaf node in trie(T) with which a set containing t_i is associated; and
f_p: Set_p → 2^I,
where Set_p is the set of all p_j's in p and 2^I is the power set of I, and f_p(p_j) = b if b is the set of integers labeling the leaf nodes in trie(T) that are encountered when searching trie(T) against b(p_j).
Obviously, these two functions satisfy the following property.
Lemma 1: For any t_i in t and any p_j in p, p_j ⊆ t_i if and only if f_t(t_i) ∈ f_p(p_j).
Proof: It can be directly derived from the above definition of the string transformations.
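The pattern transformation can be sketched as a search of a binary trie in which a pattern bit 1 forces the 1-edge and a pattern bit 0 allows both edges. For simplicity the sketch uses a non-compact trie (one bit per edge) stored as nested dictionaries, which is an assumption of this illustration rather than the compact trie described above; build_trie, search and the bit strings are hypothetical.

```python
from typing import List, Set

def build_trie(codes: List[str]) -> dict:
    """Non-compact binary trie; each leaf stores its left-to-right number."""
    root: dict = {}
    for rank, code in enumerate(sorted(set(codes)), start=1):
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node["leaf"] = rank
    return root

def search(node: dict, code: str, i: int = 0) -> Set[int]:
    """Leaf numbers of all text sets that contain the set encoded by code."""
    if i == len(code):
        return {node["leaf"]}
    found: Set[int] = set()
    # A 1 in the pattern forces the 1-edge; a 0 allows both edges.
    for bit in (("1",) if code[i] == "1" else ("0", "1")):
        if bit in node:
            found |= search(node[bit], code, i + 1)
    return found

# Hypothetical bit strings b(t_i) and b(p_j) over an alphabet of size 5.
t_codes = ["10110", "10100", "01100", "10110", "00010", "00010", "01000"]
p_codes = ["10100", "10000", "01000"]
trie = build_trie(t_codes)
p_prime = [search(trie, c) for c in p_codes]
print([sorted(s) for s in p_prime])  # [[4, 5], [4, 5], [2, 3]]
```

On this data the property of Lemma 1 can be verified exhaustively: a text set contains a pattern set exactly when its leaf number belongs to the corresponding set of p'.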
In the next step, we construct a suffix tree over t' = t_1' t_2' ... t_n', the transformed t, using a well-known algorithm such as those discussed in [12,13]. We remark that the alphabet for t' is {1, ..., l} (l ≤ 2^k), since each t_i' ∈ {1, ..., l}. It is relatively large, but it is a sorted set, which is constructed when we number the leaf nodes of the trie for t. Therefore, the construction of the suffix tree over t' requires only O(n) time. (More exactly, using McCreight's algorithm [12], we need O(n⋅log l) time, where log l ≤ log 2^k = k.) For example, for the string shown in Figure 3(b), we will generate a suffix tree as shown in Figure 5. In this tree, each internal node v is associated with an integer (denoted as int(v)) to indicate the position (in p') to be checked when searching; each edge is labeled with a substring; and all the labels along a path (from the root to a leaf node) form a suffix of t', plus $, a special symbol that prevents any suffix from being a prefix of another. So a leaf node can be considered as a pointer to a suffix. In order to find all the substrings in t that match p, we explore the suffix tree for the transformed string t' against the transformed string p' as follows:
• Search the suffix tree from the root.
• Let v be the node encountered and p_i' be the set to be checked. For each child u_j of v, if the label of the edge (v, u_j) passes the check against p', i.e., each of its symbols belongs to the set in p' against which it is checked, the subtree rooted at u_j will be explored further. Otherwise, the subtree will not be searched any more.
In addition, we notice that the symbol $ is always ignored when we check the labels associated with the edges in the suffix tree. In the above process, if we can find an edge e = (v, u) with l(e) = l_1 ... l_g ... such that l_g is checked against p_m' with l_g ∈ p_m', any leaf node in the subtree rooted at u indicates a substring in t that matches p.
The following is the formal description of the whole process.
begin
1. Construct trie(T) over t and transform t to t' = t_1' ... t_n';
2. Transform p to p' = p_1' ... p_m' by using the trie constructed over t;
3. Construct a suffix tree t_suffix over t';
4. Search t_suffix against p';
5. for any e = (v, u) in t_suffix with l(e) = l_1 ... l_g ... such that l_g is checked against p_m' with l_g ∈ p_m' do
6. {return all the leaf nodes in the subtree rooted at u;}
end
Example 1: By applying the above algorithm to the problem shown in Figure 1(a), trie(T) shown in Figure 3(a) will first be generated and t will be transformed to t' as shown in Figure 3(b). Then, by searching trie(T) against each p_i one by one, we transform p to p' as shown in Figure 4(b). The suffix tree for t' is shown in Figure 5. Finally, we search the suffix tree against p' as shown by the heavy edges in Figure 6. For this simple example, only one path in the suffix tree is explored, but multiple paths may be searched in general.
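The search step can be illustrated on a naive suffix trie, which stores one symbol per edge and may need quadratic space. This is a simplification assumed here for brevity; it is not the linear-size suffix tree of [12,13], although it follows the same branching rule: an edge is followed only if its symbol belongs to the current set of p'. The function names and the strings t_prime and p_prime are hypothetical.

```python
from typing import List, Set

def build_suffix_trie(t_prime: List[int]) -> dict:
    """Naive suffix trie; the key "$" stores the 1-based start of a suffix."""
    root: dict = {}
    for start in range(len(t_prime)):
        node = root
        for sym in t_prime[start:]:
            node = node.setdefault(sym, {})
        node["$"] = start + 1
    return root

def leaves(node: dict) -> List[int]:
    """All suffix start positions stored in the subtree rooted at node."""
    out: List[int] = []
    for key, child in node.items():
        if key == "$":
            out.append(child)
        else:
            out.extend(leaves(child))
    return out

def search(node: dict, p_prime: List[Set[int]], j: int = 0) -> List[int]:
    """Follow every edge whose symbol lies in p'_j; report leaves after p'_m."""
    if j == len(p_prime):
        return leaves(node)
    hits: List[int] = []
    for sym, child in node.items():
        if sym != "$" and sym in p_prime[j]:   # "$" is ignored during the check
            hits.extend(search(child, p_prime, j + 1))
    return hits

# Hypothetical transformed strings.
t_prime = [5, 4, 3, 5, 1, 1, 2]
p_prime = [{4, 5}, {4, 5}, {2, 3}]
print(sorted(search(build_suffix_trie(t_prime), p_prime)))  # [1]
```

Two edges leave the root under the first set {4, 5}, but only the path 5, 4, 3 survives all three checks, so position 1 is the single reported match.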
Proof: Let n_1 → n_2 → ... → n_m → n_(m+1) be a path that is visited when searching the suffix tree against p'. Let l_i = l(n_i, n_(i+1)) denote the label associated with the edge (n_i, n_(i+1)) (1 ≤ i ≤ m). Then, we must have l_i ∈ p_i' (1 ≤ i ≤ m). In terms of Lemma 1, the substring in t whose transformed symbols are l_1 ... l_m matches p. We remark that all the suffixes represented by the leaf nodes in the subtree rooted at n_(m+1) have l_1 ... l_m as a prefix. So, each of these suffixes corresponds to a substring in t that matches p.
The time complexity of the algorithm consists of four parts: C_1, C_2, C_3 and C_4, which are defined and estimated below.
C_1 is the time used for constructing the trie for t. In the case that Σ is fixed, it needs only O(n) time.
C_2 is the time spent on generating p' for p. Let A represent the largest number of edges visited when searching trie(T) against a b(p_j). Then C_2 is bounded by O(A⋅m).
C_3 is the cost for constructing the suffix tree over t'. It is O(n).
C_4 is the cost for searching the suffix tree against p'. It is bounded by O(A'⋅m), where A' is the largest number of edges explored during the search of the suffix tree.
Therefore, the whole computation process runs in time O(n + A⋅m + A'⋅m). In the worst case, it is O(m⋅n).
On average, however, both A and A' are on the order of O(n^(log 1.5)), as shown in the subsequent section.

ANALYSIS OF A AND A'
In this section, we give a simple analysis of the average value of A. A precise probabilistic analysis is given in Section 5.
In order to analyze the average cost of A, we consider a 'worse' case in which the trie is not compact, i.e., each edge is labeled with a single bit (instead of a bit string), which makes the analysis easier. In Figure 7(b), we show a non-compact trie for the set of bit strings shown in Figure 7(a). In the following, we use c_s(T) to represent the expected number of edges visited when searching T against s. In addition, we use s', s'', s''', ... to designate the patterns obtained by circularly shifting the bits of s to the left by 1, 2, 3, ... positions.
Obviously, if the first bit of s is 0, we have, for the expected cost of a random string s,
c_s(T) = 1 + c_{s'}(T_1) + c_{s'}(T_2), (1)
where T_1 and T_2 represent the two subtrees of the root of T (see Figure 8 for illustration). This is because in this case the search has to proceed in parallel along the two subtrees, with s changing cyclically to s'.
If the first bit in s is 1, we have
c_s(T) = 1 + c_{s'}(T_2), (2)
since in this case the search proceeds only in T_2.
In order to get the expectation of c s (T), we make the following assumption.
For each t_i in t, each element in t_i is taken from Σ with probability p = 1/2, independently of all the other t_j's and all the other elements in t_i.
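Under this assumption, T_1 and T_2 will have almost the same size N/2, where N is the number of the nodes in T. Writing c_s(N) for the expected cost on a trie with N nodes, (1) and (2) can be rewritten as follows:
c_s(N) = 1 + 2⋅c_{s'}(N/2) (3)
and
c_s(N) = 1 + c_{s'}(N/2). (4)
From (3) and (4), we get the following recurrence equation:
c(N) = 1 + (3/2)⋅c(N/2), (5)
whose solution is
c(N) = O(N^(log 1.5)). (6)
In terms of (6), we have the following proposition.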
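The growth rate claimed in the following proposition can be checked numerically. The sketch below assumes that the averaged recurrence has the form c(N) = 1 + (3/2)⋅c(N/2), obtained by weighting the two cases (first bit 0 or 1) equally, with c(1) = 1 as a boundary value and the logarithm taken to base 2. Both the form of the recurrence and the boundary value are assumptions of this illustration.

```python
import math

def c(N: int) -> float:
    """Assumed average recurrence c(N) = 1 + 1.5 * c(N / 2), with c(1) = 1."""
    return 1.0 if N <= 1 else 1.0 + 1.5 * c(N // 2)

alpha = math.log2(1.5)                 # exponent log 1.5, roughly 0.585
for e in (4, 8, 12, 16, 20):
    N = 2 ** e
    # The ratio c(N) / N^alpha approaches a constant, so c(N) grows like N^alpha.
    print(N, round(c(N) / N ** alpha, 4))
```

For powers of two the recurrence unrolls to c(2^e) = 3⋅1.5^e - 2, and N^alpha = 1.5^e, so the printed ratio tends to 3.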

Proposition 2:
A is on the order of O(n^(log 1.5)).

Proof: The number of nodes in trie(T) is bounded by O(kn). So the average value of A is O((kn)^(log 1.5)) = O(n^(log 1.5)).
Since only O(n^(log 1.5)) edges are visited on average when searching trie(T) against a b(p_j) in p, the size of the set of all those t_i's that contain p_j is on the order of O(n^(log 1.5)), and so is A'.

IMPROVEMENTS
The above process can be significantly improved. For p, we can also generate a trie over the b(p_j)'s, denoted by trie(P), where P represents the 0-1 matrix for p, which is constructed in the same way as T for t. But for ease of control, we will establish non-compact tries for both t and p as illustrated in Figure 9. We will search these two tries simultaneously, simulating the above containment checking. For this purpose, we will maintain a stack, stack, in which each entry is of the form {v, u} with v ∈ trie(T) and u ∈ trie(P). During the process, each time we encounter a node v in trie(T) and a node u in trie(P), we will manipulate stack as below.
• Let v_1 and v_2 be the two children of v, with edge (v, v_1) labeled by 0 and edge (v, v_2) by 1; and let u_1 and u_2 be the two children of u, with edge (u, u_1) labeled by 0 and edge (u, u_2) by 1.
• Push the three pairs {v_2, u_2}, {v_2, u_1} and {v_1, u_1} (in the order specified) into stack.
• If v is a leaf node, put the number associated with v into a set associated with u to record the fact that the sets represented by v contain the sets represented by u.
Below is the formal description of the algorithm. In the algorithm, the following two symbols are used:
• Num(v) - a number associated with a leaf node v in trie(T).
• Matching(u) - a sorted set (of integers) associated with a leaf node u in trie(P). Each integer in the set represents one or more sets in t, which contain the sets represented by u.
8. let u_1 and u_2 be two children of u with (u, u_1) labeled by 0 and (u, u_2) by 1;
9. push(stack, {v_2, u_2}); push(stack, {v_2, u_1}); push(stack, {v_1, u_1});}
10. }
end
By the above algorithm, each p_j in p will be transformed to a set of integers. Applying this algorithm to the tries shown in Figure 9(a) and (b), we will get the same result as shown in Figure 4(b). But we search trie(T) against only two paths instead of three. In addition, p_1 and p_2 are replaced with the same set {4, 5}. So we implement p' as a pointer sequence with each pointer pointing to a set of integers.
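The simultaneous traversal can be sketched with an explicit stack of node pairs. The sketch below is one possible reading of the stack manipulation described above and does not claim to reproduce the original numbered steps; the tries are nested dictionaries, leaves carry their number under the key "num", and build_trie, p_transformation and the bit strings are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def build_trie(codes: List[str]) -> Tuple[dict, Dict[str, int]]:
    """Non-compact binary trie over equal-length bit strings.

    Leaves are numbered left to right; returns the root and the
    map from each distinct code to its leaf number.
    """
    rank = {c: r for r, c in enumerate(sorted(set(codes)), start=1)}
    root: dict = {}
    for code, r in rank.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node["num"] = r
    return root, rank

def p_transformation(trie_t: dict, trie_p: dict) -> Dict[int, Set[int]]:
    """Map each leaf number of trie(P) to Matching(u), a set of trie(T) leaf numbers."""
    matching: Dict[int, Set[int]] = {}
    stack = [(trie_t, trie_p)]
    while stack:
        v, u = stack.pop()
        if "num" in v:                    # v and u are leaves on the same level
            matching.setdefault(u["num"], set()).add(v["num"])
            continue
        # Pattern bit 1 requires text bit 1; pattern bit 0 accepts either text bit.
        for vb, ub in (("1", "1"), ("1", "0"), ("0", "0")):
            if vb in v and ub in u:
                stack.append((v[vb], u[ub]))
    return matching

# Hypothetical bit strings b(t_i) and b(p_j) over an alphabet of size 5.
t_codes = ["10110", "10100", "01100", "10110", "00010", "00010", "01000"]
p_codes = ["10100", "10000", "01000"]
trie_t, _ = build_trie(t_codes)
trie_p, rank_p = build_trie(p_codes)
matching = p_transformation(trie_t, trie_p)
# p' as a sequence of references to the shared Matching sets.
p_prime = [matching.get(rank_p[c], set()) for c in p_codes]
print([sorted(s) for s in p_prime])  # [[4, 5], [4, 5], [2, 3]]
```

Because the two pattern strings beginning with 1 share that first bit in trie(P), the corresponding part of trie(T) is entered only once for both of them.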
In general, for all those p j 's that share the same prefix, the prefix is checked only once, which enables us to save much time.
The worst-case time complexity C can be analyzed as follows. For each pair {v, u} generated during the process, v and u must be on the same level in trie(T) and trie(P), respectively. Let N_t be the number of different sets (t_i's) in t and N_p the number of different sets (p_j's) in p. We have
C = O(Σ_i num_T(i)⋅num_P(i)) = O(k⋅N_t⋅N_p),
where num_T(i) (num_P(i)) represents the number of nodes on level i in trie(T) (resp. in trie(P)).
Now we analyze the average time of this algorithm.
We simply use T and P to represent trie(T) and trie(P), respectively. Denote by root_T the root of T and by root_P the root of P. Let T_1 be the left subtree of root_T and T_2 the right subtree of root_T. Let P_1 be the left subtree of root_P and P_2 the right subtree of root_P. Then, we have the following recurrence equations:
C(T, P) = 1 + C(T_1, P_1) + C(T_2, P_1) + C(T_2, P_2), (7) (*root_P has both the left and right child nodes.*)
C(T, P) = 1 + C(T_1, P_1) + C(T_2, P_1), (8) (*root_P has only the left child node.*)
C(T, P) = 1 + C(T_2, P_2), (9) (*root_P has only the right child node.*)
where C(T, P) represents the average number of the pairs (v, u) created during the process with v ∈ T and u ∈ P. From the above equations, we get a recurrence for C(n, m), which leads to the following proposition.
Proposition 3: C(n, m) is on the order of O(n^(log 1.5)⋅m^(log 1.5)).
For n ≥ 2 and m ≥ 2, 1/(n^(log 1.5)⋅m^(log 1.5)) ≤ 1/2.25. So C(n, m) ≤ n^(log 1.5)⋅m^(log 1.5).
Proposition 3 shows that the average cost of the algorithm p-transformation is on the order of O(n^(log 1.5)⋅m^(log 1.5)).

PROBABILISTIC ANALYSIS
In terms of the analysis conducted in Section 3, we have the following two recurrences: where T_1 and T_2 represent the two subtrees of the root of T. Given N (N ≥ 2) random nodes in T, the probability that We have This equation can be solved by iteration as discussed above: 2. Derive an expression for φ*(σ), which reveals some of its singularities.

Evaluate the inverse Mellin transformation
The integral (30) is evaluated by using Cauchy's theorem as a sum of residues to the right of the vertical line {c + iy | y ∈ ℜ}, where ℜ represents the set of all real numbers. This computation method was first proposed in [14]. The following is just an extended explanation of it.
Recall that D_jh(x) = 1 - (1 - 2^(-mj-h))^x - x⋅2^(-mj-h)⋅(1 - 2^(-mj-h))^(x-1). We rewrite it under the form Now we consider the following expansion, which is valid for small values of x:

(32)
Let x = 2^(-mj-h). Then, we have (by using the above expansion) In addition, for small values of 2^(-mj-h), we also have Following the classical properties of the Mellin transformation, we have the following proposition.
Proof. The following formulas are well-known: In terms of these formulas, we have Now we try to evaluate the following two sums: From (33) and (34), we can see that the two sums given by (40) are uniformly and absolutely convergent when σ is in the following strip: Furthermore, in terms of (33) and (34), both ω_h(σ) and υ_h(σ) can be approximated by the following sum: When Re(σ) < σ_0 = -(1m k ), this series can be summed exactly: (43) Thus, φ*(σ) is defined in this strip and can be computed as follows To compute the integral in (21), we consider the following integral where L_N is a rectangular contour oriented clockwise as shown in Figure 10, and N is an integer. This contour is of a type similar to that used in [9] (p. 132).
Let φ_N^i be the integral along L_N^i (i = 1, 2, 3, 4). Then, we have the following results:

CONCLUSION
In this research, a new algorithm for the subset matching problem is proposed. The main idea of the algorithm is to represent each set t_i in the text string t as a single integer a and each set p_j in the pattern string p as a set b of integers such that a ∈ b if and only if p_j ⊆ t_i. This is done by constructing a trie structure over t. In this way, we transform the original problem into a different subset matching problem, which can be efficiently solved by generating a suffix tree over the new text string that has an integer at each position. In the worst case, the algorithm runs in O(n + l⋅m) time, where l is the number of different sets (t_i's) in t. But its average time complexity is O(n + m⋅n^(log 1.5)).