An Efficient Algorithm for Tree Mapping in XML Databases

In this article, we discuss an efficient algorithm for the tree mapping problem in XML databases. Given a target tree T and a pattern tree Q, the algorithm finds all the embeddings of Q in T in O(|T||Q|) time, whereas existing approaches need exponential time in the worst case.


INTRODUCTION
XML uses a tree-structured model for representing data. Queries in XML languages (such as XPath [1], XQuery [2,3], XML-QL [4] and Quilt [5,6]) also typically specify selection patterns as a kind of tree-structured relation. For instance, the XPath expression book[title = 'Art of Programming']//author[fn = 'Donald' and ln = 'Knuth'] matches author elements that (i) have a child subelement fn with content 'Donald', (ii) have a child subelement ln with content 'Knuth' and (iii) are descendants of book elements that have a child title subelement with content 'Art of Programming'. This expression can be represented as a tree structure as shown in Fig. 1.

Fig. 1: A query tree
In this tree structure, a node v is labeled with an element name or a string value, denoted label(v). In addition, there are two kinds of edges: child edges (c-edges) for parent-child relationships and descendant edges (d-edges) for ancestor-descendant relationships. A c-edge from node v to node u is denoted by v → u in the text and represented by a single arc; u is called a c-child of v. A d-edge is denoted v ⇒ u in the text and represented by a double arc; u is called a d-child of v. Such a query is often called a twig pattern.
In any DAG (directed acyclic graph), a node u is said to be a descendant of a node v if there exists a path (sequence of edges) from v to u. In the case of a twig pattern, this path could consist of any sequence of c-edges and/or d-edges. Based on these concepts, the tree embedding can be defined as follows.
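As a rough illustration of this data model, the sketch below represents a twig pattern in Python and builds the query from the XPath example. The names `QNode`, `build_example` and `edges` are hypothetical, and the exact shape of the tree (in particular, modeling each string value as a c-child of its element) is an assumption, since Fig. 1 may draw it differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QNode:
    """A twig-pattern node labeled with an element name or a string value."""
    label: str
    # (edge kind, child): "c" for a c-edge (v → u), "d" for a d-edge (v ⇒ u).
    children: List[Tuple[str, "QNode"]] = field(default_factory=list)

    def add(self, kind: str, child: "QNode") -> "QNode":
        if kind not in ("c", "d"):
            raise ValueError("edge kind must be 'c' or 'd'")
        self.children.append((kind, child))
        return child


def build_example() -> QNode:
    # book[title = 'Art of Programming']//author[fn = 'Donald' and ln = 'Knuth']
    # Assumption: string values are attached as c-children of their elements.
    book = QNode("book")
    book.add("c", QNode("title")).add("c", QNode("Art of Programming"))
    author = book.add("d", QNode("author"))
    author.add("c", QNode("fn")).add("c", QNode("Donald"))
    author.add("c", QNode("ln")).add("c", QNode("Knuth"))
    return book


def edges(node: QNode):
    """Yield (parent label, edge kind, child label) for every edge, in preorder."""
    for kind, child in node.children:
        yield (node.label, kind, child.label)
        yield from edges(child)
```

Under this reading, the only d-edge in the pattern is the one from book to author; all other edges are c-edges.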
Definition 1: An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
i. Preserve node label: For each u ∈ Q, u and f(u) have the same label (or, more generally, u's predicate is satisfied by f(u)).
ii. Preserve c/d-child relationships: If u → v in Q, then f(v) is a child of f(u) in T; if u ⇒ v in Q, then f(v) is a descendant of f(u) in T.
If there exists a mapping from Q into T, we say that Q can be embedded into T, or that T contains Q.
Notice that an embedding could map several nodes of the query (of the same type) to the same node of the database. It also allows a tree to be mapped to a path. This definition is quite different from the tree matching defined in [7].
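To make the definition concrete, here is a minimal sketch of a checker for a candidate mapping f. It reads c-edges as parent-child and d-edges as (proper) ancestor-descendant relationships, as in the edge definitions given earlier. The function name `is_embedding`, the dictionary-based tree representation and the sample data are all assumptions made for illustration.

```python
def is_embedding(q_label, q_edges, t_label, t_parent, f):
    """Check the two conditions of an embedding for a candidate mapping f.

    q_label / t_label: dict node -> label
    q_edges: list of (u, kind, w), kind "c" for u → w and "d" for u ⇒ w
    t_parent: dict node -> parent (the root is absent from the dict)
    f: dict query node -> document node
    """
    def is_descendant(x, ancestor):
        # Walk up from x; True if `ancestor` is met on the way to the root.
        x = t_parent.get(x)
        while x is not None:
            if x == ancestor:
                return True
            x = t_parent.get(x)
        return False

    # Condition i: every query node is mapped and labels are preserved.
    for u, label in q_label.items():
        if u not in f or t_label.get(f[u]) != label:
            return False
    # Condition ii: c-edges map to single edges, d-edges to downward paths.
    for u, kind, w in q_edges:
        if kind == "c":
            if t_parent.get(f[w]) != f[u]:
                return False
        elif not is_descendant(f[w], f[u]):
            return False
    return True


# Hypothetical data: the document is the path a - c - b;
# the query is a node a with two d-children labeled b.
t_label = {1: "a", 2: "c", 3: "b"}
t_parent = {2: 1, 3: 2}
q_label = {"q0": "a", "q1": "b", "q2": "b"}
q_edges = [("q0", "d", "q1"), ("q0", "d", "q2")]
f = {"q0": 1, "q1": 3, "q2": 3}
```

In this example both b nodes of the query are mapped to the same document node, so a tree-shaped query is mapped onto a path, which the definition permits.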
There is much research on how to find such a mapping efficiently, and the proposed methods can be categorized into two groups. In the first group [2, 8-17], a tree pattern is typically decomposed into a set of binary relationships between pairs of nodes, such as parent-child and ancestor-descendant relations. Then, an index structure is used to find all the matching pairs, which are joined together to form the final result. In the second group [18-23], a query pattern is decomposed into a set of paths. The final result is constructed by joining all the matching paths together. For all these methods, the join operations involved require exponential time in the worst case. For example, if we decompose a twig pattern into paths to find all the matching paths from a where p is the largest length of a matching path and λ is the number of all such paths.
In this study, we propose a new algorithm that involves no join operations. The algorithm runs in O(|T|⋅Q_leaf) time and O(T_leaf⋅Q_leaf) space, where T_leaf and Q_leaf represent the numbers of leaf nodes in T and in Q, respectively.

TREE ENCODING
To facilitate the checking of reachability (whether a node can be reached from another node through a path), a tree encoding is used [24] .
Consider a tree T. By traversing T in preorder, each node v will obtain a number pre(v) to record the order in which the nodes of the tree are visited. In a similar way, by traversing T in postorder, each node v will get another number post(v). These two numbers can be used to characterize the ancestor-descendant relationships as follows.
Let v and v' be two nodes of a tree T. Then, v' is a descendant of v if and only if pre(v') > pre(v) and post(v') < post(v) [24].
As an example, have a look at the pairs associated with the nodes of the tree shown in Fig. 2. The first element of each pair is the preorder number of the corresponding node and the second is its postorder number. Using such labels, the ancestor-descendant relationships can be easily checked. For instance, by checking the label associated with b against the label for f, we see that b is an ancestor of f in terms of Proposition 1. Note that b's label is (2,4) and f's label is (4, 1) and we have 2 < 4 and 4 > 1. We also see that since the pairs associated with g and c do not satisfy the condition given in Proposition 1, g must not be an ancestor of c and vice versa.
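The numbering can be sketched in a few lines of Python. The tree below is hypothetical: it was chosen only so that b and f receive the labels (2, 4) and (4, 1) quoted above and so that g and c are unrelated; the tree actually drawn in Fig. 2 may differ. The names `number_tree` and `is_ancestor` are likewise assumptions.

```python
def number_tree(children, root):
    """Return (pre, post): preorder and postorder numbers, both starting at 1."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre) + 1          # numbered on entry
        for c in children.get(v, []):
            visit(c)
        post[v] = len(post) + 1        # numbered on exit

    visit(root)
    return pre, post


def is_ancestor(pre, post, v, w):
    """True if v is a proper ancestor of w."""
    return pre[v] < pre[w] and post[v] > post[w]


# Hypothetical tree: a has children b and e; b has children c and d;
# c has child f; e has child g.
children = {"a": ["b", "e"], "b": ["c", "d"], "c": ["f"], "e": ["g"]}
pre, post = number_tree(children, "a")
```

With these numbers, b is an ancestor of f because 2 < 4 and 4 > 1, while neither of g and c is an ancestor of the other.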
Let (p, q) and (p', q') be two pairs associated with nodes u and v, respectively. We say that (p', q') is subsumed by (p, q), denoted (p', q') ⊏ (p, q), if p' > p and q' < q. Then, u is an ancestor of v if (p', q') is subsumed by (p, q).
In addition, if p' < p and q' < q, v is to the left of u. Finally, we can associate each node v with a level number l(v) (the nesting depth of the element in a document). In conjunction with the tree encoding, this number can be utilized to tell whether a node is the parent of another node. For example, if

ALGORITHM FOR SIMPLE CASES
Here, we describe an algorithm for the simple case in which a twig pattern contains only d-edges. First, we give a basic algorithm to show the main idea in 3.1. Then, in 3.2, we discuss how this algorithm can be substantially improved. In 3.3, we prove the correctness of the algorithm and analyze its computational complexity.

Basic algorithm:
The basic algorithm to be given works in a bottom-up way. During the process, two data structures are maintained and computed to facilitate the discovery of subtree matchings.
• Each node v in T is associated with a set, denoted α(v), which contains the nodes q in Q such that T[v] contains Q[q]. Here, T[v] represents a subtree of T rooted at v.
• Each q in Q is associated with a value δ(q), defined as follows: Initially, for each q ∈ Q, δ(q) is set to φ. During the tree matching process, δ(q) is dynamically changed as below:
1. Let v be a node in T with parent node u.
2. If q appears in α(v), change the value of δ(q) to u.
Then, each time before we insert q into α(v), we will do the following checkings:
Below is a bottom-up algorithm, working in a recursive way and taking a node v in T as the input (which represents T[v]). Initially, the input is the root of T. The algorithm marks any node u in T[v] if it finds that T[u] contains Q. In the process, two functions are called:
• node-check(u, q) - It checks whether u matches q, i.e., whether T[u] contains Q[q]. If it is the case, it returns {q}. Otherwise, it returns an empty set ∅.
• leaf-node-check(u) - It returns a set of leaf nodes in Q that match u.
for each q in S do
or to a descendant of u
6. then if q is root then mark u};}
8. return S_1;
end
The algorithm tree-matching( ) searches T bottom-up in a recursive way (see line 4). During the process, for each encountered node v in T, we first check whether it is a leaf node (see line 2). If it is a leaf node, the function leaf-node-check( ) is called (see line 12), by which all the matching leaf nodes in Q are stored in a temporary variable S_2 that is then added to α(v) (see line 13). If v is an internal node, lines 3 - 10 are first conducted and then the function leaf-node-check( ) is invoked (see line 12). By executing line 4, tree-matching( ) is recursively called for each child node v_i of v. After that, for each q appearing in α(v_i), its δ value is set to v (see line 7). In addition, q's parent is inserted into S, a temporary variable to be used in the next step. Since the α(v_i)'s will not be used any more after this step, they are simply removed (see line 8). By executing lines 9 - 10, we check, for each q' in S, whether v matches q' by calling node-check( ), in which the δ values of the child nodes of q' are utilized to facilitate the checkings (see lines 3 - 5 in node-check( )). An example is illustrated in Fig. 3.
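The procedure described above can be approximated by the following Python sketch for the d-edge-only case. It is a reconstruction from the prose, not the original listing: the dictionary-based trees, the function names and the choice to carry the children's α-sets over into α(v) (so that a query node matched deep in T can still propose its parent higher up) are assumptions. The descendant test uses only preorder numbers, relying on the fact that, while v is being processed, every visited node with a larger preorder number lies in T[v].

```python
def tree_matching(t_children, t_label, t_root, q_children, q_label, q_root):
    """Return the set of nodes v in T such that T[v] embeds Q at v (d-edges only)."""
    q_parent = {c: p for p, cs in q_children.items() for c in cs}
    q_leaves = {q for q in q_label if not q_children.get(q)}
    delta = {}       # delta[q]: node of T that has a child subtree containing Q[q]
    pre = {}         # preorder numbers of the nodes of T visited so far
    marked = set()

    def node_check(v, q):
        # v matches q if labels agree and every child q_i of q has
        # delta(q_i) equal to v or to a descendant of v.
        if t_label[v] != q_label[q]:
            return set()
        for qi in q_children.get(q, []):
            d = delta.get(qi)
            if d is None or not (d == v or pre[d] > pre[v]):
                return set()
        if q == q_root:
            marked.add(v)
        return {q}

    def leaf_node_check(v):
        found = {q for q in q_leaves if q_label[q] == t_label[v]}
        if q_root in found:          # Q consists of a single node
            marked.add(v)
        return found

    def visit(v):
        pre[v] = len(pre) + 1
        child_alphas = [visit(vi) for vi in t_children.get(v, [])]
        alpha, candidates = set(), set()
        for a in child_alphas:
            for q in a:
                delta[q] = v
                alpha.add(q)                      # assumption: alpha is carried upward
                if q in q_parent:
                    candidates.add(q_parent[q])   # the set S of the text
        for q in candidates:
            alpha |= node_check(v, q)
        return alpha | leaf_node_check(v)

    visit(t_root)
    return marked


# Hypothetical document and query (book ⇒ title, book ⇒ author).
t_children = {"book1": ["title1", "chapter1"], "chapter1": ["author1"]}
t_label = {"book1": "book", "title1": "title",
           "chapter1": "chapter", "author1": "author"}
q_children = {"qb": ["qt", "qa"]}
q_label = {"qb": "book", "qt": "title", "qa": "author"}
result = tree_matching(t_children, t_label, "book1", q_children, q_label, "qb")
```

In this sketch no join is performed: each node of T is visited once, and a match is decided locally from the δ values of the query node's children.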

Improvements:
The above algorithm can be substantially improved by elaborating the construction of the α(v)'s. First, we notice that in the case that v is a leaf node in T, α(v) is a set of the leaf nodes in Q that match v. Such nodes can be stored in a linked list q_1, ..., q_k, with the left-most node appearing first and the right-most node last. Then, for any 1 ≤ i < j ≤ k, we have pre(q_i) < pre(q_j) and post(q_i) < post(q_j). That is, in α(v), the q_i's are sorted according to their preorder and postorder values.
Now we consider two α-lists α and α' sorted according to their nodes' preorder and postorder numbers. Define a merging operation over α and α', denoted merge(α, α'), as follows:
1. Assume that α = {v_1, ..., v_p} and α' = {v_1', ..., v_q'}.
2. We step through both α and α' from left to right.
3. Let v_i and v_j' be the nodes encountered. We make the following checkings.
The result of merge(α, α') is stored in α and α' remains unchanged. In particular, the changed α is still sorted according to its nodes' preorder and postorder numbers.
In terms of the above discussion, we have the following algorithm to merge two sorted α-lists together.
This algorithm is almost the same as the previous one, but with the merge operation involved, which effectively reduces the size of each α(v) from O(|Q|) to O(Q_leaf). Special attention should also be paid to line 7, by which we generate a set S that contains the parent nodes of all those nodes appearing in the α(v_j)'s (j = 1, ..., k), where v_j is a child node of the current node v. Since the nodes in α (α = merge(α_1, ..., α_{k-1}, α_k)) are sorted left to right (according to the nodes' preorder and postorder numbers), if more than one node in α shares the same parent, they must appear consecutively in the list. So each time we insert a parent node q' (of some q in α) into S, we need to check whether it is the same as the previously inserted one. If it is the case, q' will be ignored. Thus, the size of S is also bounded by O(Q_leaf).
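One plausible way to realize such a merge is sketched below. Query nodes are represented by their (pre, post) pairs, and each input list is assumed to be sorted left to right with no node subsuming another. The rule applied when one node subsumes the other (keep the ancestor, drop the descendant) is an assumption, motivated by the observation that T[v] containing Q[q] implies that T[v] also contains Q[q'] for every descendant q' of q. Unlike the operation in the text, this hypothetical `merge` returns a new list instead of updating α in place.

```python
def merge(alpha, alpha_prime):
    """Merge two alpha-lists of (pre, post) pairs, each sorted left to right.

    Assumption: if one node is an ancestor of the other in Q,
    only the ancestor is kept.
    """
    out = []
    i = j = 0
    while i < len(alpha) and j < len(alpha_prime):
        (p, q), (p2, q2) = alpha[i], alpha_prime[j]
        if (p, q) == (p2, q2):       # the same query node occurs in both lists
            out.append(alpha[i])
            i += 1
            j += 1
        elif p < p2 and q < q2:      # alpha[i] is to the left of alpha_prime[j]
            out.append(alpha[i])
            i += 1
        elif p2 < p and q2 < q:      # alpha_prime[j] is to the left of alpha[i]
            out.append(alpha_prime[j])
            j += 1
        elif p < p2 and q > q2:      # alpha[i] is an ancestor: drop the descendant
            j += 1
        else:                        # alpha_prime[j] is an ancestor of alpha[i]
            i += 1
    out.extend(alpha[i:])
    out.extend(alpha_prime[j:])
    return out


# Hypothetical query tree a(b(d, e), c) with (pre, post) labels:
# a = (1, 5), b = (2, 3), d = (3, 1), e = (4, 2), c = (5, 4).
merged = merge([(3, 1), (5, 4)], [(2, 3)])   # lists {d, c} and {b}
```

Because the result never contains two nodes on the same root-to-leaf path of Q, its length cannot exceed the number of leaves of Q, which is in line with the O(Q_leaf) bound stated above.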

Correctness and computational complexity:
In this subsection, we prove the correctness of the algorithm tree-matching( ) and analyze its computational complexities.

Proposition 1:
Let v be a node in T. Then, for each q in α(v) generated by tree-matching( ), T[v] contains Q[q].
Proof: We prove the proposition by induction on the height of Q, height(Q).
Induction step: Assume that the proposition holds for any query tree Q' with height(Q') ≤ h. We consider a query tree Q of height h + 1. Let r_Q be the root of Q. Let q_1, ..., q_k be the child nodes of r_Q. Then, we have height(Q[q_j]) ≤ h (j = 1, ..., k). In terms of the induction hypothesis, for each q in Q[q_j] (j = 1, ..., k), if it appears in α(v_i) (where v_i is a child node of v), T[v_i] contains Q[q] and δ(q) will be set to v. In particular, if T[v_i] contains Q[q_j] (j = 1, ..., k), we have q_j ∈ α(v_i) and δ(q_j) will be set to v before v is checked against r_Q. Obviously, if label(v) = label(r_Q) and for each q_j (j = 1, ..., k), δ(q_j) is equal to v or a descendant of v, Q can be embedded into T[v]. So r_Q is inserted into α(v).
Now we consider the time complexity of the algorithm, which can be divided into four parts:
1. The first part is the time spent on merging α(v_1), ...,
In terms of the above analysis, we have the following proposition.

Proposition 2:
The time complexity of tree-matching( ) is bounded by O(|T|⋅Q_leaf).
Proof: See the above discussion. Since at each time point at most T_leaf nodes in T are associated with an α-list, the space overhead is bounded by O(T_leaf⋅Q_leaf).

GENERAL CASES
The algorithm discussed earlier can be easily extended to the general case in which a query tree contains both c-edges and d-edges. We only need to make the following changes:
• For each child node q_i of q that is being checked against v, if (q, q_i) is a c-edge, we check whether δ(q_i) is equal to v. If (q, q_i) is a d-edge, we simply check whether pre(δ(q_i)) ≥ pre(v) and post(δ(q_i)) ≤ post(v).
• Accordingly, the algorithm node-check described earlier should be slightly modified.
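The two edge tests can be written as a small helper, shown here as a sketch. The function name `edge_satisfied`, its signature and the sample (pre, post) numbers are assumptions; the numbers belong to a hypothetical tree in which a is the root, b is a child of a, c is a child of b, f is a child of c and e is another child of a.

```python
def edge_satisfied(kind, delta_qi, v, pre, post):
    """Test the edge (q, q_i) of the query against document node v.

    kind: "c" for a c-edge, "d" for a d-edge
    delta_qi: the current value of delta(q_i), or None if it is unset
    """
    if delta_qi is None:
        return False
    if kind == "c":
        # A c-edge can be matched only to a single edge of T.
        return delta_qi == v
    # A d-edge: delta(q_i) must be v itself or a descendant of v.
    return pre[delta_qi] >= pre[v] and post[delta_qi] <= post[v]


# Hypothetical (pre, post) numbers for the nodes a, b, c, f and e.
pre = {"a": 1, "b": 2, "c": 3, "f": 4, "e": 6}
post = {"a": 7, "b": 4, "c": 2, "f": 1, "e": 6}
```

For instance, with δ(q_i) = c the d-edge test succeeds against b, whereas the c-edge test succeeds only when δ(q_i) is b itself.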
(post(δ(q_i)) ≤ post(u))))
10. then flag := false;}
if q is root then mark u;}
13. return S_1;
end
This algorithm is similar to the function node-check( ). The only difference is that a general subsumption checking process is used, by which c-edges and d-edges are checked in different ways.
In addition, lines 5 - 10 in the algorithm tree-matching( ) given in 3.2 should be replaced with the following segment of code:
, q) is a c-edge and q matches v_i)) then {δ(q) := v;
let q'' be the last element in S;
if (q's parent ≠ q'') then S := S ∪ {q's parent};}
else remove q from α(v_i); }}
α := merge(α(v_1), ..., α(v_k));
Concerning the correctness of the algorithm, we have to answer a question: whether any c-edge in Q is correctly checked.
First, we note that any c-edge in Q cannot be matched to any path with length larger than 1 in T. That is, it can be matched only to a single edge in T. It is exactly what is done by the algorithm.
Each time we check a node v in T against some q in Q, we will first set δ values for any q_i appearing in the α(v_j)'s, where v_j is a child node of v. When doing this, for some q_i's, their δ values are changed (to v). Assume that the current δ value for q_i is v' (i.e., δ(q_i) = v'). Then, v' must be a descendant of v since the algorithm searches T in a bottom-up way. However, we need to change δ(q_i) from v' to v since a c-edge can match only a single edge in T and the fact that q_i matches v_j should be recorded so that the c-edge matching is not missed (see Fig. 6 for illustration).
In Fig. 6, v'' is a descendant of v and matches q_2. So δ(q_2) will be set to v'. However, (q, q_2) is a c-edge. Therefore, the fact that v'' matches q_2 makes no contribution to the matching of v with q. Since q_2 also matches v_2, δ(q_2) will be changed to v, which enables us to find that T[v] contains Q[q]. In conjunction with Proposition 1, the above analysis shows the correctness of the algorithm. We have the following proposition.
Proof: See the above discussion.
The time and space complexities for the general cases are the same as for the simple cases.

CONCLUSION
In this article, a new algorithm is proposed for a kind of tree matching, the so-called twig pattern matching. This is a core operation for XML query processing. The main idea of the algorithm is to explore both T and Q bottom-up, by which each node q in Q is associated with a value (denoted δ(q)) to indicate a node v in T, which has a child node v' such that T[v'] contains Q[q]. In this way, the tree embedding can be checked very efficiently. In addition, by using the tree encoding, as well as the subsumption checking mechanism, we are able to minimize the size of the lists of the matching query nodes associated with the nodes in T to reduce the space overhead. The algorithm runs in O(|T|⋅Q_leaf) time and O(T_leaf⋅Q_leaf) space, where T_leaf and Q_leaf represent the numbers of leaf nodes in T and in Q, respectively. More importantly, no costly path join operation is necessary.