TwigINLAB: A Decomposition-Matching-Merging Approach To Improving XML Query Processing

Abstract: The emergence of the Web has significantly increased interest in querying XML data. Current methods for XML query processing still suffer from producing large intermediate results and do not efficiently support queries with mixed types of relationships. We propose the TwigINLAB algorithm to process and optimize query evaluation. TwigINLAB adopts the decomposition-matching-merging approach and optimizes all three sub-processes: it introduces a novel compact labeling scheme, optimizes the matching phase and reduces the number of inspections required in the merging phase. Experimental results on the SwissProt dataset indicate that TwigINLAB outperforms the TwigStack algorithm in execution time by an average of 21.7% for path queries and 18.7% for twig queries.


INTRODUCTION
eXtensible Markup Language (XML) is emerging as the de facto standard for data exchange over the Web. Since XML data is semi-structured, two types of user queries are usually used: full-text queries (keyword-based search) and structural queries (complex queries specified in a tree-like structure) [1]. This paper is concerned with structural queries.
Structural queries can be viewed as sequences of location steps, where each node in the sequence is an element tag or string value. Query nodes are related by either parent-child (P-C) steps or ancestor-descendant (A-D) steps, depicted with a single line and double lines respectively. In addition, adjacent query nodes can be related by a sibling (ordered query) relationship, usually denoted by "[]".
Such queries may be processed through a decomposition-matching-merging process. TWIG-XSKETCH [2], tree signature [3], MPMGJN [4], Stack-Tree [5], and PathStack and TwigStack [6] are examples of query processing using the decomposition-matching-merging approach. Nevertheless, most of these approaches focus only on the second sub-process, the matching phase.
In this paper, we propose (1) a novel hybrid query optimization architecture, INLAB (a combination of INdexing and LABeling techniques), which comprises an XML Parser, XML Encoder, XML Indexer and Query Engine, and (2) query optimization algorithms (TwigINLAB) to process twig queries efficiently without traversing the whole XML tree.
An INLAB label is only 12 bytes in size, much shorter than the labels of previous labeling schemes. This enables quick determination of the P-C relationship between elements in the XML database. To check for an A-D relationship, however, the index table needs to be accessed for confirmation. In addition, INLAB labeling is integer based, and integer processing is very efficient compared to string or bit-vector processing. The index structures of INLAB allow us to efficiently find all elements that belong to the same parent or ancestor.
Our TwigINLAB approach decomposes relationships into a set of path queries. In addition, we focus on optimizing all three decomposition-matching-merging sub-processes. First, we introduce a novel, robust and compact labeling scheme of the form <self-level:parent> to allow quick determination and decomposition of the types of relationships along each path edge. Subsequently, we optimize the matching phase based on each relationship, and finally we reduce the number of inspections required in the merging phase.
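The two checks described above can be illustrated with a minimal sketch. It assumes a label is a triple of integers (self, level, parent) and that the index table can be modeled as a dictionary from a node's self value to its label; the names `Label`, `is_parent`, `is_ancestor` and `pc_table` are hypothetical and not taken from the INLAB implementation. The publications and author labels are the ones quoted in this paper's worked example, while the intermediate book label (12-1:0) is an assumption made so that the example is complete.

```python
from typing import Dict, NamedTuple


class Label(NamedTuple):
    self_id: int  # the "self" attribute of the element
    level: int    # depth in the XML tree (root = 0)
    parent: int   # self_id of the parent element, -1 for the root


def is_parent(a: Label, b: Label) -> bool:
    """P-C check: decided from the two labels alone, no index access."""
    return b.parent == a.self_id


def is_ancestor(a: Label, b: Label, pc_table: Dict[int, Label]) -> bool:
    """A-D check: climb from b toward the root until a's level is reached."""
    level_diff = b.level - a.level
    if level_diff <= 0:
        return False
    node = b
    # Each lookup moves one level up; stop one level below a.
    for _ in range(level_diff - 1):
        node = pc_table[node.parent]
    return node.parent == a.self_id


publications = Label(0, 0, -1)
book = Label(12, 1, 0)      # assumed intermediate node, not given in the text
author = Label(14, 2, 12)
pc_table = {lab.self_id: lab for lab in (publications, book, author)}

print(is_parent(book, author))                      # label comparison only
print(is_ancestor(publications, author, pc_table))  # needs the index table
```

The P-C test touches only the two labels, whereas the A-D test performs one table lookup per intermediate level, which is consistent with the cost difference between the two relationship types reported later in the paper.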
Twig Query Processing: With the increasing popularity of XML data representation, XML query processing and optimization have attracted a lot of research interest [7,8,9,10,11,12]. In this section, we summarize the related work. There are typically two types of decomposition-matching-merging processes: a complex query pattern can be decomposed into a set of basic binary relationships between pairs of nodes, or it can be decomposed into a set of path queries. In both cases, the decomposition is followed by the matching and merging processes. Our INLAB adopts the latter approach and focuses on optimizing all three sub-processes: introducing a novel compact labeling scheme, optimizing the matching phase and reducing the number of inspections required in the merging phase. In the first sub-process, most researchers use the labeling (docno, begin : end, level) for an element and (docno, wordno, level) for a text word as the positional representation of XML elements and texts. We use <self-level:parent> as the positional representation instead. The details are explained in the next section. The MPMGJN [4], Stack-Tree [5] and TwigStack [6] algorithms are based on the (docno, begin : end, level) labeling of XML elements. These algorithms accept two sorted lists of individual matching nodes and structurally join pairs of nodes from both lists to produce the matches of the binary relationships.
Another similar approach is to decompose the twig query into a set of path queries instead. Polyzotis et al. propose methods to reduce the number of intermediate results by introducing a filtration step based on some notion of synopses to facilitate approximate query answers [2]. They propose both TREESKETCH and TWIG-XSKETCH. Another work, by Amer-Yahia et al., preprocesses the query patterns before the matching phase is executed [13]. Since the efficiency of tree pattern matching depends on the size of the pattern, it is essential to identify and eliminate redundant nodes in the pattern before the matching phase takes place. Zezula et al., on the other hand, propose a novel technique, the tree signature, which represents tree structures as ordered sequences of the pre-order and post-order ranks of the nodes [3]. They use tree signatures as the index structure and find qualifying patterns through the integration of structurally consistent path queries. Merging the structural matches in the final process poses the problem of selecting a good join ordering. Wu et al. propose a cost-based join order selection for structural joins [7]. Kim et al. suggest partitioning all nodes in an extent into several clusters [14]. Given two extents to be joined, they propose filtering out unnecessary clusters in both extents prior to the joining process.
Our TwigINLAB algorithm is a generalization of the stack-based algorithm first proposed by Bruno et al. [6] to match twig queries. However, we enhance the query processing by utilizing indexes (built only once) to speed up the matching and merging phases. Further elaboration can be found in the next section.
Fig. 1 shows the INLAB architecture, which consists of the XML Parser to check the well-formedness of the XML document, the XML Encoder to generate the labeling based on the <self-level:parent> scheme, the XML Indexer to create an index storing each node's parent and child information, and the XML Query Engine for pattern query matching. This paper concentrates only on the XML Query Engine (the optimizer). The other components, namely the XML Parser, XML Encoder and XML Indexer, have been reported in [15,16]. The criterion for assessing TwigINLAB is execution time. Fig. 2 depicts an example of an XML document labeled based on <self-level:parent>. Structural relationships between element nodes can be efficiently determined from the label as follows:

1. P-C relationship: node 1 is the parent of node 2 if and only if the self attribute of node 1 is equal to the parent attribute of node 2.
2. A-D relationship: node 1 is an ancestor of node 2 if and only if tracing up the PCTable from node 2, once for each level of difference between the two nodes (leveldiff), leads to node 1.
For example, let publications (0-0:-1) be node 1 and author (14-2:12) be node 2. The leveldiff between the two nodes is two. This means that we need to trace up the PCTable twice, starting from the self attribute of author, to check whether publications is an ancestor of author, as illustrated in Fig. 3. The parent attribute of the retrieved node is equal to the self attribute of publications. Thus, publications and author are in an A-D relationship.
Fig. 4 illustrates the overall processes involved in TwigINLAB processing. Initially, the query pattern is analyzed using the analysisQueryPattern() function. For each query edge, if the edge is a P-C relationship, the parent and child details are updated in the twigPC repository (a hashtable that stores parents and children). During this process, each node in the twig query is associated with a stream. Each stream contains the positional representations of the occurrences of the node in the XML tree (as shown in Fig. 5). The nodes in a stream are sorted by their self attribute, which determines the order in which the nodes are processed. Associated with each stream is a stack, which is used to store the possible intermediate results.
Next, the partitionTwig() function takes place. If the query is a path query (only one leaf node), this function is skipped and processing proceeds to the twigJoin() function. If it is a twig query, however, this function decomposes the twig pattern into two or more path queries. Starting from the root of the twig query pattern, for each start tag event, it pushes the tag into twigStack (a stack that keeps track of the twig query sequence). When it reaches an end tag event, it checks whether the current entry at the top of twigStack is a leaf node. If it is a leaf, the query nodes are added one by one to the vpathList (a vector that stores query nodes in leaf-to-root order) until the root is reached. Finally, the path is output in reverse order by the function reverse(). The final output of this function is a set of path queries in root-to-leaf order in pq (a hashtable that keeps each distinct path query).
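The decomposition step can be sketched as follows. This is an illustrative simplification rather than the authors' partitionTwig() code: it assumes a query node is a pair (tag, children), that each child carries its axis ("/" for P-C, "//" for A-D), and it returns a plain list instead of the pq hashtable. The function name `partition_twig` and the data layout are hypothetical. The sample twig assumes that the paper's "book=publisher" notation denotes an A-D edge.

```python
def partition_twig(root):
    """Decompose a twig pattern into root-to-leaf path queries."""
    twig_stack = []  # (axis, tag) entries from the root down to the current node
    paths = []

    def visit(axis, node):
        tag, children = node
        twig_stack.append((axis, tag))        # start-tag event
        for child_axis, child in children:
            visit(child_axis, child)
        if not children:                      # end-tag event on a leaf node
            vpath = twig_stack[::-1]          # collected leaf-to-root
            paths.append(vpath[::-1])         # emitted root-to-leaf
        twig_stack.pop()

    visit(None, root)  # the root has no incoming edge
    return paths


# book with a P-C child author and (assumed) A-D descendant publisher
twig = ("book", [("/", ("author", [])), ("//", ("publisher", []))])
for path in partition_twig(twig):
    print(path)
```

Each leaf of the twig yields exactly one path query, so a twig with k leaves is decomposed into k paths that all share the root.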
For each path query, the algorithm recursively calls the twigJoin() function to find the possible path matches. Each possible match is pushed into a stack in the twigJoin() function. For instance, using the twig query in Fig. 5 as an example, after the partitionTwig() function there are two path queries: book-author and book=publisher. The path query book-author is processed first. Based on the self attribute of the first occurrence in each of the streams T_book and T_author, query node book is processed first. Element <1-1:0> is then pushed into S_book. The next returned query node is the immediate child of book, which is author. Element <3-2:1> is pushed into S_author because the parent attribute of author is equal to the self attribute of book. Since author is the leaf query node, a partial solution is formed for book-author. Based on the next occurrences, the next returned node is element <4-2:1>, as it has the next smallest self attribute. This element is also pushed into S_author because its parent attribute is equal to the self attribute of book. Since author is the leaf query node, another partial solution is formed for book-author. This process repeats until it reaches the leaf nodes of all paths, as illustrated in Fig. 6.
Next, these matches are merged through the mergeTwig() function, in which all partial solutions from the twigJoin() function are merged to generate the final solutions. This function compares the entries in the partial solutions of two path queries at a time. All the occurrences in the partial solutions are sorted by their self attributes. If the first nodes of the two entries are equal, or if the query edge is a P-C relationship and the second query node is in a sibling and predecessor relationship, the partial solution is added to the final solutions. For a query edge with an A-D relationship, if the second query node is a predecessor, it is added as a final solution. In both cases, the inner loop begins its iteration from the current position j. Hence, this function skips unnecessary iterations over non-feasible partial solutions. However, if the first node in the second path query is greater than node1, the next inner loop begins from position j-1 (for cases where j > 0). Fig. 7 illustrates the merging process.
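The walkthrough above can be approximated with a short sketch for path queries that contain only P-C edges. It is a simplification under stated assumptions and not the twigJoin() function itself: it uses one dictionary per query node in place of the per-node stacks, it assumes that the tags in a path are distinct and that self values follow document order (a parent always has a smaller self value than its children), and it does not handle A-D edges. The name `join_pc_path` is hypothetical. The book and author labels are those of the example, and the author label (9, 2, 8) is an invented non-matching element.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int, int]  # (self, level, parent)


def join_pc_path(path: List[str], streams: Dict[str, List[Label]]) -> List[Tuple[int, ...]]:
    """Return root-to-leaf chains of self values that match a P-C path query."""
    depth = {tag: i for i, tag in enumerate(path)}
    # Visit every candidate element in ascending order of its self attribute.
    events = sorted((label, tag) for tag in path for label in streams[tag])
    # matched[i] maps the self value of an accepted element of query node i
    # to the chain of self values leading to it from the path root.
    matched: List[Dict[int, Tuple[int, ...]]] = [dict() for _ in path]
    solutions = []
    for (self_id, _level, parent), tag in events:
        i = depth[tag]
        if i == 0:
            chain = (self_id,)
        elif parent in matched[i - 1]:
            # P-C test: the element's parent attribute equals an accepted
            # element's self attribute one step up the path.
            chain = matched[i - 1][parent] + (self_id,)
        else:
            continue  # no accepted parent, so the element cannot contribute
        matched[i][self_id] = chain
        if i == len(path) - 1:
            solutions.append(chain)  # leaf reached: a partial solution
    return solutions


streams = {
    "book": [(1, 1, 0)],
    "author": [(3, 2, 1), (4, 2, 1), (9, 2, 8)],  # last entry is hypothetical
}
print(join_pc_path(["book", "author"], streams))
```

Because the elements are visited in ascending self order, every potential parent has already been accepted or rejected by the time one of its children is inspected.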
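The idea of resuming the inner loop from position j, rather than from the start, can be sketched with an ordinary sort-merge join. The sketch below is a deliberately reduced stand-in for mergeTwig(): it joins the partial solutions of two path queries only on equality of their first (shared root) node and does not implement the sibling and predecessor conditions described above. The name `merge_partial` and the publisher self value 7 are hypothetical.

```python
from typing import List, Tuple

Solution = Tuple[int, ...]  # chain of self values, root first


def merge_partial(left: List[Solution], right: List[Solution]) -> List[Solution]:
    """Join two lists of partial solutions, both sorted by root self value."""
    merged = []
    j = 0  # first entry of `right` that can still match
    for l in left:
        # Entries of `right` with a smaller root can never match this or any
        # later entry of `left`, so they are skipped for good.
        while j < len(right) and right[j][0] < l[0]:
            j += 1
        # Scan only the run of entries that share the root, starting at j.
        k = j
        while k < len(right) and right[k][0] == l[0]:
            merged.append(l + right[k][1:])
            k += 1
    return merged


book_author = [(1, 3), (1, 4)]   # partial solutions from the example
book_publisher = [(1, 7)]        # hypothetical publisher element
print(merge_partial(book_author, book_publisher))
```

Keeping j across iterations of the outer loop means that each entry of the second list is skipped at most once, which is the kind of saving in inspections that the merging phase aims for.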
Finally, the final solutions are output through the outputSolution() function.

RESULTS AND DISCUSSION
We have implemented TwigINLAB using the Java API for XML Processing (JAXP). Experiments have been carried out on the SwissProt dataset (112 MB) obtained from the University of Washington XML repository [17]. We modified the SwissProt dataset into various file sizes ranging from 10 MB to 110 MB in order to measure the scalability of both approaches on large-scale datasets.
We evaluated the performance of TwigINLAB against TwigStack on two main types of queries, namely path queries and twig queries. For each type of query, we measured the performance of both algorithms on (a) Q1: queries with P-C relationships, (b) Q2: queries with A-D relationships and (c) Q3: mixed queries.
All our experiments were performed on a 1.7 GHz Pentium IV processor with 512 MB SDRAM running Windows XP. All numbers presented here were produced by running the experiments multiple times and averaging the execution times of several consecutive runs. Figures 8, 9 and 10 show the execution times of TwigINLAB and TwigStack for both path and twig queries.
From these figures, we draw several observations and conclusions:
• When the twig query contains only P-C edges, TwigINLAB performs around 24.5% better than TwigStack (shown in Fig. 9). This may be due to the INLAB labeling scheme, which is optimized to support P-C relationships.
• Although TwigINLAB still outperforms TwigStack by around 17.8% for queries with A-D edges, the difference is smaller than for queries with P-C edges. This may be due to the extra time needed to determine whether two nodes are in an A-D relationship, which requires multiple lookups on the index table until the ancestor level is reached.
• For each test case, the execution time of TwigINLAB increases less drastically than that of TwigStack. This shows that TwigINLAB is more scalable in processing large-scale datasets efficiently.

CONCLUSION
In this paper, we have presented the TwigINLAB algorithm, which optimizes all the sub-processes involved in the decomposition-matching-merging approach. Experimental results show that, in terms of execution time, TwigINLAB performs on average about 21.7% better for path queries and about 18.7% better for twig queries than TwigStack. TwigINLAB is also more scalable than TwigStack. As such, TwigINLAB efficiently supports queries over large-scale datasets.