A New Method of Generating Index Label for Dynamic XML Data


INTRODUCTION
Modern applications depend on data, and XML (eXtensible Markup Language) is widely used on the Internet for representing and exchanging information. Indexing methods for XML documents rely on various numbering and labeling schemes. When an XML document is processed, each node is assigned a unique persistent label that serves as an index and gives fast access to the node during querying. The parent-child relationship can be inferred from the label values of the nodes, which is useful for structural query processing (http://www.w3.org/TR/xquery/). In a dynamic document, however, updates may force the labels of existing nodes to change. When the document is updated frequently, re-computing the labels, rather than evaluating the query, takes up most of the query processing time. The proposed method supports inserting, deleting and updating nodes without changing the labels of existing nodes. The experimental results show reductions in both index size and label generation time compared with existing schemes.

MATERIALS AND METHODS
Various methods have been proposed for processing queries over XML documents, and they differ in how they label and number the nodes. Examples include path indexing, node indexing and structural indexing. The strengths and weaknesses of several node labeling schemes have been discussed by Su-Cheng and Chien-Sing (2009), and various labeling schemes for ancestor queries have been compared by Kaplan et al. (2002). The label re-computation problem is present in the schemes of Tatarinov et al. (2002) and Li and Moon (2001). The methods proposed by Meuss and Strohmaier (1999), Grust (2002), Amato et al. (2003), Lee (2008a, 2008b) and Mirabi et al. (2010) use path indexing; these methods do not support update operations on the XML document. Indexing methods improve query processing performance (Kaelin, 2004). An XML document for a library system and its tree representation are shown in Fig. 1. In this sample document, library is the root node at level 0, it has two child nodes (books) at level 1, and so on. Numbering methods based on prefix values were proposed by Cohen et al. (2002), in which a specific code is assigned to each node in the document; in the double-bit growth approach, the binary code is doubled by adding a sequence of zeros. Tatarinov et al. (2002) used a Dewey prefix-based numbering scheme. Structural information, i.e., the parent-child relationship, can be derived in the system of Li and Moon (2001), which uses range information. Reserved codes are used in the prefix-based system PBiTree proposed by Yu, Luo, Meng and Lu (Jeffrey et al., 2004); when the reserved codes are exhausted, renumbering is needed. In these schemes, an update affects the label values of existing nodes.
The problem caused by updating is shown in Fig. 2. The XML Schema does not prescribe a deterministic way of adding or deleting nodes, which leads to the updating problem in a labeled document. The schema also does not describe the order in which siblings are to be maintained.
However, the order of the siblings or child nodes is very important in querying, especially for structural queries. For example, a query such as "get the title of the 7th book in the library document" needs information about the order of the siblings. For a frequently changing or large document, re-computing the label values takes more time than processing the query itself. The proposed system, NLSXU, provides a solution to this re-computation problem. Queries involving structural patterns can be processed effectively with an indexing scheme that uses persistent labels for the nodes of the XML document.
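The re-computation problem described above can be illustrated with a minimal sketch. It assumes a plain Dewey-style prefix scheme in which each label is the parent's label followed by the 1-based sibling position; this is a simplification for illustration, not the NLSXU scheme. The sample tree and the names `dewey_labels` and `is_parent` are hypothetical.

```python
def dewey_labels(tree, prefix=(1,)):
    """Map each node name to a Dewey-style label for a (name, children) tree."""
    name, children = tree
    labels = {name: prefix}
    for position, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, prefix + (position,)))
    return labels


def is_parent(parent_label, child_label):
    """A node's parent label is its own label without the last component."""
    return child_label[:-1] == parent_label


# Hypothetical library document with two books (node names are made unique).
library = ("library", [
    ("book1", [("title1", []), ("author1", [])]),
    ("book2", [("title2", []), ("author2", [])]),
])
before = dewey_labels(library)

# Insert a new book as the new leftmost child of the root and relabel.
root_name, root_children = library
updated = (root_name, [("book0", [("title0", [])])] + root_children)
after = dewey_labels(updated)

# Existing nodes whose labels had to change because of one insertion.
changed = [name for name in before if before[name] != after[name]]
print(len(changed), "of", len(before), "existing labels changed")
```

In this sketch a single insertion at the front shifts the label of every existing book and of all its descendants, which is the cost a persistent labeling scheme aims to avoid.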

NLSXU-proposed system:
Existing systems include those of Amato et al. (2003), Cohen et al. (2002), Tatarinov et al. (2002) and Li and Moon (2001). Duong and Zhang (2005) proposed the system LSDX (Labeling Scheme for dynamically updating XML Data), which generates labels from a combination of letters and digits.

Unicode characters in labels:
In NLSXU, the labels are generated from letters, digits and Unicode characters (Unicode Consortium, 2010), and the way these are combined differs from other methods. The character set includes the digits (0-9), the uppercase letters (A-Z), the lowercase letters (a-z) and a subset of the Unicode character set. The Unicode Standard contains a set of unified Han ideographic characters; the term "CJK" refers to the Chinese, Japanese and Korean languages. The Unicode value of each character in the set is taken into account so that the order of the siblings of a node in the XML document is maintained. In this experiment, the characters used for generating labels are chosen arbitrarily from the Unicode character set: the ASCII characters (10 digits, 26 uppercase letters, 26 lowercase letters) and 20901 characters from the CJK Unified Ideographs. With this larger character set, the size of the index is greatly reduced. As shown in Fig. 3, the algorithm generates the persistent labels for the XML document in several steps. It is a recursive algorithm that traverses the document in depth-first order and assigns the labels.
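The recursive depth-first labeling can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 3: each label is taken to be the parent's label plus one character for the sibling position, drawn from the character set in increasing code point order; the actual sibling sequence of NLSXU is not reproduced; and Python's standard `xml.etree.ElementTree` parser is used for brevity. The names `build_alphabet` and `assign_labels` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_alphabet():
    """Digits, uppercase, lowercase and CJK ideographs in code point order."""
    ranges = [(0x0030, 0x0039), (0x0041, 0x005A),
              (0x0061, 0x007A), (0x4E00, 0x9FCB)]
    return [chr(cp) for low, high in ranges for cp in range(low, high + 1)]


ALPHABET = build_alphabet()


def assign_labels(element, label, out=None):
    """Depth-first traversal: record (label, tag), then label the children."""
    if out is None:
        out = []
    out.append((label, element.tag))
    for position, child in enumerate(element):
        if position >= len(ALPHABET):
            # Assumption of this sketch: one character per sibling position.
            raise ValueError("too many siblings for single-character codes")
        assign_labels(child, label + ALPHABET[position], out)
    return out


doc = ET.fromstring(
    "<library><book><title/><author/></book><book><title/></book></library>"
)
labels = assign_labels(doc, ALPHABET[0])
for label, tag in labels:
    print(label, tag)
```

Because Python compares strings by code point, sorting the labels produced by this sketch reproduces the document order of the nodes.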
Maintaining the order of the siblings:
The labeling sequence in the proposed system differs from that of existing systems such as LSDX by Duong and Zhang (2005). The previous scheme, NLSX, uses only uppercase letters, lowercase letters and digits. The proposed NLSXU scheme introduces a new way of identifying the siblings of a node: the ASCII characters and the CJK Unified Ideographs are combined in increasing order of their code point values in the Unicode character set. Version 5.2 of the Unicode Standard is used for this purpose. The Unicode code point values of the characters used are given below:
Digits (0-9): "\u0030" to "\u0039"
Uppercase letters (A-Z): "\u0041" to "\u005A"
Lowercase letters (a-z): "\u0061" to "\u007A"
CJK Unified Ideographs (CJK letters): "\u4E00" to "\u9FCB"
Thus, the sequence of characters used for counting the siblings of a node is given in Eq. 1.
Where:
D = The range 0-9
U = The range a-z
L = The range A-Z
CJK = The CJK Unified Ideographs
LCJK = The last character in the CJK Unified Ideographs
Only the first 20901 characters of the CJK Unified Ideographs are used. The code point of the last CJK character considered is "\u9FCB". The characters in each pair of parentheses vary according to the notation used. For example, the pair (LCJK, D) denotes the combinations formed from the last character of the CJK set and the digits 0 to 9. The sequence in Eq. 1 is designed for dynamic documents: at any point, the labels can be extended to both the left and the right by using this sequence, so the re-computation problem is avoided in dynamic XML documents.
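The effect of the larger character set on label length can be sketched as follows. The sketch takes the listed code point ranges at face value, so its CJK block may contain more characters than the 20901 that the text says are used, and it assumes fixed-width sibling codes rather than the actual sequence of NLSXU. The names `RANGES`, `symbols` and `chars_needed` are hypothetical.

```python
# Code point ranges as listed in the text (inclusive end points).
RANGES = {
    "digits": (0x0030, 0x0039),
    "uppercase": (0x0041, 0x005A),
    "lowercase": (0x0061, 0x007A),
    "cjk": (0x4E00, 0x9FCB),
}


def symbols(*names):
    """Concatenate the named ranges into one list of characters."""
    return [chr(cp)
            for name in names
            for cp in range(RANGES[name][0], RANGES[name][1] + 1)]


def chars_needed(sibling_count, alphabet_size):
    """Smallest code length k such that alphabet_size ** k >= sibling_count."""
    length, capacity = 1, alphabet_size
    while capacity < sibling_count:
        length += 1
        capacity *= alphabet_size
    return length


ascii_only = symbols("digits", "uppercase", "lowercase")
with_cjk = symbols("digits", "uppercase", "lowercase", "cjk")

# Taking the ranges in this order keeps the characters in code point order,
# so comparing codes as strings preserves the sibling order.
in_code_point_order = with_cjk == sorted(with_cjk)

ascii_length = chars_needed(10_000, len(ascii_only))
cjk_length = chars_needed(10_000, len(with_cjk))
print(in_code_point_order, ascii_length, cjk_length)
```

Under these assumptions, 10,000 siblings need three characters per code with the 62 ASCII symbols but a single character once the CJK range is included. Whether this also shrinks the index in bytes depends on the encoding: each CJK character in this range takes three bytes in UTF-8 and two in UTF-16, against one byte per ASCII character in UTF-8.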

Solution for dynamic change in the document:
The problem of label re-computation exists in the labeling systems proposed by Tatarinov et al. (2002), Li and Moon (2001), Meuss and Strohmaier (1999), Grust (2002), Amato et al. (2003) and Lee (2008a, 2008b). In NLSXU, when a new first (leftmost) child node is added to an existing document, the position preceding that of the old first child is obtained from the sequence in Eq. 1 and is used as the position of the new first child. The position of every child of a node is identified by the sequence in Eq. 1, so the persistent labels of the existing nodes need not change when any part of the document changes. The ancestors and descendants of a node can also be identified easily from the letters and digits of its label.
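How a new sibling can receive a position without touching existing labels can be sketched with a generic "code between two neighbours" technique over an alphabet ordered by code point. This is a swapped-in illustration, not the sequence of Eq. 1: it uses only the 62 ASCII symbols for readability, and the function name `label_between` and its conventions (empty string for "no left neighbour", `None` for "no right neighbour", no code ending in the smallest symbol) are assumptions of the sketch.

```python
import string

# Digits < uppercase < lowercase, i.e. increasing code point order.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
ZERO = ALPHABET[0]


def label_between(left, right):
    """Return a sibling code that sorts strictly between left and right."""
    if right is not None and left >= right:
        raise ValueError("left must sort before right")
    if left.endswith(ZERO) or (right is not None and right.endswith(ZERO)):
        raise ValueError("codes must not end with the smallest symbol")
    if right is not None:
        # Copy the shared prefix, reading a missing symbol in left as ZERO.
        n = 0
        while n < len(right) and (left[n] if n < len(left) else ZERO) == right[n]:
            n += 1
        if n > 0:
            return right[:n] + label_between(left[n:], right[n:])
    low = ALPHABET.index(left[0]) if left else 0
    high = ALPHABET.index(right[0]) if right is not None else len(ALPHABET)
    if high - low > 1:
        # There is a free symbol between the two first symbols.
        return ALPHABET[(low + high) // 2]
    if right is not None and len(right) > 1:
        # A proper prefix of right sorts before right and after left.
        return right[0]
    # Adjacent first symbols: keep left's first symbol and extend to the right.
    return ALPHABET[low] + label_between(left[1:], None)


siblings = [label_between("", None)]                          # first child
siblings.insert(0, label_between("", siblings[0]))            # new leftmost
siblings.append(label_between(siblings[-1], None))            # new rightmost
siblings.insert(1, label_between(siblings[0], siblings[1]))   # in between
print(siblings)
```

In this sketch every insertion only creates a code for the new node; the codes already assigned stay as they are, and plain string comparison still yields the sibling order.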

RESULTS AND DISCUSSION
The proposed system NLSXU has been implemented in Java 2 JDK 5.0. The Sun Microsystems SAX parser is used for manipulating the nodes of the XML document. The XMark benchmark datasets (IBM Corporation, 1999; Schmidt et al., 2002) are used for the experiments, with scaling factors from 0.01 to 0.5 used to generate XML documents of different sizes. Additionally, experiments were conducted with XML repositories obtained from the XML data repository (Miklau, 2001), the Digital Bibliography Library Project (DBLP) (Ley, 2003) and the Protein Sequence Database. The TreeBank dataset, which has many more distinct elements and a deeper structure, and the SwissProt dataset are also used. The experiments were conducted on a Genuine Intel CPU 2140 at 1.60 GHz with 2.49 GB of RAM and a 75 GB hard disk, running Windows XP, to analyze the performance of the proposed system.

Analysis of the index sizes:
The comparison shows that, for the synthetic datasets, the proposed scheme NLSXU reduces the index space by up to 26, 34, 71 and 95% compared with the NLSX, LSDX, GRP and SP schemes, respectively, as shown in Fig. 4. The index sizes obtained for the real-world datasets with NLSX and NLSXU are compared in Fig. 5. With the proposed scheme NLSXU, the index size is reduced by 81% relative to the existing scheme NLSX.

Time taken for generating labels:
For the synthetic datasets, as shown in Fig. 6, NLSXU reduces the time for generating the labels by 80 and 15% compared with the LSDX and NLSX schemes, respectively. Similarly, for the real-world datasets, as shown in Fig. 7, NLSXU reduces the label generation time by 96 and 66% compared with the LSDX and NLSX schemes, respectively.

CONCLUSION
One of the issues in dynamic XML is the need to re-compute the labels. To avoid this problem, the proposed scheme NLSXU uses a different sequence, based on Unicode characters, for labeling the siblings or child nodes. Additionally, the label provides information about the ancestor-descendant relationships in the document. The comparison of the results for the real-world and synthetic datasets shows that both the size of the generated index and the time taken for generating the labels are reduced substantially. Clustering of the results and ranking of the elements may be added as further improvements, and label compression techniques may further improve the performance.