First Token Algorithm for Searching Compound Terms Using Thesaurus Database



INTRODUCTION
Searching is the basic process in Information Retrieval (IR). Documents, data within documents, relational databases, schemas and the WWW are the main sources from which information can be retrieved. Different types of search engines exist for different data sources. Text search engines are the most common type. Full-text search is the process of examining all of the words in one or more computer-stored documents or a database to match search words supplied by the user. Full-text searching techniques have become widely supported in both web applications and desktop application programs. Text search is applicable in e-business, human resources departments and other areas. It is also a basic feature of word processing applications such as Microsoft Word and of database engines such as Oracle (Doug et al., 2011), MySQL and SQL Server.
A thesaurus is a list of the important terms (single-word or multi-word) in a given domain of knowledge, together with a set of related terms for each term in the list. It is used for indexing, classifying, searching and text mining. Terms in a thesaurus are listed alphabetically, and some are also arranged hierarchically. The hierarchy indicates the relations between terms: the broader term (BT) represents the superclass of a term, while the narrower term (NT) represents its subclass(es). Some thesauri also have the USE and Used For (UF) relations, which indicate alternative terms (Robert, 2006; Abuzir, 2010).
Searching for text in a thesaurus database or any other data source requires traversing each term or compound term of the text. Our objective is to introduce an efficient algorithm for searching within a thesaurus database; this algorithm can be used in either indexing or information retrieval applications. The next sections give an overview of the problem we address, the brute-force algorithm and our enhanced algorithm, the First Token (FT) algorithm. Finally, the results, discussion and conclusion are presented in the last sections.
Background: Searching for text in a database or any other data source is based on string searching algorithms. These algorithms check the existence and the location of a substring (also called a pattern) within another string (Lin, 2009; Chen et al., 2011; Sleit et al., 2009).
Many string matching algorithms were introduced as enhancements of the simplest one. The naïve (brute-force) search is the simplest and the least efficient of the string matching algorithms (Lokman and Zain, 2010). The brute-force algorithm is simple to implement, needs no preprocessing of the text and always finds the result if it exists. It is based on making a comparison at each and every possible position while sliding the search window over the text (Christian and Robert, 2000).
The Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM) algorithms (Lin, 2009) are the most commonly used string matching algorithms. The two are similar in their underlying idea and time complexity, and neither performs complicated arithmetic on characters. The BM algorithm is more complicated than KMP but is a little faster in practice. The Finite State Machine (FSM) (Cormen, 2001) was introduced as a basis for string matching: the algorithm first builds a state table and then simulates it on the input text. The bitap algorithm (also known as the Shift-Or, Shift-And or Baeza-Yates-Gonnet algorithm) is a fuzzy string matching algorithm. It adapts easily to approximate string matching and uses bitwise techniques; it is efficient if the pattern is no longer than the memory-word size of the machine (Manber and Wu, 1992). Benjamin et al. (2006) described the XTM system, which can search for text that matches a set of rules or patterns (regular expressions), such as social security numbers, email addresses and phone numbers. This regular-expression matching can be performed concurrently for up to 50 rules. In recent years keyword search over semi-structured and structured data has been extensively studied by Fredriksson (2010), Al-mazroi and Rashid (2011) and Alajlan et al. (2009). Other researchers (Agrawal et al., 2002; He et al., 2007; Carmel et al., 2003; Vu et al., 2008) treated keyword search in databases as a graph problem. These approaches are computationally expensive. A combination of two algorithms, namely the Berry-Ravindran and Skip Search algorithms, has also been proposed as a hybrid algorithm to boost search performance.
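The relations described above can be pictured with a small data structure. The following is a minimal sketch only: the class name `ThesaurusEntry`, the helper `preferred_term` and the example relations (such as "Information science" as a broader term) are hypothetical and are not taken from any thesaurus used in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ThesaurusEntry:
    """One thesaurus term with the relations named in the text."""
    term: str
    broader: List[str] = field(default_factory=list)   # BT: superclass terms
    narrower: List[str] = field(default_factory=list)  # NT: subclass terms
    used_for: List[str] = field(default_factory=list)  # UF: non-preferred variants
    use: Optional[str] = None                          # USE: preferred term, if any


def preferred_term(thesaurus: Dict[str, ThesaurusEntry], term: str) -> str:
    """Follow the USE relation to the preferred term (one hop only)."""
    entry = thesaurus[term]
    return entry.use if entry.use is not None else entry.term


# Hypothetical entries, for illustration only.
thesaurus = {
    "Information retrieval": ThesaurusEntry(
        "Information retrieval",
        broader=["Information science"],
        narrower=["Text search"],
        used_for=["IR"],
    ),
    "IR": ThesaurusEntry("IR", use="Information retrieval"),
}

resolved = preferred_term(thesaurus, "IR")
```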
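The sliding-window comparison can be sketched in a few lines of Python. This is a generic illustration of naïve string matching, not code from the cited works; the function name `naive_search` is an assumption.

```python
from typing import List


def naive_search(text: str, pattern: str) -> List[int]:
    """Return every start index at which `pattern` occurs in `text`.

    The window slides one position at a time and the pattern is compared
    at each position, with no preprocessing of the text or the pattern.
    """
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            matches.append(start)
    return matches


positions = naive_search("abababa", "aba")
```

Because every window is compared independently, the running time grows with the product of the text and pattern lengths in the worst case, which is the cost that KMP and BM are designed to reduce.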

MATERIALS AND METHODS
The brute-force algorithm (Lin, 2009) is simple to implement, needs no preprocessing of the text and always finds the result if it exists. However, its cost grows in proportion to the size of the problem. For example, consider the problem of counting the occurrences, within a document, of each word that exists in a database field holding one-word terms. The brute-force technique traverses all (t) tokens and queries the database to check the existence of each one, so the total number of queries in this case is (t). Suppose that the terms in the database field are (l-1) tokens long, which means that we can form compound terms of length (l). The total number of queries needed to search for the compound terms can be calculated by Eq. 1. Figures 1-3 show the growth in the number of queries with respect to the text size and the maximum number of tokens in the database field.
To explain the previous formula and graphs, consider the following text: "Information Retrieval (IR) is the science of searching for documents, or information within documents, as well as that of searching relational databases and the World Wide Web". Also consider the following list of terms:

Id   Term                               Tokens count
1    Information retrieval              2
2    IR                                 1
3    World Wide Web                     3
4    Information technology standards   3

In our first test, we used our sample text about information retrieval. To search the text using the brute-force algorithm, the text must be traversed 3 times, which is the maximum number of tokens of a term in the list. In the first phase the algorithm searches for single-token terms: each word in the text is used to query the database. In this case, the number of queries equals the number of tokens in the text, (t). In the second phase, a compound term of two words is treated as one term and is used to query the database. So the first term in our example is "Information Retrieval", the second is "Retrieval IR", and so on; this yields (t-1) terms in this round. The third phase uses terms consisting of three tokens, starting from "Information Retrieval IR" and ending with "World Wide Web". The number of queries in this round is (t-2). The total number of queries in all three phases of our example can be calculated using Eq. 2:

$$\text{Total queries} = t + (t-1) + (t-2) \qquad (2)$$

In general, the total number of queries for a text of (t) tokens, where (l) is the maximum token count of the terms in the database, is given by Eq. 3:

$$\text{Total queries} = t + (t-1) + \dots + (t-l+1) = \sum_{i=0}^{l-1} (t-i) \qquad (3)$$

Consider the following series:

$$1 + 2 + 3 + \dots + (t-l) + (t-l+1) + (t-l+2) + \dots + (t-3) + (t-2) + (t-1) + t = \sum_{n=1}^{t} n$$

As a result, the total number of queries can be expressed by the following formula:

$$\text{Total queries} = \sum_{n=1}^{t} n - \sum_{n=1}^{t-l} n = l\,t - \frac{l(l-1)}{2} \qquad (4)$$

From Eq. 4 and our sample text we can calculate the total number of queries for the brute-force algorithm. The text contains 14 tokens (t) (tokens are in bold; the rest are stop words and are ignored by the system), and the maximum token count (l) of a term is 3. The total number of queries is therefore 3 × 14 - 3 = 39. Our study is based on the existing approach and on an analysis of the effect of different sources on the total number of queries and on the total time.
We describe the structure of the database and explain how our approach reduces the number of queries and the total time required to complete the task.
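The brute-force count can be checked with a short Python sketch that enumerates every candidate phrase of one to (l) tokens. The token list below is an assumption: it is our reading of the sample sentence after stop-word removal, chosen so that it contains 14 tokens, and the function names are illustrative only.

```python
from typing import List


def brute_force_candidates(tokens: List[str], max_len: int) -> List[str]:
    """List every phrase the brute-force scan would send to the database.

    Phase k (k = 1..max_len) slides a window of k tokens over the text,
    so it produces len(tokens) - k + 1 candidate phrases.
    """
    candidates = []
    for length in range(1, max_len + 1):
        for start in range(len(tokens) - length + 1):
            candidates.append(" ".join(tokens[start:start + length]))
    return candidates


def closed_form(t: int, l: int) -> int:
    """Closed form l*t - l*(l-1)/2, valid when l <= t."""
    return l * t - l * (l - 1) // 2


# Assumed tokenization of the sample sentence (stop words removed).
SAMPLE_TOKENS = [
    "Information", "Retrieval", "IR", "science", "searching", "documents",
    "information", "document", "searching", "relational", "databases",
    "World", "Wide", "Web",
]

candidates = brute_force_candidates(SAMPLE_TOKENS, 3)
total_queries = len(candidates)  # one database query per candidate phrase
```

Under these assumptions the enumeration and the closed form agree on 39 queries for t = 14 and l = 3.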
Database structure: The proposed enhancement depends on creating two additional tables related to the main list of terms in the database. The first contains the first token of each term, while the second contains the Ids of the terms that begin with a specified token. The Entity-Relation (E-R) diagram in Fig. 4 illustrates the relations between the tables. Table 1 shows an instance of the database for our sample example. The flowchart in Fig. 5 explains how to use the E-R diagram in Fig. 4 and Table 1; it shows the main steps in searching for a term in the database. The process of searching for text terms in the database can be performed by traversing the text tokens only once. In this phase each token of the text is used to query the (Tokens) table of the new model.
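One possible realization of this structure is sketched below using Python's built-in `sqlite3` module. The table names follow the text (Tokens, Terms_Tokens, Terms); the column types, the `build_index` helper and the choice of SQLite are assumptions made for illustration and are not details given in the study.

```python
import sqlite3
from typing import List


def build_index(terms: List[str]) -> sqlite3.Connection:
    """Create the three tables in an in-memory database and fill them once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Terms (TermID INTEGER PRIMARY KEY, Term TEXT,
                            TokensCount INTEGER);
        CREATE TABLE Tokens (TokenID INTEGER PRIMARY KEY, Token TEXT UNIQUE);
        CREATE TABLE Terms_Tokens (TokenID INTEGER REFERENCES Tokens(TokenID),
                                   TermID INTEGER REFERENCES Terms(TermID));
    """)
    for term_id, term in enumerate(terms, start=1):
        words = term.split()
        conn.execute("INSERT INTO Terms VALUES (?, ?, ?)",
                     (term_id, term, len(words)))
        # Store each distinct first token only once.
        conn.execute("INSERT OR IGNORE INTO Tokens (Token) VALUES (?)",
                     (words[0],))
        (token_id,) = conn.execute(
            "SELECT TokenID FROM Tokens WHERE Token = ?", (words[0],)
        ).fetchone()
        # Link the first token to every term that begins with it.
        conn.execute("INSERT INTO Terms_Tokens VALUES (?, ?)",
                     (token_id, term_id))
    return conn


SAMPLE_TERMS = [
    "Information retrieval",
    "IR",
    "World Wide Web",
    "Information technology standards",
]

conn = build_index(SAMPLE_TERMS)
token_rows = conn.execute(
    "SELECT TokenID, Token FROM Tokens ORDER BY TokenID").fetchall()
link_rows = conn.execute(
    "SELECT TokenID, TermID FROM Terms_Tokens ORDER BY TokenID, TermID"
).fetchall()
```

For the four sample terms this yields three rows in Tokens and four rows in Terms_Tokens. The lookup here is an exact, case-sensitive match on the first token; a real system would need to decide how to normalize case.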

The proposed algorithm:
If the system returns a (TokenID) from the Tokens table, two extra queries are needed. The first queries the Terms_Tokens table to get all (TermIDs) of the terms that begin with the specified (TokenID). The second queries the Terms table to get, for those (TermIDs), a temporary list of (Tokens count) values and the list of the corresponding terms in the thesaurus database (Terms List). The (Tokens count) of each term determines the length of the compound term that our system extracts from the text collection.

Table 1: An instance of the database for the sample example

Tokens
Token ID   Token
1          Information
2          IR
3          World

Terms_Tokens
Token ID   Term ID
1          1
1          4
2          2
3          3

Terms
Term ID   Term                               Token count
1         Information retrieval              2
2         IR                                 1
3         World Wide Web                     3
4         Information technology standards   3

Fig. 5: Searching for a term in the databases model
The system parses our sample text collection using the token counts and constructs a list of compound terms (Built List) that start with the token in the query. Finally, the system uses the list of terms returned by our query (the list of thesaurus terms from the database) to search for occurrences of these terms among the compound terms extracted and built by our system from the text collection. This model automates and restricts the construction of compound terms from the text collection: both the length of each compound term and its starting token are known.
Returning to our example and starting with the token "Information", we query the Tokens table. This gives (TokenID = 1), meaning that we need to perform two extra queries. First, we use (TokenID = 1) to query the Terms_Tokens table, which returns the following list of TokenIDs and TermIDs:

TokenID   TermID
1         1
1         4

Now we query the Terms table for TermIDs 1 and 4. The result of this query contains the terms and their (TokensCount), as follows:

TermID   Term                               Tokens Count
1        Information retrieval              2
4        Information technology standards   3

Based on this result, the system builds terms from the text collection starting at the current token, with lengths of 2 and 3 tokens. The built list is as follows:

Information retrieval
Information Retrieval (IR)

The final step is to check whether the terms from the query result table exist in the Built Terms list. Numerically, our example needs 13 queries to the Tokens table, plus 3 × 2 extra queries when processing the tokens "Information", "IR" and "World", whereas the brute-force technique needs 39 queries. The following pseudocode listing of the proposed algorithm illustrates the proposed enhancement.
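The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the original listing: dictionary lookups stand in for the three database queries, matching is case-insensitive, the token list is our reading of the sample sentence after stop-word removal, and the name `first_token_search` is hypothetical. The query count it returns depends on these tokenization and case-folding choices.

```python
from typing import Dict, List, Tuple


def first_token_search(tokens: List[str],
                       terms: List[str]) -> Tuple[List[Tuple[int, str]], int]:
    """Find thesaurus terms in `tokens` with a first-token index.

    Returns (matches, query_count), where each match is (position, term).
    """
    # Built once: first token -> [(term, token count), ...]. This plays the
    # role of the Tokens, Terms_Tokens and Terms tables.
    index: Dict[str, List[Tuple[str, int]]] = {}
    for term in terms:
        words = term.lower().split()
        index.setdefault(words[0], []).append((term, len(words)))

    matches: List[Tuple[int, str]] = []
    queries = 0
    for pos, token in enumerate(tokens):
        queries += 1  # query the Tokens table with the current token
        found = index.get(token.lower())
        if found is None:
            continue
        queries += 2  # two extra queries: Terms_Tokens, then Terms
        for term, count in found:
            # Build the compound term of the required length from the text.
            built = " ".join(tokens[pos:pos + count])
            if built.lower() == term.lower():
                matches.append((pos, term))
    return matches, queries


SAMPLE_TERMS = [
    "Information retrieval",
    "IR",
    "World Wide Web",
    "Information technology standards",
]
# Assumed tokenization of the sample sentence (stop words removed).
SAMPLE_TOKENS = [
    "Information", "Retrieval", "IR", "science", "searching", "documents",
    "information", "document", "searching", "relational", "databases",
    "World", "Wide", "Web",
]

matches, queries = first_token_search(SAMPLE_TOKENS, SAMPLE_TERMS)
```

The text is traversed once, and compound terms are built only at positions whose token is the first token of some thesaurus term, which is where the saving over the brute-force scan comes from.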

RESULTS
An experiment: Our data collections consist of five different thesauri, summarized in Table 2. A sample of 15 text collections was used to test our system. We experimented with these collections and databases using texts of different lengths, ranging from 50 to 991 tokens. The system uses a stop list to remove noisy terms from the text collections. We ran both algorithms, Brute Force (BF) and First Token (FT), on our data collections and thesauri. In each experiment we measured the average processing time of each algorithm as the number of tokens varied from 50 to 991. We plotted and compared the results of each experiment.
In the first experiment we used the first thesaurus (Training Thesaurus). The Training Thesaurus constitutes the controlled vocabulary of reference in the field of vocational education and training. We used our tool ThesCov to build this thesaurus from a Web site related to the domain of training. The other thesauri were also constructed using ThesCov (Abuzir, 2010).

DISCUSSION
Table 3 gives the (normalized) average time for both algorithms. Comparing the results for the brute-force algorithm and the First Token (FT) algorithm, we conclude that the FT algorithm is more time-efficient for all text lengths, especially when a large number of tokens must be matched. The graph in Fig. 6 shows the time required by each algorithm using the first thesaurus.
We repeated the test with the other four thesauri and different data collections. A summary of the average time required by both the Brute Force and First Token algorithms to search for terms of different lengths from our text collection using the thesauri is shown in Table 3. Figures 7 and 8 show the time required by the BF and FT algorithms, respectively, using the different thesauri. Figure 9 shows the time required by both the BF and FT algorithms using the different thesauri.
The worst case of the proposed algorithm occurs when every token of the text is found in the Tokens table, which means that two extra queries are needed for each token. In this case we need the same total number of queries as the brute-force algorithm.

CONCLUSION
In summary, the proposed approach builds a new database structure (Fig. 4). The system creates these tables only once. Using the new tables, the system decreases the number of queries needed to search the database. The new structure builds a new index that is used to search the thesaurus database instead of the whole database: the system searches for any term in the new indexed tables rather than in the original database.
In this study we proposed a string searching algorithm called First Token (FT) as an improvement on the brute-force algorithm. Our experiments on our data collections showed that the proposed algorithm is efficient and performs faster than the brute-force algorithm. It decreases both the number of queries required to search the database and the search time.