CTSS: A Tool for Efficient Information Extraction with Soft Matching Rules for Text Mining

The abundance of information available digitally in the modern world has created a demand for structured information. The problem of text mining, which deals with discovering useful information from unstructured text, has attracted the attention of researchers. The role of Information Extraction (IE) software is to identify relevant information in texts, extracting it from a variety of sources and aggregating it into a single view. Existing information extraction systems depend on particular corpora and have poor recall. Developing a domain-independent system with improved recall is therefore an important challenge for IE. In this research, the authors propose a domain-independent algorithm for information extraction, called SOFTRULEMINING, for extracting the aim, methodology and conclusion from technical abstracts. The algorithm is implemented by combining a trigram model with softmatching rules. A tool, CTSS, was constructed using SOFTRULEMINING and tested on technical abstracts from www.computer.org and www.ansinet.org; the tool showed improved recall, and with it improved precision, in comparison with other search engines.


INTRODUCTION
The specific notion of information extraction received wide attention in the 1990s through the series of Message Understanding Conferences, founded by the US defense research agency DARPA. Through these conferences, researchers from NLP and IE used common evaluations to accelerate their research progress and compared different systems, which gave a certain transparency to the field.
Previous studies have shown that the techniques adopted in IE include bag of words; natural language processing techniques, which may utilize rule-based grammars, part-of-speech taggers and parsers; development of templates; learning methods; hidden Markov models; Bayesian networks; data compression; and machine learning [7,11,21]. Bag of words is the traditional method used for extracting information such as sentiment (Casey Whitelaw, Navendu Garg and Shlomo Argamon) [11], library book categorization [32], topic ontology [7] and feature generation for text categorization using world knowledge [22].
Statistical hidden state sequence models, such as Hidden Markov Models (HMMs) [24], Conditional Markov Models (CMMs) and Conditional Random Fields (CRFs) [30], are a prominent recent approach to information extraction tasks. Other existing IE systems include one that extracts information on interacting proteins from biomedical text using manually developed patterns [21]; the Snowball system, which extracts the names of organizations and their headquarters by generating patterns and extracting tuples from plain-text documents; a system that uses genre-based extraction patterns and natural language processing techniques to extract the rhetorical information contained in technical abstracts [29]; and one that extracts a database from postings to USENET newsgroups such as Austin.jobs using predefined templates [31]. By discovering predictive relationships between different pieces of extracted data, data mining algorithms can be used to improve the accuracy of information extraction. The recall of an IE system is typically significantly lower than its precision; such predictive relationships can be productively used to improve recall by suggesting additional information to extract.
System Architecture: The objective of the system is to extract the aim, methodology and conclusion specified by authors in technical abstracts. The general architecture of a text mining system is depicted in the accompanying figure. The system extracts information from multiple documents, stores it in a database and uses data mining techniques to extract knowledge in the form of rules. Knowledge Discovery in Databases benefits IE by discovering rules that support predictions, which can in turn improve the accuracy of subsequent IE.
Parsing: To extract the rules, the IE task takes the set of tagged documents and produces a template representation for every document, which can easily be converted into rule-like form. For this purpose, a set of domain-independent extraction patterns is written and matched against the input documents. Each extraction pattern constructs an output representation that involves two levels of linguistic knowledge: the rhetorical information expressed in the abstract and the semantic information contained in it, which is later converted into a predicate-like form. In each pattern, the left-hand expression states the pattern to be identified and the right-hand side (following the colon) states the corresponding semantic action to be produced.
The process starts by splitting a given sentence into tokens (words), from which stop words such as the, a, an and it are removed, as they contribute no meaning to the recognition of the key terms used for IE. Morphological and lexical processing concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the word proposed is derivable from the meaning of the verb propose, the stem word), and the addition of suffixes may transform a verb into an adverb. Morphological processing here deals with identifying the stem word, which is a verb. Syntactic analysis concerns how words can be put together to form correct sentences; it determines what structural role each word plays in the sentence and which phrases are subparts of other phrases. Domain analysis covers the general knowledge about the structure of the world that language users must have in order to, for example, maintain efficient knowledge discovery. The Verb Phrase (VP) is decomposed into two elements, the predicate action and the sequence of terms that represent its argument, as in Generalize (error, diffusion, produce, FM, halftone..):
• Generalize is the predicate action
• Error, diffusion,… are its arguments
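The preprocessing steps described above (tokenizing, stop-word removal, stemming and the predicate form) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list, the suffix list and the function names are hypothetical, and the crude suffix stripper stands in for whatever morphological analysis the system actually uses.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "it", "from", "of", "to", "in", "and"}

# Hypothetical suffixes for a naive stemmer; a stand-in for real morphological analysis.
SUFFIXES = ("ing", "ed", "s")


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry no meaning for key-term recognition."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters.

    Crude on purpose: verbs ending in "e" lose it (e.g. "proposed" -> "propos").
    """
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_predicate(verb, argument_tokens):
    """Render a verb phrase as stem(arg1, arg2, ...)."""
    return f"{naive_stem(verb)}({', '.join(argument_tokens)})"


tokens = remove_stop_words(tokenize("The system extracted the aim from an abstract"))
print(tokens)                                          # ['system', 'extracted', 'aim', 'abstract']
print(to_predicate("extracted", ["aim", "abstract"]))  # extract(aim, abstract)
```

A dictionary-based lemmatizer would recover full stems such as propose; the suffix stripper is used here only to keep the sketch self-contained.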
The documents are parsed and the type of every word is analyzed. The trigram set is used to extract the essential features and is expressed in tuple form as (previous token, current token, next token):
• Current Token: The token in its full form, as it occurs in the text. A verb is always taken as the current token
• Previous Token: The token to the immediate left of the current token, or a special marker if the current token is first in the sentence
• Next Token: The token to the immediate right of the current token, or a special marker if the current token is last in the sentence
If the current token (a verb) is a form of be, such as is, the model retrieves the next verb as the keyword. If the sentence is in the active form, the keywords following the verb are retrieved; if it is in the passive form, the entire sentence from the beginning is extracted. Fifteen rules that satisfy the trigram model were written.
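The (previous token, current token, next token) tuples can be illustrated with a short sketch. The boundary marker `<S>`, the verb lexicon passed in as a set (standing in for a part-of-speech tagger) and the function names are assumptions made for this example, not details given in the text.

```python
BOUNDARY = "<S>"  # hypothetical special marker for the start or end of a sentence
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


def verb_trigrams(tokens, verbs):
    """Return (previous, current, next) for every token found in the verb lexicon."""
    trigrams = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in verbs:
            continue
        prev_tok = tokens[i - 1] if i > 0 else BOUNDARY
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else BOUNDARY
        trigrams.append((prev_tok, tok, next_tok))
    return trigrams


def keyword_verb(tokens, verbs):
    """Return the first verb that is not a form of "be", or None.

    This mirrors the rule that when the current verb is "is" (be),
    the next verb is taken as the keyword.
    """
    for tok in tokens:
        low = tok.lower()
        if low in verbs and low not in BE_FORMS:
            return tok
    return None


sentence = ["algorithm", "is", "proposed"]
lexicon = {"is", "proposed"}  # toy verb lexicon (assumption)
print(verb_trigrams(sentence, lexicon))
# [('algorithm', 'is', 'proposed'), ('is', 'proposed', '<S>')]
print(keyword_verb(sentence, lexicon))  # proposed
```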
Softmatching rules: In this work, information is extracted using the trigram model, and the rules are constructed from patterns that need not strictly adhere to the procedure. Fig. 2 shows a sample of the softmatching rules that are introduced.
The rules are called softmatching rules because they are built from frequently occurring terms that best fit the templates. The introduction of these softmatching rules has improved the precision as well as the recall. The algorithm SOFTRULEMINING, which performs information extraction using softmatching rules, is depicted in Fig. 3.
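The contrast between a strict rule and a softmatching rule can be pictured with a small sketch. The text does not spell out how SOFTRULEMINING relaxes a match, so the criterion below (the cue verb must match, but only one of the two context slots has to) is purely an assumption for illustration. The `Rule` class, its fields and the sample vocabulary are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str             # e.g. "aim", "methodology" or "conclusion"
    previous: frozenset    # acceptable previous tokens
    current: frozenset     # acceptable cue verbs (the current token)
    following: frozenset   # acceptable next tokens


def hard_match(rule, trigram):
    """Strict matching: all three slots of the trigram must fit the rule."""
    prev_tok, cur_tok, next_tok = trigram
    return (prev_tok in rule.previous
            and cur_tok in rule.current
            and next_tok in rule.following)


def soft_match(rule, trigram, min_context=1):
    """Relaxed matching (assumed criterion): the cue verb is required,
    but only `min_context` of the two context slots must fit."""
    prev_tok, cur_tok, next_tok = trigram
    if cur_tok not in rule.current:
        return False
    context_hits = (prev_tok in rule.previous) + (next_tok in rule.following)
    return context_hits >= min_context


aim_rule = Rule(
    label="aim",
    previous=frozenset({"paper", "we", "authors"}),
    current=frozenset({"propose", "proposes", "present"}),
    following=frozenset({"a", "an", "the"}),
)

trigram = ("study", "proposes", "a")  # "study" is not in the rule's previous slot
print(hard_match(aim_rule, trigram))  # False
print(soft_match(aim_rule, trigram))  # True
```

Under this assumed criterion, a soft rule accepts sentences that a strict rule would miss, which is one way softer matching could raise recall.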
As information extraction systems are domain-specific, machine learning plays a vital role in classification and prediction. During the learning process, a sample of the database is used to train the system to properly perform the desired task. The quality of the training data determines how well the program learns. The documents are trained with a bag of words and, in order to normalize the keywords, the inverse document frequency is used, with each document represented as a term vector of the form $\bar{a} = (a_1, a_2, \ldots, a_n)$.

Fig. 4: Architecture of CTSS (components include URL Crawler and Hash)

Each term $a_i$ has a weight $w_i$ associated with it, where $w_i$ denotes the normalized frequency of the word in the vector space: $w_i = tf_i \cdot idf_i$, where $tf_i$ is the term frequency of $a_i$ and $idf_i$ is the inverse document frequency, $idf_i = \log(N/DF)$, where $N$ is the total number of documents and $DF$ is the number of documents in the text collection in which the term appears.
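The weighting w_i = tf_i · idf_i with idf_i = log(N/DF) can be worked through on a toy collection. The sketch assumes raw counts for the term frequency and the natural logarithm, since the text fixes neither. The function name and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute w_i = tf_i * log(N / DF) for every term of every document.

    `documents` is a list of token lists. Returns one {term: weight} dict
    per document. Assumes raw-count tf and natural log.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


docs = [
    ["rule", "mining", "rule"],
    ["rule", "extraction"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection the term "rule" occurs in both documents, so its idf is log(2/2) = 0 and its weight vanishes, while "mining" and "extraction" each receive log(2).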

Architecture of CTSS, the information extraction tool:
CTSS is a tool developed in Java for extracting information from different URLs. It is implemented using the SOFTRULEMINING algorithm, as that algorithm has shown better recall than the trigram model alone. Figure 4 shows the architecture of CTSS.
Given a URL as input, the web crawler fetches the pages from the links present. The system searches the pages with the given set of patterns and, where a pattern matches, indexes the selected strings and stores them in a hash table. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by an indexer and a sorter. The indexer performs a number of functions: it reads the hash table contents and records each word and its corresponding position in the document.
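The docID assignment and the hash table of word positions can be sketched as below. CTSS itself is described as a Java tool; Python is used here only for brevity, and the `Indexer` class, its method names and the example URLs are hypothetical. The sketch takes page text as a string and does no crawling.

```python
import re
from collections import defaultdict


class Indexer:
    """Toy indexer: maps URLs to docIDs and matched words to their positions."""

    def __init__(self):
        self.doc_ids = {}                # URL -> docID
        self.index = defaultdict(list)   # word -> list of (docID, position)

    def doc_id(self, url):
        """Assign the next docID the first time a URL is seen."""
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def add(self, url, text, patterns):
        """Record every pattern word in `text` with its token position."""
        did = self.doc_id(url)
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            if word in patterns:
                self.index[word].append((did, position))
        return did


indexer = Indexer()
patterns = {"propose", "conclude"}  # toy cue words (assumption)
indexer.add("http://example.org/a", "We propose a tool", patterns)
indexer.add("http://example.org/b", "We conclude that we propose too much", patterns)
print(dict(indexer.index))
# {'propose': [(0, 1), (1, 4)], 'conclude': [(1, 1)]}
```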
Another important function of the indexer is to parse out all the links on the web pages and store the extracted information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The searcher is run by the web server and uses the lexicon given by the user and the indexer to extract the information. The algorithm for the proposed approach is explained in Fig. 5.

RESULTS AND DISCUSSION
Discovered knowledge is only useful and informative if it is accurate, so it is important to measure it on independent test data. For the dataset, 200 abstracts were collected from www.computer.org, forming two data sets related to information retrieval and image processing, and were manually annotated with the correct extraction patterns. In order to construct the patterns, the classification algorithms C4.8, Random tree, Random forest, Decision tree and Decision stump were used with 10-fold cross-validation. A genetic algorithm with crossover probability 0.99 and mutation level 0.01 was also run, and it was found to produce better recall than the other classification methods. The data trained using the genetic algorithm was then used for constructing patterns.
Patterns are constructed from the tokens trained using the genetic algorithm, and SOFTRULEMINING is then used for information extraction. The results obtained using SOFTRULEMINING are compared with the results of the HMM model and of hard-matching rules. The results are given in Table 1.
The constructed patterns are verified using the training data and tested on different domains such as www.computer.org and www.ansinet.org. Three-fourths of the technical magazines from www.computer.org were checked using the proposed algorithm, and the system was found to have improved its recall after the introduction of softmatching rules.
The rules identify occurrences of the current token preceded and followed by tokens in the specified order, and a threshold (inverse document frequency, between 0 and 1) is computed for each. If a rule satisfies the threshold condition, it is added to the rule set; otherwise it is pruned. For each rule extracted, the training data is checked against the current token; if it matches, the rule is extracted and stored in the structured format.
The tool was run on a P-IV system, and the extraction times using the Google search engine and SOFTRULEMINING were studied. The proposed system showed better recall and saved time compared with the Google search engine when extracting the specified information, as shown in Table 2. With the Google search engine, the relevant documents are fetched first, and scanning through them to extract the key information is time-consuming, whereas in CTSS the web pages are scanned and indexed directly, which saves time.
Evaluation: After designing a set of probabilities and an algorithm for a particular application, it is necessary to evaluate the efficiency of the algorithm. The general method is to divide the corpus into two parts: the training set and the test set. The test set typically consists of 10-20% of the total data. The training set is used to estimate the parameters and the algorithm is then run on the test set; running the algorithm on the training set is not considered a reliable method of evaluation, because it does not measure performance on unseen data. A more thorough method of testing is cross-validation, which involves repeatedly holding out a different part of the corpus as the test set, training on the remainder of the corpus and then evaluating on the new test set.
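The evaluation procedure can be illustrated with a small sketch of k-fold splitting together with the usual definitions of precision and recall. The function names and the toy data are assumptions. The fold assignment (every k-th item) is one simple choice among several and is not taken from the text.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # fold i holds items i, i+k, i+2k, ...
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def precision_recall(extracted, relevant):
    """Precision = correct / extracted; recall = correct / relevant."""
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


abstracts = list(range(10))           # stand-ins for ten annotated abstracts
splits = list(k_fold_splits(abstracts, 5))
print(splits[0])                      # ([1, 6, 2, 7, 3, 8, 4, 9], [0, 5])

# Three items extracted, two of them among the four annotated as relevant.
precision, recall = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
print(recall)                         # 0.5
```

With five folds, each test fold holds 20% of the data, which matches the upper end of the 10-20% range mentioned above.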

CONCLUSION
Since the success of any machine learning algorithm depends on the type of features selected, 120 patterns were written using softmatching rules, which have improved the recall value of the information extraction system. The following are some of the findings of the system:
• In specifying the aim and conclusion, authors in different domains use a smaller, more frequent set of tokens than they do for specifying the methodology. More training is needed for identifying tokens for methodology
• The system was tested with different websites containing technical abstracts, and the introduction of softmatching rules showed good performance over the existing methods. Therefore the proposed system can be considered a domain-independent system
• The algorithm SOFTRULEMINING has been proposed and it has shown a recall of 84%, as against the other methods, which have shown recall of 70% or less
• The previous technique dealt with a single domain and with manually collected documents, whereas the proposed algorithm is tested with live data from www.computer.org and www.ansinet.org. Its recall is better than that of the Google search engine
• The construction of patterns needs efficient learning algorithms. The system tokens are classified and trained using classification techniques such as C4.8, Random tree, Random forest, Decision tree and Decision stump with 10-fold cross-validation. Classification is similarly done using a genetic algorithm at various crossover probabilities such as 0.6, 0.7, 0.8 and 0.99 and mutation level 0.01, of which crossover probability 0.99 has shown good recall compared with the other methods. Therefore the genetic algorithm is used for the purpose of learning
• SOFTRULEMINING is implemented as a tool called CTSS, which fetches abstracts from given URLs, extracts the information and stores it in the form of a database. The proposed tool CTSS is found to show better recall than the results obtained by extracting information through the Google search engine