© 2005 Science Publications
A Methodology to Segment the Text for Index Terms

Information overload has become a pressing problem with the growth of the World Wide Web. Tools that can absorb this volume of information and mitigate the problem are clearly needed, especially for information retrieval (IR) systems. Text is not a simple sequence of words; it carries structure. It is therefore essential to handle complex sentence structures and the grammatical and lexical irrelevancy of different units. The main idea is to segment the text into elementary units, which are simpler and less complex than the text they come from. We use cue phrases and punctuation for this purpose. We present an algorithm that is efficient and handles more than 500 cue phrases and most punctuation marks. The elementary units it yields can be passed to a Rhetorical Relations Finder to obtain the relations among them; these relations can in turn be used by an RST Parser to construct an RST tree, which will be used to design an RST-based indexer. In future, the algorithm can be extended to handle other discourse markers, which would allow it to cover the most complex cases, where cue phrases and punctuation are not applicable.


INTRODUCTION
It is commonly accepted that text has a structure that is independent of world knowledge and domain-specific knowledge [1]. Sentences in a text are closely interrelated and grouped in certain ways to form the whole text. The relations between sentences are weaker than those between words, but sentences are interpreted jointly and are meant to coexist. In the same way, relations exist between the paragraphs of a document [1][2][3][4][5].
Finding the relationship between two adjacent short paragraphs is much easier than finding the relationship between two longer units of text. This fact has led to a distinction between the global and local structures of discourse.
The theories in [6,7] support this distinction. The theory developed by Grosz and Sidner deals specifically with local discourse structure [6]. This shows the need to identify the boundaries of text segments. Segmentation is the process of identifying units of text whose sentences are strongly connected to each other. For a given document, the text segmentation problem is to identify where one region of text ends and another begins. Put simply, a text is divided into N segments, each displaying certain characteristics (i.e., a text span, a topic or an idea); for instance, with N = 5 the text is divided into segments Tg1, Tg2, Tg3, Tg4 and Tg5. Text segmentation has many application areas, such as information retrieval [8][9][10].
A text segmenter also helps in finding subtopics, which lets the user jump from one topic to another or directly to the required information. It also provides structural information about the document, which makes it possible to find the relations that exist within the document. It can also be used for effective query and analysis. Our main interest in text segmentation is to use it for indexing and, ultimately, to improve the performance of information retrieval systems. There are many ways in which text can be broken down into segments [4]. We use cue phrases and punctuation for text segmentation.
Cue phrases are words that connect two or more spans and add structure to the discourse of a text. Examples include "first", "and", "now", "accordingly", "actually", "also" and "although". Marcu created a set of more than 450 cue phrases [11][12][13]. In addition, Simon H. Corston-Oliver describes a set of linguistic cues that can be identified in a text as evidence of discourse relations [14] and that can also be used to segment the text.
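As a minimal sketch of how cue phrases might be located in a sentence, the following Python fragment scans for whole-word matches. The function name find_cue_phrases and the seven-entry list are assumptions made for illustration; the inventories cited above contain hundreds of entries, and this is not the implementation used in this work.

```python
import re

# Illustrative subset only (assumption): the cited inventories are far larger.
CUE_PHRASES = ["first", "and", "now", "accordingly", "actually", "also", "although"]

def find_cue_phrases(sentence, cues=CUE_PHRASES):
    """Return (cue, start_offset) pairs for every whole-word cue match,
    ordered by position in the sentence."""
    hits = []
    for cue in cues:
        # \b keeps "and" from matching inside words such as "band".
        pattern = r"\b" + re.escape(cue) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((cue, match.start()))
    return sorted(hits, key=lambda hit: hit[1])
```

For the sentence "Although it rained, we also went out." this returns the cues "although" and "also" with their character offsets. Matching alone does not decide whether a cue is used at the discourse level; that decision needs further context.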

NEED FOR TEXT SEGMENTATION
In order to automatically build a valid text structure for an arbitrary text, we need only determine the elementary units of that text. An accurate determination of the elementary units is therefore the most important task. Using cue phrases is one of the best ways to determine elementary units [15]. According to Litman and Hirschberg, cue phrases are words, phrases, or linguistic expressions that may directly and explicitly mark the structure of a discourse [16].
The main reason for dividing text into elementary units is to eliminate the complexity of large grammatical sentences. Finding the elementary units at an early stage has the following advantages:
* Elementary units make natural language processing easier to handle.
* Elementary units of text help to extract relations more easily.
* Elementary units improve the efficiency of the complete process, as small text units are easier to handle than their larger equivalents.
Hearst introduced the TextTiling algorithm [16]. This algorithm segments text into multi-paragraph units of coherent discourse. Because Hearst's approach segments at the paragraph level, it is not suitable for applications such as information retrieval.
Kozima proposes a "Lexical Cohesion Profile" to keep track of the semantic cohesiveness of words in a text within a fixed-length window [17]. Kozima uses a semantic network to provide knowledge about related word pairs. The network is trained automatically using language-specific knowledge; it is then applied to a window of text, and the cohesiveness of successive text windows in a document is computed to find the boundaries of the text segments.
Reynar [18] presents a graphically motivated segmentation technique called dot plotting. It uses a simplified notion of lexical cohesion and depends exclusively on word repetition to find tight regions of topic similarity.
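The word-repetition idea behind dot plotting can be illustrated with a short sketch. The function name dotplot_points is hypothetical, and the code covers only the plotting step (which positions repeat the same word), not the boundary-finding procedure of [18]; stop words are not filtered, which a practical version would likely need.

```python
def dotplot_points(words):
    """Return sorted (i, j) index pairs with i < j where words[i] and
    words[j] are the same word (case-insensitive)."""
    positions = {}
    for index, word in enumerate(words):
        positions.setdefault(word.lower(), []).append(index)
    points = []
    for indexes in positions.values():
        # Every pair of occurrences of one word contributes one dot.
        for a in range(len(indexes)):
            for b in range(a + 1, len(indexes)):
                points.append((indexes[a], indexes[b]))
    return sorted(points)
```

Dense clusters of such points near the diagonal indicate stretches of text that keep reusing the same vocabulary, which is the signal that dot plotting relies on.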
Grosz and Sidner proposed a theory of discourse structure (Grosz and Sidner's Theory, GST) in which they named linguistic textual units Discourse Segments (DS); these are used in the construction of discourse structure. Although GST provided the idea of DS, or elementary units, it never specified a particular methodology for identifying them [6].
Daniel Marcu proposed a comprehensive shallow-processing methodology to decompose a given text into elementary units, which he named text spans [4]. He used cue phrases, commas and parentheticals as the basic tools. He also elaborated on how to handle the dual behavior of cue phrases such as "and" and "or" (sentential versus discourse usage). However, he left out many other punctuation marks that can play a vital role in segmenting text into elementary units; his algorithm is not as efficient as it could be, and it does not provide an automated solution to text segmentation. In this study, we present a novel, scalable and robust text segmentation technique that addresses these deficiencies in Marcu's algorithm.
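To make the dual behavior of "and" and "or" concrete, the sketch below applies one simple heuristic: the cue is treated as a discourse-level marker only when it directly follows a comma. This rule is an assumption chosen for illustration, not the disambiguation procedure of [4] or of the algorithm proposed here, and the name discourse_uses is hypothetical.

```python
import re

AMBIGUOUS_CUES = ("and", "or")

def discourse_uses(sentence):
    """Return (cue, offset) for each 'and'/'or' that directly follows a
    comma; under the assumed heuristic these are discourse usages."""
    pattern = r",\s+(" + "|".join(AMBIGUOUS_CUES) + r")\b"
    return [(match.group(1).lower(), match.start(1))
            for match in re.finditer(pattern, sentence, flags=re.IGNORECASE)]
```

In "He bought bread and milk, and she paid." only the second "and" is reported: the first joins two nouns (sentential usage), while the second joins two clauses.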

PROPOSED ALGORITHM FOR TEXT SEGMENTATION
Accuracy and efficiency can be achieved with the following proposed algorithm, which provides the stub parts of RST for tree development. The getTextSpan procedure serves as the main entry point of the proposed solution. It takes as input the string to be processed and calls the makeParagraphArray procedure, which divides the text into paragraphs with the help of the utility procedure getParagrph. The output is then sent to makeSentenceArray, which divides the paragraphs into sentences with the help of the utility procedure getSentence. Each sentence is then analyzed by the analyseTextSpan procedure, which provides the core functionality for extracting elementary units.
Input: Unstructured text
Output: The elementary units
Processing:
Step 1:
Procedure Name: getTextSpan
Input: String to process
Output: ArrayList of elementary units
Process: This procedure takes as input the string to be processed and calls the makeParagraphArray procedure to divide the text into paragraphs. The output is then sent to makeSentenceArray, which divides the paragraphs into sentences. Each sentence is analyzed by the analyseTextSpan procedure, which provides the core functionality for extracting elementary units.
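The control flow described for getTextSpan can be sketched in Python as follows. Only the procedure names and their order come from the description above (rendered here in snake_case); the splitting rules inside each function, the four-entry cue list and the blank-line paragraph convention are assumptions made so that the example runs, and the utility procedures getParagrph and getSentence are folded into their callers. It should not be read as the C# implementation reported in this paper.

```python
import re

# Illustrative subset (assumption); the proposed algorithm handles 500+ cues.
CUES = ("although", "because", "but", "however")

def make_paragraph_array(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def make_sentence_array(paragraph):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def analyse_text_span(sentence):
    # Break at semicolons and colons, and at a comma that precedes a cue phrase.
    cue_alternatives = "|".join(CUES)
    pattern = r"[;:]\s*|,\s+(?=(?:" + cue_alternatives + r")\b)"
    parts = re.split(pattern, sentence, flags=re.IGNORECASE)
    return [unit.strip() for unit in parts if unit.strip()]

def get_text_span(text):
    # Entry point: text -> paragraphs -> sentences -> elementary units.
    units = []
    for paragraph in make_paragraph_array(text):
        for sentence in make_sentence_array(paragraph):
            units.extend(analyse_text_span(sentence))
    return units
```

Each stage visits its input once, so under these simplified rules the work grows roughly in proportion to the length of the text.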

COMPUTATIONAL EVALUATIONS
Although the complexity of the proposed solution appears to be exponential from its general mathematical representation, the analytic study shows that our algorithm is more efficient for text segmentation.

CONCLUSION
We have presented a technique for segmenting text into elementary units, which will help us to form a rhetorical tree. The text spans will be used for extracting relations. The pseudocode presented has been implemented in C#, and the results are promising. The results are being verified on different types of text, and it has been shown mathematically that the execution flow of the proposed algorithm is linear, which is more efficient than exponential algorithms.
Future work: We are currently extending our proposed weight assignment approach to consider both the keywords and the RST relationships of a collection for the purpose of indexing; we refer to this as the composite dynamic indexing technique. The output of the technique demonstrated in this paper will be used to construct an RST relation-based tree whose nodes contain text segments and relations. The next paper will demonstrate this technique. These techniques will ultimately be used for indexing in IR systems.