Malay Interrogative Knowledge Corpus

Problem statement: The number of documents written in the Malay language that are available 
on the web and intranets is growing enormously. There is a need to identify which information in these 
Malay documents contains knowledge, which in turn calls for an investigation of the availability of 
knowledge in them. Approach: This study uses interrogative theory to identify knowledge in documents 
or texts. Results: The results are expected to lead towards the establishment of a new set of 
interrogative rules for a Malay corpus. Conclusions/Recommendations: This study contributes to 
interrogative knowledge identification through the development of the Malay Interrogative Knowledge 
Corpus (MalayIK-Corpus). The corpus helps to explicitly capture Malay knowledge representation and 
make it available in a knowledge-base system.


INTRODUCTION
The Malay Interrogative Knowledge Corpus (MalayIK-Corpus) is developed because no public domain utilities or tools are available for the Malay language to codify computational grammar and to collect morphological rules or semantic or syntactic templates. There is not even a public domain parser to analyze Malay texts, nor a general computational lexicon for Malay words. Ahmad (1995) reports that the use of a dictionary for Malay words is inevitable as far as Malay documents are concerned. Unfortunately, no Malay corpus has been published yet, except a dictionary of root words which contains 22,433 entries (Ahmad, 1995; Abdullah, 2006). Therefore, the MalayIK-Corpus has to be developed by manually adapting the dictionaries. This paper first presents the development of the corpus. It then discusses stop words and the development of a stop word list for text processing, followed by results and discussion, and ends with the conclusion.

DEVELOPMENT OF THE CORPUS
The MalayIK-Corpus is a Malay language corpus in which the Malay dictionary Kamus Dewan (Dewan Bahasa Perpustakaan, 2002) and the dictionary of root words act as important secondary controls of the lexicon entries. It is derived from 6,000 word entries (about 4,000 root words and 2,000 derivations). It also refers to the dictionaries Kamus Imbuhan Bahasa Melayu (Ali et al., 1993), Kamus Dwibahasa Oxford Fajar (Hawkins, 2001), and Kamus Komprehensif Bahasa Melayu (Othman, 2005). In addition, books on the Malay language are used in preparing the grammatical information entries (Ahmad, 2001; Asraf, 2002; Latif & Rashid, 2003).
The development of the MalayIK-Corpus takes the interrogative theory of knowledge identification and representation as its background theory. The interrogative-based approach is described as the "who, when, what, where, how and why" analysis (Quigley & Debons, 1999). It distinguishes between data, information, and knowledge. The MalayIK-Corpus uses the grammatical information of a lexicon to answer the interrogative-based questions.
The "when/where/who/what" questions identify information, and the "how/why" questions identify knowledge, while grammatical information of a lexicon that answers no question identifies data. Hence, besides the root word, the most important attribute of a lexicon entry is its grammatical information, because it determines which interrogative question the entry answers.
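The distinction between data, information and knowledge described above can be sketched in a few lines of Python. This is only an illustration of the mapping stated in the text; the function name `classify` and the use of `None` for "answers no question" are assumptions made for the example, not part of the MalayIK-Corpus itself.

```python
from typing import Optional

# Mapping as stated in the text:
#   when / where / who / what -> information
#   how / why                 -> knowledge
#   no interrogative element  -> data
INFORMATION_ELEMENTS = {"when", "where", "who", "what"}
KNOWLEDGE_ELEMENTS = {"how", "why"}


def classify(interrogative_element: Optional[str]) -> str:
    """Return 'data', 'information' or 'knowledge' for one lexicon entry.

    `interrogative_element` is None when the grammatical information of
    the lexicon answers no interrogative question (hypothetical encoding).
    """
    if interrogative_element is None:
        return "data"
    element = interrogative_element.strip().lower()
    if element in KNOWLEDGE_ELEMENTS:
        return "knowledge"
    if element in INFORMATION_ELEMENTS:
        return "information"
    raise ValueError(f"unknown interrogative element: {interrogative_element!r}")


print(classify("why"))    # knowledge
print(classify("where"))  # information
print(classify(None))     # data
```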

Attributes of the Interrogative Knowledge Corpus
Microsoft Access is used as the database for the MalayIK-Corpus. It is easy to maintain and develop because the lexicon is not large, and the task consists merely of creating and updating lexicon information for the corpus. Other databases or tools, such as Oracle, SQL Server or XML, can also be used according to the needs of the task. The lexicon entries are manually inserted into the database using the standard Data Manipulation Language (DML) of the database concerned. Each entry of the MalayIK-Corpus contains the following attributes: i. root word (kata dasar); ii. lexicon (perkataan); iii. grammatical information of the lexicon entry (kata masuk); iv. interrogative element (elemen interogatif), which may be what (apa), when (bila), who (siapa), where (di mana), why (mengapa) or how (bagaimana), whichever the grammatical information of the word entry answers; and v. status, which indicates the status of the lexicon for processing purposes, including stop words. Status 1 indicates a noun (kata nama am) or adjective (adjektif), while status 2 indicates a stop word.
In order to create a general purpose corpus for Malay, the stop words of Ahmad (1995) and Abdullah (2006) are included; they cover pronouns, auxiliary verbs, adverbs, predicates, prepositions, negatives, conjunctions, relatives and determiners. The corpus is developed in the following steps: i. create attributes for the corpus; ii. extract lexicons from the document collection; iii. verify the lexicon entries with a Malay language expert; iv. insert the lexicon entries into the database; and v. extend the corpus; when words are encountered that are ambiguous or unclear in their context of answering the interrogative question, the opinion of the Malay language expert is sought.
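The entry attributes and the insert step can be sketched as follows. The study uses Microsoft Access; the sketch below substitutes an in-memory SQLite database from the Python standard library purely so that the example is self-contained. The table name `malayik_corpus`, the column names, and the sample rows (including the interrogative element assigned to 'makanan') are illustrative assumptions and are not taken from the actual corpus.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Microsoft Access (assumption).
conn = sqlite3.connect(":memory:")

# One column per attribute listed in the text; names are hypothetical.
conn.execute(
    """
    CREATE TABLE malayik_corpus (
        lexicon               TEXT PRIMARY KEY,  -- perkataan
        root_word             TEXT NOT NULL,     -- kata dasar
        grammatical_info      TEXT NOT NULL,     -- kata masuk
        interrogative_element TEXT,              -- elemen interogatif, NULL if none
        status                INTEGER NOT NULL CHECK (status IN (1, 2))
    )
    """
)

# Illustrative rows only: status 1 = noun or adjective, status 2 = stop word.
rows = [
    ("makanan", "makan", "noun", "what", 1),
    ("dan", "dan", "conjunction", None, 2),
    ("yang", "yang", "relative", None, 2),
    ("di", "di", "preposition", None, 2),
]

# Standard DML: insert the expert-verified entries.
conn.executemany("INSERT INTO malayik_corpus VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Standard DML: retrieve every lexicon flagged as a stop word.
stop_words = [
    row[0]
    for row in conn.execute(
        "SELECT lexicon FROM malayik_corpus WHERE status = 2 ORDER BY lexicon"
    )
]
print(stop_words)  # ['dan', 'di', 'yang']
```

Keeping the stop word flag in the same table as the interrogative element means that a single lookup per lexicon can serve both stop word filtering and interrogative annotation.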

STOP WORD LIST
Stop words, or stopwords, is the name given to words which are filtered out prior to, or after, the processing of text. A stop word list (stoplist) is a set or list of stop words which is typically language specific, although besides words it may contain other character sequences such as numbers and punctuation. A search engine or other natural language processing (NLP) system may contain a variety of stop lists, one per language, or it may contain a single multilingual stop list. Stop words are poor discriminators: by themselves they give no hint of, and cannot identify, document content. Hence, they are eliminated from the set of index terms (van Rijsbergen, 1979) in a search engine or document retrieval system. Salton and McGill (1983) report that such words comprise about 40% to 50% of the text words in a collection of documents. There is no definitive list of stop words that all natural language processing tools incorporate, and not all NLP tools use a stop list. Some tools specifically avoid using a stop list in order to support phrase searching.
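As a minimal sketch of stop word elimination, the function below drops every token that appears in a stop list before indexing. The three stop words are the function words named in this paper; the function name `remove_stop_words` and the sample token list are assumptions made for illustration.

```python
# A tiny stop list using function words mentioned in this paper.
STOP_WORDS = {"dan", "yang", "di"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["makanan", "halal", "dan", "pertanian"]
print(remove_stop_words(tokens))  # ['makanan', 'halal', 'pertanian']
```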

Development of a Stop Word List
A list of stop words is included in the development of the MalayIK-Corpus in order to eliminate words which carry no content value. The development of the stop word list in the MalayIK-Corpus adopts the approach used by van Rijsbergen (1979), with the same aim of identifying words of no value. The approach combines manual selection with statistical counting of word frequencies. The statistical method finds words with a high or a very low number of occurrences, which are taken as stop words. A total of 6,479 words are extracted from the test collection of unstructured Malay documents. The extracted words are ranked by frequency of occurrence in decreasing order. Table 2 shows that the most frequent lexicons in the test collection are the conjunction 'dan' (and), the relative 'yang' (which), and the preposition 'di' (at). These words are listed as stop words by Ahmad (1995) and Abdullah (2006). This shows that they are function words which commonly appear in any text document. Abdullah (2006) reports that including these words in the list of stop words is consistent with the fact that they do not contribute to the content of the collection, because they would mark the whole collection as relevant to a query. Accordingly, these words need to be eliminated in order to build up the knowledge representation. However, the stop word list is not applied when constructing phrases and when identifying the interrogative elements of when, where, why and how.
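The statistical part of this approach, ranking the extracted words by decreasing frequency and combining the most frequent ones with a manually selected list, can be sketched as below. The cut-off values `top_n` and `max_rare_count` are hypothetical parameters; the paper does not state the thresholds that were actually used.

```python
from collections import Counter


def rank_by_frequency(tokens):
    """Rank words by decreasing frequency; ties are broken alphabetically."""
    counts = Counter(token.lower() for token in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def candidate_stop_words(tokens, top_n, manual_list=(), max_rare_count=0):
    """Combine statistical candidates with a manually selected list.

    top_n          -- number of most frequent words to take (assumed cut-off)
    max_rare_count -- words occurring at most this often are also taken;
                      0 disables the low-frequency rule (assumed cut-off)
    """
    ranked = rank_by_frequency(tokens)
    frequent = {word for word, _ in ranked[:top_n]}
    rare = {word for word, count in ranked if count <= max_rare_count}
    return frequent | rare | set(manual_list)


tokens = "dan yang di dan yang dan makanan".split()
print(rank_by_frequency(tokens)[:3])  # [('dan', 3), ('yang', 2), ('di', 1)]
print(sorted(candidate_stop_words(tokens, top_n=2, manual_list=["di"])))
# ['dan', 'di', 'yang']
```

Candidates produced this way would still need manual review, since content-bearing words can also rank highly in a small or domain-specific collection.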

Foundation of the Stop Word List
The stop word list created by Ahmad (1995) contains 314 entries, and Abdullah (2006) adds a further 20 entries. This makes a total of 334 entries of Malay stop words originating from Quranic documents. It is interesting to note that content-bearing words, i.e., 'pertanian' (agriculture), 'halal' (lawful), and 'makanan' (food), also appear in Table 2. Their high positions derive from the fact that the lengthiest documents in the test collection come from newspapers which report mainly on the domain of agriculture.

RESULTS & DISCUSSION
The Interrogative Knowledge Identification Framework is used to address the need for a mechanism to identify knowledge in unstructured documents. It uses lexicon interrogative analysis to identify and extract knowledge in each of the complete sentences written in a document. It is also used to extract interrogative lexical constructs from each individual unstructured document. Each lexicon is analyzed against the lexicon interrogative analysis matching rules of the MalayIK-Corpus using the standard DML. The DML is used to analyze and check the lexicon and, if it exists in the corpus, to insert it into the interrogative annotation as an interrogative lexical construct. Any new lexicon that is analyzed is inserted and defined in the MalayIK-Corpus first.
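A simplified sketch of this lexicon matching step is given below. It replaces the database and DML with a plain Python dictionary, and the two corpus entries ('kerana', because, answering why; 'semalam', yesterday, answering when) are illustrative assumptions rather than entries quoted from the MalayIK-Corpus. The function name `annotate` is likewise hypothetical.

```python
def annotate(tokens, corpus):
    """Match each lexicon against the corpus and collect interrogative constructs.

    `corpus` maps a lexicon to its interrogative element, or to None when
    the lexicon answers no interrogative question. A lexicon that is not
    yet in the corpus is recorded with None so that it can be defined
    later, for example after verification by a language expert.
    """
    annotations = []
    for token in tokens:
        word = token.lower()
        if word not in corpus:
            corpus[word] = None  # new lexicon: stored, not yet defined
        element = corpus[word]
        if element is not None:
            annotations.append((word, element))
    return annotations


# Illustrative entries only (assumed, not from the actual corpus).
corpus = {"kerana": "why", "semalam": "when"}

print(annotate(["hujan", "semalam"], corpus))  # [('semalam', 'when')]
print("hujan" in corpus)                       # True
```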
The sample used in this experiment, 15% of the 42,733 words from the MalayIK-Corpus, is sufficient and justified to produce better results in extracting identified knowledge. It is larger than the size suggested by Gay and Airasian (2003, page 113), who state that for a population of more than 5,000 units a sample size of 400 (8%) should be adequate. The results are measured as percentages using the quantitative retrieval performance metrics of recall and precision (Baeza-Yates & Ribeiro-Neto, 1999), coupled with research methods and concepts in information system research. The accuracy of the knowledge extracted is measured by precision (the fraction of the retrieved knowledge that is relevant) and recall (the fraction of the relevant knowledge that is retrieved). The results are compared with an expert evaluation: the Malay document collection is given to the expert, who identifies interrogatively the knowledge that resides in the collection.
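The two metrics can be computed as in the following sketch, where the items retrieved by the system are compared with the items judged relevant by the expert. The identifiers `s1` to `s5` are made-up labels for the example and do not correspond to any result reported in this study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    An empty denominator yields 0.0 (a convention chosen for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four items extracted by the system,
# three items marked as knowledge by the expert, two in common.
system_output = {"s1", "s2", "s3", "s4"}
expert_judgement = {"s2", "s3", "s5"}

precision, recall = precision_recall(system_output, expert_judgement)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# precision = 50%, recall = 67%
```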
The results of the experiments, in the form of precision and recall tables, are explained in detail in Sidi (2008). The interrogative element of why shows significant accuracy in identifying knowledge. Unfortunately, the same is not true for the interrogative element of how. Both of these interrogative elements are used to identify knowledge within the text of unstructured documents. Moreover, the analysis of the results also confirms significant accuracy in identifying and extracting information for the interrogative elements of what and who. Unfortunately, the accuracy differences are not significant for the interrogative elements of where and when. The differences in performance are possibly caused by the varying quality of the formats and writing styles in the Malay document collection used.

CONCLUSION
This paper presents the development of the MalayIK-Corpus to identify knowledge in documents. The corpus helps to identify and explicitly capture Malay knowledge representation and make it available in a knowledge-base system. This can potentially make the knowledge in documents more sharable and reusable among the community. However, the system interface of the MalayIK-Corpus is not easy to navigate, and the creation of the Malay corpus is not fully automated.