Feature Selection for High Dimensional Data: An Evolutionary Filter Approach

Problem statement: Feature selection is a task of crucial importance for the application of machine learning in various domains. The recent increase in data dimensionality, however, poses a severe challenge to the efficiency and effectiveness of many existing feature selection approaches. For example, the genetic algorithm is an effective search algorithm that lends itself directly to feature selection, yet its direct application is hindered by high dimensionality. Adapting the genetic algorithm to cope with high dimensional data is therefore increasingly appealing. Approach: In this study, we propose an adapted version of the genetic algorithm that can be applied to feature selection in high dimensional data. The proposed approach is based on a variable length representation scheme and a set of modified and newly proposed genetic operators. To assess its effectiveness, we applied it to cue phrase selection and compared its performance with a number of ranking approaches that are commonly applied to this task. Results and Conclusion: The results provide experimental evidence of the effectiveness of the proposed approach for feature selection in high dimensional data.


INTRODUCTION
Machine Learning (ML) is a rapidly expanding field with many applications in diverse areas such as natural language processing (Marquez, 2000), bioinformatics (Baldi and Brunak, 2001) and image processing (Sajn and Kukar, 2010; Lee et al., 2010). It provides tools by which large quantities of data can be automatically analyzed. Fundamental to ML is feature selection, also called dimensionality reduction, which identifies the most salient features, so that the ML algorithm focuses on the aspects of the data most useful for analysis and future prediction. A feature selection algorithm repeatedly selects a subset of the original features, called a candidate subset, and measures the optimality of that subset using an evaluation function. In doing so, feature selection reduces data dimensionality, removes irrelevant data, increases learning accuracy and improves result comprehensibility (Blum and Langley, 1997; Dash and Liu, 1997; Kohavi and John, 1997).
Technically speaking, a feature selection algorithm consists of four basic processes, shown in Fig. 1: subset generation, subset evaluation, stopping criterion and result validation (Liu and Yu, 2005). These steps are performed by three core components, namely the search algorithm, the evaluation function and the performance analyzer (Dash and Liu, 1997). Subset generation is performed by the search technique (Blum and Langley, 1997), which repeatedly generates a candidate feature subset and evaluates it using the evaluation function until a given stopping criterion is met. The best selected subset is then usually validated by the performance analyzer, typically by applying the ML algorithm to new instances of data using the selected features. Feature selection is considered successful if the dimensionality of the data is reduced and the performance of the ML algorithm improves or remains unaffected. Feature selection has been a fertile field of research and development since the 1970s in statistical pattern recognition (Ben-Bassat, 1982; Siedlecki and Sklansky, 1988) and ML (Blum and Langley, 1997; Kohavi and John, 1997; John et al., 1994). As a result, various feature selection approaches have accumulated over the years. To better understand the inner workings of each approach and the commonalities and differences between them, several taxonomies have emerged, such as those proposed by Dash and Liu (1997), Liu and Yu (2005) and Saeys et al. (2007). This study adopts the taxonomy depicted in Fig. 2, which has gained a wide consensus among researchers.
At the top level, the taxonomy splits the feature selection approaches into two categories, namely embedded and disembodied, based on whether the feature selection process is incorporated into the ML process of model construction or performed separately. The embedded feature selection approaches (Lal et al., 2006) search for the optimal subset of features during model construction and can be viewed as a search in the combined space of features and models. Embedded approaches are thus specific to a given ML algorithm and consequently have the advantage of including the interaction with the constructed model, while remaining computationally feasible. Conversely, the disembodied approaches perform feature selection as a separate process before the application of the ML algorithm.
At the second level, the disembodied feature selection approaches are classified as filter, wrapper or hybrid, based on the type of evaluation function. Wrapper approaches use the ML algorithm itself to evaluate the goodness of a feature subset. The rationale is that the ML algorithm that will ultimately use the feature subset should provide a better estimate of its goodness (Langley, 1994). The advantages of wrapper approaches are that their selection is aware of the inductive bias of the ML algorithm and that they can take into account dependencies between the selected features. Their common drawbacks are a higher risk of overfitting and a high computational cost. Filter approaches, on the other hand, assess the goodness of a feature subset using an independent measure that is based on the intrinsic properties of the data rather than on the ML algorithm. In a subsequent stage, the selected subset of features is presented as input to the ML algorithm. A wide variety of measures have been used as evaluation functions for filter approaches. To cite but a few: consistency driven measures (Almuallim and Dietterich, 1991), information theoretic measures (Ben-Bassat, 1982), dependency measures (Hall, 2000), consistency measures (Liu and Motoda, 1998) and even another ML algorithm used as a filter (Hall, 1999). The main advantages of filter approaches are that they scale easily to very high dimensional data and are computationally simple and fast. A common drawback of filter methods is that they ignore the interaction with the constructed model. Hybrid approaches were devised to alleviate the time complexity of wrapper approaches by hybridizing them with filter approaches. They combine filter and wrapper approaches to achieve the best possible performance with a particular ML algorithm (Xing et al., 2001; Das, 2001).
As depicted in Fig. 2, the filter approaches themselves are further partitioned into two groups, namely ranking approaches and subset search approaches, based on whether they evaluate the goodness of features individually or through feature subsets. Ranking approaches assign weights to features individually based on their informativeness to the target concepts. A well known example of ranking approaches is Relief (Kira and Rendell, 1992). The main drawback of ranking approaches is that they can only capture the relevance of features to the target concepts, but cannot discover dependencies between them. Conversely, subset search approaches, such as FOCUS (Almuallim and Dietterich, 1991), employ a search algorithm to search through candidate feature subsets (Dash and Liu, 2003). The search is guided by a certain evaluation measure that captures the goodness of each subset, and ultimately an optimal (or near optimal) subset is selected when the search stops (Liu and Motoda, 1998). Unlike the ranking approaches, subset search approaches evaluate features as a whole and therefore take into account the dependencies between the selected features.
In spite of the vast body of feature selection approaches, the incessant increase of data dimensions (number of features) and data size (number of instances) poses severe challenges with respect to their efficiency and effectiveness (Liu and Yu, 2005; Zheng and Zhang, 2007). One of these challenges, which is the focus of this study, is the so-called curse of dimensionality (Hastie et al., 2001). Classically, the dimensionality is considered low if the number of features is in the tens and high if the number is in the range 100-500 (Moser and Murty, 2000). However, in recent applications such as natural language processing, genome analysis and astronomy, the dimensionality of the data can reach thousands or even tens of thousands. Such high dimensional data causes a major problem for feature selection approaches, as most of them have quadratic or higher time complexity in the data dimensionality, which consequently affects their efficiency. In view of the above taxonomy, it is not hard to conclude that wrapper approaches are impractical for such data, due to the time complexity of using the ML algorithm as the evaluation function and the complexity of searching the huge search space. The embedded approaches, on the other hand, are specific to particular ML algorithms, although their time complexity is far lower than that of wrapper approaches. It is, therefore, commonly accepted that filter approaches are preferred for feature selection in high dimensional data due to their computational efficiency (Liu and Yu, 2005; Zheng and Zhang, 2007; Duch, 2006). Examples of studies that use the filter approach for feature selection in high dimensional domains include Biesiada and Duch (2005), Bins and Draper (2001), Yu and Liu (2003), Li et al. (2004) and Guo et al. (2008).
Within the filter approaches, the subset search approaches are more effective than the ranking approaches, because the ranking approaches cannot account for the dependencies between the selected features.
Although subset search approaches appear to be the most suitable for feature selection in high dimensional data, their scalability is affected drastically as the dimensionality of the data becomes high. For these approaches to cope with high dimensional data, either some simplifying assumptions must be adopted or an adapted version of the technique must be developed. Genetic Algorithm (GA) is a striking example of a subset search approach that has been applied successfully to feature selection in various contexts (Liu et al., 1995; Ozdemir et al., 2001; Zhang and Hu, 2005; Lanzi, 1997), due to its advantages over many other search approaches when search spaces are highly modal, discontinuous or highly constrained (Zhu et al., 2006). Despite the success of GA, the increase in data dimensionality poses a challenge to its straightforward application and consequently to its efficiency. Some attempts to address this challenge introduce a simplifying assumption with respect to the number of features that must be selected (Liu and Yu, 2005; Hong and Cho, 2006; Sanchez-Ferrero and Arribas, 2007; Lecocke and Hess, 2007). Such an assumption is questionable, as the number of features that must be selected generally cannot be known a priori.
In this study, instead of assuming that the number of selected features is known a priori, an adapted version of the genetic algorithm for feature selection in high dimensional data is developed. The adapted version exploits a variable length representation scheme, and is hence called the Variable Length Genetic Algorithm (VLGA), and makes use of a set of genetic operators to manipulate the variable length chromosomes.
Genetic algorithm for feature selection: GA is a biologically inspired search algorithm, which is loosely based on molecular genetics and natural selection. The basic principles of GA were stated by Holland (1975). Since then, GA has been reviewed in a number of works (Goldberg, 1989; Haupt and Haupt, 2004; Mitchell, 1996; Vose, 1999). In the standard GA, the candidate solutions are described as bit strings (referred to as chromosomes) whose interpretation depends on the application. The search for an optimal solution begins with a random population of initial solutions. The chromosomes of the current population are evaluated relative to a given measure of fitness, with the fit chromosomes selected probabilistically as seeds for the next population, which is produced by means of genetic operations such as random mutation and crossover. In general, the standard GA consists of the following components.
Population: It consists of a predefined number of chromosomes, in which each chromosome represents a potential solution of a given problem.
Fitness function: It is the driving force of the evolution in GA. The fitness function returns a numerical value, which is supposed to be proportional to the utility or the ability of the potential solution that chromosome represents.
Selection scheme: It is used to select a chromosome in the population for genetic operations. It is based on the survival-of-the-fittest strategy.
Genetic operators: They are the basis of the GA evolution. They recombine the chromosomes of the current population to produce a new population. Conventionally, three operators are implemented: reproduction, crossover and mutation. In the reproduction operator, a chromosome is randomly selected from the current generation based on its fitness and then copied without any change into the next generation. The crossover operator probabilistically selects two chromosomes from the current population based on their fitness values and then recombines them to generate offspring. The mutation operator guards the population against permanent fixation by randomly flipping the bit value of a selected chromosome at a randomly selected position.
Stopping criteria: They decide when to terminate the run of the GA.
Control parameters: They are the probability values that control the execution of the GA.
The general procedure of the standard GA is given in Fig. 3.
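As a rough illustration of how these components fit together, the following minimal Python sketch evolves fixed length bit strings. It is not the procedure of Fig. 3: the function name `standard_ga`, the roulette-wheel selection, the one-point crossover and all default parameter values are assumptions chosen for brevity.

```python
import random

def standard_ga(fitness, n_bits, pop_size=20, generations=30,
                p_crossover=0.8, p_mutation=0.02, seed=0):
    """Evolve fixed length bit strings and return the best chromosome seen.

    Assumes n_bits >= 2 and a non-negative fitness function.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select():
        # Fitness-proportional (roulette-wheel) selection, an assumed scheme.
        weights = [fitness(c) + 1e-9 for c in population]
        return rng.choices(population, weights=weights, k=1)[0]

    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            if rng.random() < p_crossover:
                a, b = select(), select()
                cut = rng.randrange(1, n_bits)   # one-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = list(select())           # reproduction: plain copy
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < p_mutation else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        best = max(population + [best], key=fitness)
    return best
```

For example, `standard_ga(sum, 12)` searches for a 12-bit string with as many ones as possible; a fixed number of generations serves as the stopping criterion here.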
The seminal work on using GA for feature selection goes back to Siedlecki and Sklansky (1988). Since then, there have been numerous works on using GA for feature selection in various contexts, in wrapper (Liu and Yu, 2005), filter or hybrid mode. As previously mentioned, in the wrapper mode the ML algorithm is used as the evaluation function; a brief review of the works on GA in the wrapper mode can therefore be organized by the employed ML algorithm. K-nearest neighbor was the first ML algorithm employed as a GA fitness function, in the seminal work of Siedlecki and Sklansky (1988). Kelly and Davis (1991) used GA with k-nearest neighbor to find a vector of feature weightings that reduces the effects of irrelevant or misleading features. Similarly, GA was combined with k-nearest neighbor to find an optimal feature weighting for a classification task (Punch et al., 1993). This approach has proven especially useful with large data sets, where standard feature selection techniques are computationally expensive. GA with a fitness function based on the classification accuracy of k-nearest neighbor and the complexity of the feature subset was used to improve the performance of an image annotation system (Lu et al., 2008). Li et al. (2001) combined GA and k-nearest neighbor to select features (genes) that can jointly discriminate between different classes of samples (e.g., normal versus tumor). This approach is capable of selecting a subset of predictive genes from large noisy data for sample classification.
Artificial neural networks were employed as the GA fitness function for feature selection in several works. For example, GA was combined with neural networks for feature selection in pattern classification and knowledge discovery (Yang and Honavar, 1998). It was also used with neural networks to select features for defect classification of wood boards (Caballero and Estevez, 1998). Hong and Cho (2006) proposed GA with a neural network to select a feature subset that yields high classification accuracy. Similarly, GA with a neural network was proposed for feature selection in the classification of different types of small breast abnormalities (Zhang et al., 2004). Another ML algorithm that has been employed as a GA fitness function is the support vector machine. For example, Eads et al. (2002) used GA with a support vector machine for feature selection in time series classification. Frohlich (2004) investigated GA with a support vector machine and compared it with other existing algorithms for feature selection. Also, Morariu et al. (2006) presented GA with a fitness function based on the support vector machine for feature selection, which has proven to be efficient for nonlinearly separable input data. For the classification of hyper-spectral data, GA with a support vector machine was proposed by Zhuo et al. (2008). Yu and Cho (2003) proposed a feature selection approach for keystroke dynamics identity verification, in which GA was employed to implement a randomized search and a support vector machine was employed as the base learner.
GA with decision trees (e.g., ID3, C4.5) was explored by Vafaie and De Jong (1992; 1995) to find the best feature set to be used by the induction system on difficult texture classification problems. William (2004) designed a generic fitness function for the validation of input specifications and then used it to develop a GA wrapper for feature selection for decision tree inducers. The effectiveness of GA for feature selection in the automatic text summarization task was investigated by Silla et al. (2004), where the decision tree was used as the fitness function.
In the filter mode, as a subset search approach, GA is more computationally attractive than in the wrapper mode. This is because the computational time of GA tends to be high and, in the wrapper mode, the ML algorithm must be run every time a chromosome in the GA population is evaluated, which makes the wrapper combination inefficient. Examples of using GA in the filter mode include Liu et al. (1995), in which mutual information between classes and features was used as the evaluation function. Based on experimental results on handwritten digit recognition, this method reduces the number of features needed in the recognition process without significantly impairing the performance of the classifier. Ozdemir et al. (2001) used GA for feature selection by minimizing a cost function derived from the correlation matrix between the features and the activity of interest that is being modeled. In this work, from a dataset with 160 features, GA selected a feature subset (40 features) that built a better predictive model than the full feature set. Another example is the work of Zhang and Hu (2005), in which GA was used with Mutual Information (MI) to evolve a near optimal input feature subset for neural networks. A fast filter GA approach for feature selection, which improves on previous results presented in the feature selection literature, was described by Lanzi (1997).
As a hybrid approach for feature selection (Shahla et al., 2009), GA was investigated by Cantu-Paz (2004), in which GA and a method based on class separability were applied to the selection of feature subsets for classification problems. This approach is able to find compact feature subsets that give the most accurate results, while beating the execution time of some wrappers. A feature selection approach named ReliefF-GA-Wrapper was proposed by Zhang et al. (2003) to combine the advantages of filter and wrapper. In this approach, the original features are evaluated by the ReliefF filter approach and the resulting estimate is embedded into the GA, which searches for the optimal feature subset using the training accuracy of the ML algorithm on a handwritten Chinese character dataset. Additionally, Fatourechi et al. (2007) proposed a two stage feature selection. The first stage employs mutual information to filter out the least discriminative features, resulting in a reduced feature space. A GA is then applied to the reduced feature space to further reduce its dimensionality and select the best set of features.
In addition to the aforementioned applications, GA continues to attract researchers who combine it with other techniques to improve the efficiency of feature selection in various ways (Shahla et al., 2009; Yang et al., 2011). Gheyas and Smith (2010) proposed an improved version of GA that tackles the feature selection problem.
An important aspect of the previous applications of GA for feature selection is that they use the standard GA with a fixed length binary representation scheme, in which each chromosome of the population represents a feature subset. For an n-dimensional feature space, each chromosome is encoded by an n-bit binary string b1 ... bn, where bi = 1 if the ith feature is present in the feature subset represented by the chromosome and bi = 0 otherwise. Figure 4 is a hypothetical chromosome represented using the fixed length binary representation scheme of the standard GA.
The advantage of this representation is that the standard GA can be used straightforwardly without any modification. Unfortunately, the fixed length binary representation is appropriate only if the dimension of the data is not high. As the dimension of the data becomes huge, the chromosome becomes very long and the evolution of GA becomes inefficient (Arbor et al., 2006). The situation worsens when only a small number of these features is needed. There have been several attempts to tackle this problem and apply GA to feature selection in high dimensional data (Liu and Yu, 2005; Sanchez-Ferrero and Arribas, 2007; Lecocke and Hess, 2007; Arbor et al., 2006; Silla et al., 2004). These attempts are based on the simplifying assumption of pre-specifying the number of features that must be selected.
Accordingly, the chromosome encodes the indices of the selected features, rather than the presence or absence of each feature. Figure 5 depicts the chromosome representation adopted by these works.
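The difference between the two encodings can be made concrete with a small Python sketch. The helper names below are hypothetical, and features are numbered from 1 to match the b1 ... bn notation.

```python
def mask_to_indices(bits):
    """Fixed length binary chromosome -> list of selected feature indices."""
    return [i + 1 for i, b in enumerate(bits) if b == 1]

def indices_to_mask(indices, n_features):
    """Index chromosome -> fixed length binary chromosome of n_features bits."""
    chosen = set(indices)
    return [1 if i in chosen else 0 for i in range(1, n_features + 1)]

# A 5-feature space in which features 2, 3 and 5 are selected.
binary_chromosome = [0, 1, 1, 0, 1]
index_chromosome = mask_to_indices(binary_chromosome)   # [2, 3, 5]
```

The binary chromosome always has n genes, whereas the index chromosome has as many genes as selected features, which is what keeps it short when n is very large.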
Although the above representation facilitates the application of the standard GA, the assumption of a pre-specified number of features is questionable and requires prior knowledge about the domain to estimate the number of features that must be selected. Alternatively, this study presents an efficient solution to the problem that exploits the variable length representation scheme of GA and encodes each chromosome as the selected subset of features. However, using a variable length representation calls for modifying the genetic operators or devising new operators to cope with the new representation scheme. In the following sections, the elements of the VLGA developed for feature selection are described in detail.

MATERIALS AND METHODS
VLGA for feature selection in high dimensional data: We describe the proposed approach for feature selection in high dimensional data. The proposed approach is essentially a variable length GA developed specifically for this task. Before diving into the details of the proposed approach, it is worth mentioning that the idea of using a variable length representation in the context of evolutionary algorithms is as old as the algorithms themselves. Fogel and Walsh (1966) seem to be among the first to experiment with the variable length representation. In their work, they evolved finite state machines with a varying number of states, making use of operators such as addition and deletion. Holland (1975) proposed the concepts of gene duplication and gene deletion in order to raise the computational power of evolutionary algorithms. Smith (1980) departed from the early fixed length character strings by introducing variable length strings, including strings whose elements were if-then rules rather than single characters. Since these first attempts, many researchers have made use of the idea with different motivations, such as engineering applications (Davidor, 1991a; 1991b) or raising the computational power of evolutionary algorithms (Schtz, 1997).
With regard to GA as an evolutionary algorithm, the use of variable length representation has been proposed in several versions. Well known versions are messy GA (Goldberg et al., 1990), genetic programming (Koza, 1992) and species adaptation GA (Harvey, 1995). These versions differ in the specification of the representation scheme and consequently in the genetic operators. Messy GA uses a binary representation in which each gene is represented by a pair of numbers, the gene position and the gene value. Messy GA uses the mutation operator as in the standard GA. Instead of crossover, it uses the splice and cut operators. The splice operator joins one chromosome to the end of another. The cut operator splits one chromosome into two smaller chromosomes. Genetic programming is an extension of GA with a variable length representation scheme in the form of a hierarchical tree representing a computer program. The aim of genetic programming is to find the best tree (computer program) that solves a given problem. It adapts the genetic operators of the standard GA to cope with the tree representation scheme. Species adaptation GA uses a variable length binary representation scheme. It differs from the standard GA subtly but significantly: evolution is directed by selection exploiting differences in fitness caused by variations in the genetic makeup of the population. While mutation in the standard GA and in genetic programming is considered a background operator and crossover is usually assumed to be the primary operator, in species adaptation GA the reverse is true: mutation is primary and crossover, though useful, is secondary. Besides that, researchers may opt to develop a domain-specific version of variable length GA to better meet the requirements of the domain, rather than using an existing one. For example, Zebulum et al. (2000) investigated the application of GA in the field of evolutionary electronics, in which a special variable length GA was proposed to cope with the main issues of variable length evolutionary systems. Following this trend, in this study, a special version of GA for feature selection in domains with very high dimensional data is developed as described below.
Representation scheme: The proposed VLGA approach uses a variable length non-binary representation scheme, in which each chromosome represents the selected subset of features. It is a direct representation scheme with no encoding or decoding process to map between the genotype and phenotype levels. Figure 6 is a hypothetical chromosome represented using this scheme. An interesting aspect of this representation scheme is that it is position independent, meaning that the position of a gene plays no role in determining the aspects of the chromosome at the phenotype level.
Feature space mask: Technically speaking, GA explores the promising points in the search space via genetic operations; therefore, the representation scheme and the genetic operators should give rise to an effective exploration of the search space. Using the proposed representation scheme directly does not help the genetic operators to explore new points in the search space. Therefore, to ensure a good exploration of the feature space, we propose a feature space mask. The feature space mask is a binary string with length equal to the size of the feature space, in which each bit marks the status of a single feature. The value 1 indicates that the feature is being used by the current population (active) and the value 0 indicates that the feature is not in use (inactive). Figure 7 shows the feature space mask schematically. It shows that the features f2, f3 and fn are participating in the current GA population, whereas the features f1, fn-2 and fn-1 are not. As will be described, the status of a feature in the feature space mask is updated either immediately after performing a genetic operator or through a rebuilding step of the feature space mask, which takes place during the transition from generation t to generation t + 1.
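The rebuilding step can be illustrated with a minimal Python sketch. It assumes that chromosomes are lists of 1-based feature indices; the function names are hypothetical.

```python
def build_feature_space_mask(population, n_features):
    """Rebuild the mask: bit i-1 is 1 if feature i occurs in any chromosome."""
    mask = [0] * n_features
    for chromosome in population:
        for feature in chromosome:
            mask[feature - 1] = 1
    return mask

def inactive_features(mask):
    """Features (1-based) that no chromosome of the population uses."""
    return [i + 1 for i, bit in enumerate(mask) if bit == 0]

# Two variable length chromosomes over a 6-feature space.
population = [[2, 3], [3, 6]]
mask = build_feature_space_mask(population, 6)   # [0, 1, 1, 0, 0, 1]
unused = inactive_features(mask)                 # [1, 4, 5]
```

Operators that need to introduce new material can then draw from `unused` rather than from the whole feature space.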

Fitness function:
The fitness function is the driving force for the evolution in GA. For feature selection, it is the evaluation function that evaluates a candidate subset of features. As the aim of feature selection is to find a minimum number of features with maximum informativeness, the fitness function of a subset p combines two measures, namely subset informativeness and subset complexity. In the fitness formula, Info(p) denotes the estimated informativeness of the feature subset p and L(p) is a measure of the complexity of the feature subset, usually the number of utilized features. Furthermore, N is the feature space cardinality and pf is a punishment factor that weighs the multiple objectives of the fitness function. The number of features used by a subset is intended to lead the algorithm to regions of small complexity.
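To show how the named quantities Info(p), L(p), N and pf can interact, the sketch below assumes that complexity enters as a linear penalty, Info(p) - pf * L(p) / N. This form, the additive toy `info` function and its scores are illustrative assumptions only and should not be read as the formula used in this study.

```python
def fitness(subset, info, n_features, pf=0.5):
    """Assumed form: informativeness minus a length penalty scaled by pf."""
    complexity = len(subset) / n_features    # L(p) / N
    return info(subset) - pf * complexity

# Hypothetical per-feature scores; a real Info(p) would be estimated from data.
scores = {1: 0.5, 2: 0.25, 3: 0.25, 4: 0.0}

def info(subset):
    return sum(scores[f] for f in subset)

small = fitness([1, 2], info, n_features=4)        # 0.75 - 0.5 * 2/4 = 0.5
large = fitness([1, 2, 3, 4], info, n_features=4)  # 1.0  - 0.5 * 4/4 = 0.5
```

Under this assumed form, a larger pf shifts the balance toward shorter subsets, while pf = 0 ignores complexity altogether.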

VLGA selection scheme:
The selection scheme of the proposed VLGA approach for feature selection is (k, q) tournament selection. It randomly chooses k chromosomes from the current population and, with a certain probability q, returns the best of them; otherwise it returns the worst.
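A minimal Python sketch of this selection scheme follows. The function name and the `rng` argument are assumptions, and sampling without replacement is one possible reading of "randomly chooses k chromosomes".

```python
import random

def tournament_select(population, fitness, k, q, rng=random):
    """(k, q) tournament: best of k with probability q, otherwise the worst."""
    contestants = rng.sample(population, k)   # k distinct population members
    ranked = sorted(contestants, key=fitness)
    return ranked[-1] if rng.random() < q else ranked[0]

population = [[1], [1, 2], [1, 2, 3]]
# With q = 1.0 the fittest contestant always wins (fitness = length here).
winner = tournament_select(population, len, k=3, q=1.0)
```

Values of q below 1 occasionally let weak chromosomes through, which lowers the selection pressure.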

VLGA genetic operators:
The proposed VLGA approach makes use of three genetic operators modified from the standard GA to cope with the variable length representation scheme. Furthermore, it introduces a new operator called AlterLength.

Reproduction:
The reproduction operator of the proposed VLGA is similar to the reproduction operator of standard GA. With the reproduction probability Pr, a chromosome is randomly selected from the current generation and then copied into the new generation without any modification.
Crossover: To cope with the variable length representation of the proposed VLGA approach, the uniform crossover of the standard GA has been adapted. The uniform crossover (Mitchell, 1996) is an operator that decides with a probability which parent will contribute to each of the gene values in the offspring chromosomes.
For the proposed VLGA, the uniform crossover has been modified as follows. First, two parents (chromosomes) are selected from the current population. Then, with probability 0.5, the length of the offspring is chosen to be the length of either the short or the long parent. If the length of the short parent is chosen, a uniform crossover is performed between the short parent and an equal length segment of the long parent. If the length of the long parent is chosen, the same uniform crossover is performed and the remaining parts of the long parent are appended to the beginning and the end of the offspring. Figure 8 shows the VLGA uniform crossover schematically.
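The modified crossover can be sketched in Python as follows. The sketch assumes that the equal length segment is taken at a random position of the long parent and that each gene comes from either parent with probability 0.5; the handling of duplicated features in the offspring is not specified above, so duplicates are left in place here.

```python
import random

def vlga_uniform_crossover(parent_a, parent_b, rng=random):
    """Return one offspring with the length of the short or the long parent."""
    short, long_ = sorted((parent_a, parent_b), key=len)
    # Assumed: the aligned segment starts at a random offset in the long parent.
    start = rng.randint(0, len(long_) - len(short))
    segment = long_[start:start + len(short)]
    # Uniform crossover between the short parent and the aligned segment.
    core = [s if rng.random() < 0.5 else l for s, l in zip(short, segment)]
    if rng.random() < 0.5:
        return core                           # short parent's length
    # Long parent's length: re-attach the parts outside the segment.
    return long_[:start] + core + long_[start + len(short):]

child = vlga_uniform_crossover([1, 2], [3, 4, 5, 6, 7], random.Random(0))
```

The offspring has either 2 or 5 genes in this example, and every gene is inherited from one of the two parents.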

Mutation:
The proposed mutation operator replaces the values of some genes with new values from the feature space that are not participating in the current population. The mutation operator is applied with probability Pm to each chromosome generated by the crossover operation and is performed with the assistance of the feature space mask. More specifically, each gene in the chromosome that is selected for mutation is replaced by a randomly selected feature whose status in the feature space mask is inactive, and the status of the selected feature in the mask is set to active immediately. The status of the mutated feature (gene), in contrast, is not set to inactive immediately, because this feature is still in use by the parents (members of the current GA population). It is set to inactive only after all genetic operations on the current population are completed and a rebuilding step for the feature space mask is performed. Figure 9 shows the mutation operator schematically.
Alter length: The crossover and mutation operators are designed specifically to introduce variation to the content of the chromosome. To introduce variation to the length of the chromosome, the alter length operator is proposed. The alter length operator randomly expands (shrinks) the chromosome by inserting (deleting) a single feature into (from) the chromosome. In the case of insertion, the inserted feature is randomly selected from the inactive features in the feature space. In the case of deletion, the selected feature is deleted from the chromosome and its status in the feature space mask remains active until the rebuilding step of the feature space mask. The AlterLength operator is performed with probability Pal, as shown in Fig. 10.
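Both mask-assisted operators can be sketched in Python as follows. The sketch assumes 1-based feature indices, a hypothetical per-gene selection probability `p_gene`, and an equal chance of growing or shrinking in the alter length operator; none of these values are specified above. Applying the operators with probabilities Pm and Pal is left to the caller.

```python
import random

def mutate(chromosome, mask, p_gene, rng=random):
    """Replace selected genes with inactive features and activate the new ones."""
    mutated = list(chromosome)
    for i in range(len(mutated)):
        inactive = [f for f, bit in enumerate(mask, start=1) if bit == 0]
        if inactive and rng.random() < p_gene:
            new_feature = rng.choice(inactive)
            mutated[i] = new_feature
            mask[new_feature - 1] = 1   # new feature becomes active at once
            # The replaced feature stays active until the mask is rebuilt.
    return mutated

def alter_length(chromosome, mask, rng=random):
    """Grow or shrink the chromosome by exactly one feature when possible."""
    altered = list(chromosome)
    inactive = [f for f, bit in enumerate(mask, start=1) if bit == 0]
    grow = rng.random() < 0.5           # assumed: both directions equally likely
    if grow and inactive:
        new_feature = rng.choice(inactive)
        altered.insert(rng.randint(0, len(altered)), new_feature)
        mask[new_feature - 1] = 1
    elif not grow and len(altered) > 1:
        del altered[rng.randrange(len(altered))]   # mask bit stays active
    return altered
```

Both functions return a new chromosome and update `mask` in place, mirroring the immediate activation and deferred deactivation described in the text.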

Stopping criterion and control parameters:
The proposed VLGA approach makes use of the standard stopping criteria of the conventional GA, which are either to stop after a pre-defined number of generations or to stop when the evolution no longer introduces any significant improvement. Regarding the control parameters, VLGA uses the following: population size (PopSize), tournament selection parameters (q, k), reproduction probability (Pr), crossover probability (Pc), mutation probability (Pm) and alter length probability (Pal).
Case study: VLGA for lexical cue selection: To evaluate the proposed VLGA approach for feature selection in high dimensional data, it has been applied to feature selection in the context of designing a dialogue act recognition (DAR) model. To understand the context of the VLGA application, we start with a brief description of lexical cue selection in the context of DAR.
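The two stopping criteria can be combined in a small helper such as the one below. It is a sketch under assumptions: the name `should_stop` and the threshold `min_delta` are hypothetical, because the text does not quantify what counts as a significant improvement; the default window of 10 generations matches the setting reported for the experiments later in the paper.

```python
def should_stop(best_fitness_history, max_generations=None, patience=10, min_delta=1e-4):
    """Decide whether the GA should stop, given the best fitness of each generation."""
    generations = len(best_fitness_history)
    # Criterion 1: a pre-defined number of generations has been reached.
    if max_generations is not None and generations >= max_generations:
        return True
    # Not enough history yet to judge stagnation.
    if generations <= patience:
        return False
    # Criterion 2: no significant improvement within the last `patience` generations.
    recent_best = max(best_fitness_history[-patience:])
    earlier_best = max(best_fitness_history[:-patience])
    return recent_best - earlier_best < min_delta
```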
Lexical cue selection for DAR: Dialogue Act (DA) is defined as a concise abstraction of a speaker's intention in his utterance. It has roots in several language theories of meaning, particularly speech act theory (Austin, 1962), which interprets any utterance as a kind of action, called speech act and categorizes speech acts into speech acts categories (Searle, 1975). DA, however, extends speech act by taking into account the context of the utterance (Bunt, 1994). Figure 11 is a hypothetical dialogue annotated with DAs.
The automatic recognition of DA, Dialogue Act Recognition (DAR), is a task of crucial importance for the processing of natural language at discourse level. It is defined as follows: Given an utterance with its preceding context, how to determine the DA it realizes. Formally, it is a classification task in which the goal is to assign a suitable DA to the given utterance. Due to its importance for various applications such as dialogue systems, machine translation, speech recognition and meeting summarization, it has received a considerable amount of attention (Jurafsky, 2004).
Recently, Machine Learning (ML) techniques have become the prevailing trend for tackling the DAR problem (Fishel, 2007). In this regard, various ML techniques have been investigated and the resulting models have become known as cue-based models (Sridhar et al., 2009; Keizer and Akker, 2007). An ML technique builds a cue-based model of DAR by learning, from the utterances of a dialogue corpus, the association rules between surface linguistic features of utterances and the set of DAs. In doing so, ML exploits various types of linguistic features such as cue phrases, syntactic features and prosodic features. Among these types of linguistic features, cue phrases are the strongest (Jurafsky et al., 1998). They are defined by Hirschberg and Litman (1993) as linguistic expressions that function as explicit indicators of the structure of a discourse. Since not all phrases are relevant to DAR, the selection of relevant cue phrases prior to applying an ML technique is of crucial importance. A successful selection of cue phrases would speed up the learning process, reduce the required training data and improve the classification accuracy (Blum and Langley, 1997).
One cue-based model, which has been used as the context of the current research, is the Dynamic Bayesian Network (DBN) model (Yahya et al., 2006). As depicted in Fig. 12, the DBN model of DAR consists of T time slices, in which each slice is a Bayesian Network (BN) composed of a number of random variables. The DBN models a sequence of utterances over time in such a way that each BN corresponds to a single utterance. In this sense the DBN is time invariant, meaning that the structure and parameters of the BN are the same for all time slices. Moreover, in each BN there is a hidden random variable, which represents the DA that needs to be recognized, and a set of observation variables extracted from the linguistic features of the corresponding utterance. In this model, dynamic Bayesian ML algorithms have been employed to construct the DBN model from a dialogue corpus.
An essential issue that arose while building the DBN model of DAR is the specification of the observation variables. For this model, it has been suggested that the number of random variables in each BN should be equal to the number of DAs that the model recognizes. Moreover, each variable is defined as a logical rule in Disjunctive Normal Form (DNF) that consists of a set of cue phrases which are informative to one and only one DA, that is, a disjunction of cue phrases p_i in which each p_i is a cue phrase selected for the target DA. In doing so, each variable works as a binary classifier for the given DA, whose performance is expressed in terms of the following quantities:
TP = The number of times the selected phrases give true when the utterance belongs to the target DA
TN = The number of times the selected phrases give false when the utterance does not belong to the target DA
N = The total number of utterances in the dialogue corpus.
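How such a DNF variable acts as a binary classifier can be illustrated with the following sketch. The helper names `dnf_fires` and `count_tp_tn`, the representation of phrases as token tuples and the toy corpus are all hypothetical. The ratio (TP + TN) / N mentioned in the usage note is an assumed accuracy-style summary of these counts and is not necessarily the informativeness measure used in the paper.

```python
def dnf_fires(cue_phrases, utterance_tokens):
    """True if at least one cue phrase occurs contiguously in the utterance."""
    n = len(utterance_tokens)
    for phrase in cue_phrases:
        m = len(phrase)
        if any(tuple(utterance_tokens[i:i + m]) == tuple(phrase)
               for i in range(n - m + 1)):
            return True
    return False


def count_tp_tn(cue_phrases, corpus, target_da):
    """corpus is a list of (tokens, da_label) pairs; returns (TP, TN, N)."""
    tp = tn = 0
    for tokens, label in corpus:
        fired = dnf_fires(cue_phrases, tokens)
        if fired and label == target_da:
            tp += 1  # rule is true and the utterance belongs to the target DA
        elif not fired and label != target_da:
            tn += 1  # rule is false and the utterance belongs to another DA
    return tp, tn, len(corpus)


# Toy corpus, invented for illustration only.
toy_corpus = [
    (["i", "want", "ticket"], "statement"),
    (["how", "much", "?"], "query-ref"),
    (["ticket", "price", "?"], "query-ref"),
]
```

On the toy corpus, `count_tp_tn([("ticket",)], toy_corpus, "statement")` counts one true positive and one true negative out of three utterances, so the assumed ratio (TP + TN) / N would be 2/3.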
The literature indicates that only the ranking approaches have been investigated for this task (Samuel et al., 1999; Webb et al., 2005; Lesch, 2005; Kats, 2006; Verbree et al., 2006), owing to their computational efficiency and regardless of their weakness with respect to the relevance and redundancy of the selection. Therefore, in addition to examining the proposed VLGA on lexical cue selection, a number of ranking approaches that are commonly applied for cue phrase selection have been evaluated. The overall procedure of these approaches is to score each potential feature according to a particular metric and then pick out the best n features. Table 1 contains a list of the ranking approaches that have been selected as baseline approaches. In their formulas, f denotes the feature and c denotes the class, which represent a phrase and a DA respectively.
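The score-then-pick procedure can be sketched as follows, using mutual information as the example metric. The formula used here is the standard pointwise form MI(f, c) = log(P(f, c) / (P(f) P(c))) estimated from counts, which may differ in detail from the variant listed in the paper's table; the function names and the toy data in the test are hypothetical.

```python
import math
from collections import Counter


def mi_scores(corpus, target_class):
    """corpus is a list of (set_of_features, class_label); returns {feature: MI}."""
    n = len(corpus)
    n_c = sum(1 for _, c in corpus if c == target_class)
    # Number of instances containing each feature, overall and within the class.
    n_f = Counter(f for feats, _ in corpus for f in set(feats))
    n_fc = Counter(f for feats, c in corpus if c == target_class for f in set(feats))
    scores = {}
    for f in n_f:
        if n_fc[f] == 0:
            scores[f] = float("-inf")  # feature never co-occurs with the class
        else:
            # log( P(f, c) / (P(f) * P(c)) ) expressed with raw counts
            scores[f] = math.log((n_fc[f] * n) / (n_f[f] * n_c))
    return scores


def top_n(scores, n):
    """Pick the n best-scoring features, as a ranking approach does."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Because every feature is scored in isolation, two highly correlated phrases both rank near the top, which is the redundancy issue raised above.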

Settings of the experiments:
To evaluate the proposed VLGA and the baseline approaches on lexical cue selection in the abovementioned context, the SCHISMA dialogue corpus (Andernach et al., 1995), a collection of 64 dialogues in the domain of information exchange and transaction in a theater, has been annotated with DAs from the DAMSL coding scheme (Allen and Core, 1997). First, each utterance is subdivided into one or more segments and the dialogue acts are assigned to the segments. The current study focuses on a selected set of these DAs. After annotating the corpus with DAs, the following processes were performed to generate the phrases space:
• Tokenization: Tokenization occurs at utterance level and a token is defined as a sequence of letters or digits delimited by a separator (e.g., ".", ":", ";"). In this process, all punctuation marks are discarded except "?", which is treated as a token
• Removing morphological variations: It has been noticed that most of the morphological variations in the SCHISMA corpus are plural and tense variations, which are not significant for the recognition process
• Semantic clustering: Certain words are clustered into semantic classes based on their semantic relatedness and each occurrence of these words is then replaced with the cluster name. For the SCHISMA corpus, the following semantic clusters were identified:
  • Show Name: Any show name that appears in the corpus
  • Player Name: Any player name that appears in the corpus

N-gram phrases generation:
In this process all phrases that consist of one, two and three words were generated from each utterance in the corpus.

Removing less frequent phrase:
To reduce the dimension of the phrases space, the phrases that occur fewer times than a frequency threshold were removed. Based on the experiments of Webb et al. (2005), the frequency threshold was set to 3. The above preprocessing steps resulted in a phrases space of 1336 phrases. This phrases space was used in the experiments of the baseline approaches. However, in the subsequent experiments further preprocessing steps were introduced, which give the phrases space of each DA a different size.
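The tokenization, n-gram generation and frequency filtering steps above can be sketched as a small pipeline. This is an illustration under assumptions: the function names are hypothetical, tokens are matched as runs of ASCII letters or digits, morphological normalization and semantic clustering are omitted, and a phrase's frequency is taken to be its total number of occurrences in the corpus.

```python
import re
from collections import Counter

# A token is a run of letters or digits; "?" is the only punctuation kept.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|\?")


def tokenize(utterance):
    return TOKEN_RE.findall(utterance)


def ngrams(tokens, max_n=3):
    """All phrases of one, two and three consecutive tokens."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def phrase_space(utterances, min_freq=3):
    """Keep phrases whose corpus frequency reaches the threshold."""
    counts = Counter(p for u in utterances for p in ngrams(tokenize(u)))
    return {p for p, c in counts.items() if c >= min_freq}
```

With the threshold of 3 used in the paper, a phrase seen only once or twice is dropped from the phrases space.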

RESULTS
In the following, the results obtained from four experimental cases are presented.
Baseline approaches: Each of the ranking approaches listed in Table 2 was applied to the selection of cue phrases for each DA. More specifically, for each DA, each ranking approach ranked the phrases using its own metric. Then, the fitness value, F(P), along with the informativeness value, Info(P), and the complexity value, L(P)/N, of the top k phrases (k = 1, 2, ..., n) in the ranked list were calculated, and the top k phrases that maximize the fitness value, F(P), were taken as the selected set of phrases for that DA.
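The prefix selection step can be sketched as follows. The function name `best_prefix` is hypothetical and the fitness measure is passed in as a callable, since its exact form is defined elsewhere in the paper; the toy fitness in the test is invented purely to exercise the code.

```python
def best_prefix(ranked_phrases, fitness):
    """Return the top-k prefix of a ranked list that maximizes fitness, and its value."""
    best_k, best_f = 0, float("-inf")
    for k in range(1, len(ranked_phrases) + 1):
        f = fitness(ranked_phrases[:k])  # evaluate the top-k phrases as a set
        if f > best_f:
            best_k, best_f = k, f
    return ranked_phrases[:best_k], best_f
```

Only n candidate subsets are evaluated, one per prefix, which keeps the baseline cheap but ties the selection to the order produced by the ranking metric.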
VLGA Approach: Case 1: The aim of this case of experiments is to evaluate the proposed VLGA approach on the selection of cue phrases for each DA given in Table 6. The settings of the control parameters are as follows: PopSize = 500, q = 10, k = 0.7, r = 0.3, Pc = 0.7, Pr = 0.1, Pal = 0.2, Pm = 0.1, and the stopping criterion is to stop if there is no significant improvement within 10 generations. Table 5 summarizes the results obtained from this case of experiments. The parameter values were selected in light of Mitchell (1996); in addition, sensitivity experiments were performed for some parameters (PopSize, Pc, Pr, Pm, Pal) and the best values found were selected. Additionally, the run of VLGA was repeated five times and the results of the best run are reported. Figures 13 and 14 are examples of the GA evolution during the selection of the cue phrases for the statement DA. The curves correspond to the best evolutionary trends. In general, it can be noticed that there is rapid growth in the early generations, followed by a long period of slow evolution until the stopping criterion is met. This reflects the nature of the search space of cue phrases, which is highly multimodal and contains many peaks. An interesting aspect of the average population fitness curve is that, despite the fluctuations, the curve shows a general tendency towards improving the average fitness value, particularly in the early generations.

Dialogue acts considered in this study, with their descriptions and counts:
• Statement: The speaker makes a claim about the world (817)
• Query-if: The speaker asks the hearer whether something is the case or not (108)
• Query-ref: The speaker asks the hearer for information in the form of references that satisfy some specification given by the speaker (598)
• Positive-answer: The speaker answers in the positive (561)
• Negative-answer: The speaker answers in the negative (72)
• No-Blf: The utterance is not tagged with any blf DA (968)

An efficient way to exploit negative phrases, described by Zheng et al. (2007), is to select positive and negative phrases independently based on their use. For cue phrase selection in DAR, the positive phrases can be used to indicate the membership of an instance in the target DA, and the negative phrases can be used to increase the relevancy of the positive phrases by confidently rejecting instances which do not belong to the target DA but still contain the positive phrases. For example, in the SCHISMA dialogue corpus, the positive phrase "ticket" is relevant for both the statement and query-ref DAs. To increase the relevancy of this cue phrase for the statement DA, negative cue phrases such as "how much" and "?", which are relevant to the query-ref DA but not to the statement DA, might be selected and conjoined with "ticket" so as to accept only the utterances that contain "ticket" and do not contain "how much" or "?". In general, the aim here is to select positive and negative cue phrases that form a rule in which each positive phrase pp_j is conjoined with the negations of the negative phrases np_j1, ..., np_jk associated with it. To account for the negative cue phrases, each phrase that occurs within an utterance belonging to the target DA is marked positive and each phrase that occurs within an utterance not belonging to the target DA is marked negative. Some phrases occur both in utterances labeled with the target DA and in utterances not labeled with it; hence it is possible to find two identical phrases, one marked negative and one marked positive. Table 5 summarizes the results obtained from this stage of experiments.
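The way negative phrases refine a positive phrase can be sketched as below, using the "ticket" example from the text. The names `contains` and `rule_fires` and the representation of a rule as a list of (positive phrase, negative phrases) pairs are hypothetical choices made for illustration.

```python
def contains(tokens, phrase):
    """True if the phrase (a tuple of tokens) occurs contiguously in tokens."""
    m = len(phrase)
    return any(tuple(tokens[i:i + m]) == tuple(phrase)
               for i in range(len(tokens) - m + 1))


def rule_fires(rule, tokens):
    """rule is a list of (positive_phrase, negative_phrases) pairs.

    The rule fires if some positive phrase is present while none of the
    negative phrases associated with it is present.
    """
    return any(
        contains(tokens, positive)
        and not any(contains(tokens, negative) for negative in negatives)
        for positive, negatives in rule
    )


# "ticket" indicates a statement unless "how much" or "?" also appears.
statement_rule = [(("ticket",), [("how", "much"), ("?",)])]
```

With this rule, the tokens of "i want a ticket" are accepted, whereas those of "how much is a ticket ?" are rejected because the negative phrases are present.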
Information about a phrase's position within an utterance is useful for increasing its relevancy for a given DA. This stage of experiments was conducted to investigate the ability of the genetic-based approach to select cue phrases after incorporating the phrase's positional information. To do that, each positive phrase in the phrase space was marked with one of three possible positional labels, which represent the position of the phrase within the utterance. These labels are: Begin, if the phrase occurs at the beginning of any utterance labeled with the target DA; End, if the phrase occurs at the end of any utterance labeled with the target DA; and Contain, if the phrase occurs elsewhere. A given phrase may occur in different positions within utterances. In this case, multiple instances of the phrase are created, each with a different position label. Consequently, each DA has a search space of a different size, as shown in Table 5. The genetic-based approach was applied with the same parameters specified in the previous stages and the results are shown in Table 6.
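The positional labeling can be sketched as follows. The function name `position_labels` is hypothetical, and the treatment of a phrase that spans the whole utterance (labeled both Begin and End here) is an assumption, since the text does not cover that case.

```python
def position_labels(tokens, phrase):
    """Return the set of positional labels of a phrase within one utterance."""
    phrase = tuple(phrase)
    m = len(phrase)
    labels = set()
    for i in range(len(tokens) - m + 1):
        if tuple(tokens[i:i + m]) != phrase:
            continue
        at_begin = i == 0
        at_end = i + m == len(tokens)
        if at_begin:
            labels.add("Begin")
        if at_end:
            labels.add("End")
        if not at_begin and not at_end:
            labels.add("Contain")
    return labels
```

Each (phrase, label) pair returned for the utterances of the target DA becomes a separate entry in the phrase space, which is why the search space grows and differs from one DA to another.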
Validation experiments: The aim of this stage of experiments is to validate the use of the proposed genetic-based approach in an ML application. More specifically, the cue phrases generated from the above experiments were used to build a DBN model for DAR. The hypothesis is that the more relevant the cue phrases, the more accurate the DAR. First, the sets of cue phrases generated by the genetic-based approach in each of the previous stages were used to specify the DBN random variables as described earlier, so that each random variable is a binary classifier for a single DA. Then, DBN ML algorithms were used with 10-fold cross-validation to construct the structure of the DBN model, estimate its parameters and estimate its recognition accuracy, using the probabilistic networks library (Intel, 2004), which is freely available from http://www.intel.com/research/mrl/pnl . The same experiment was repeated using the sets of cue phrases generated by MI, and the results of these experiments are summarized in Table 6.

DISCUSSION
In the following, the results obtained from four experimental cases are discussed.

Baseline approaches:
The results of the baseline approach experiments are given in Table 3. From these results, it can be observed that MI and OR on the one hand, and IG and χ² on the other, perform similarly within each pair in three respects. First, from the complexity values, L(P)/N, it is clear that MI and OR tend to select a larger number of phrases than IG and χ². Second, the Info(P) values of MI and OR are higher than those of IG and χ². Third, as a direct result of the similarity in Info(P) and L(P)/N values within each group, (MI, OR) and (IG, χ²), the pattern of the fitness values is similar within each group, although between the two groups the comparison of fitness values is not conclusive.
The similarity between the two groups, (MI, OR) and (IG, χ 2 ), can be understood through the following facts. For each DA, each phrase has two sides, positive and negative. The positive side depends on the presence of the phrase in the utterances labeled with the target DA and the absence of the phrase from the utterances labeled with other DAs.
The negative side depends on the absence of the phrase from the utterances labeled with the target DA and the presence of the phrase in the utterances labeled with other DAs. Based on that, a ranking approach is classified as either a one-sided metric or a two-sided metric, depending on whether or not its metric accounts for the negative side of the phrase (Bunt, 1994). The one-sided metrics rank the phrases according to their positive side only; therefore the top k phrases in the list are the phrases with the highest positive sides and, consequently, the lowest negative sides. The two-sided metrics rank the phrases according to a combination of both positive and negative sides; therefore the top k phrases in the list are the phrases with the highest negative or positive sides.
From Table 2, it is clear that MI and OR are one-sided metrics and that IG and χ² are two-sided metrics. It is also clear that the Info(P) subpart of the fitness measure, Eq. 2, which was used for the selection of cue phrases from the ranked list, depends on the positive side of the phrases rather than the negative side. Therefore, the ranking produced by the one-sided metrics is more appropriate for the fitness measure than that of the two-sided metrics, which explains the higher Info(P) values of the cue phrases selected by MI and OR. However, the inability of these approaches to account for the correlation between cue phrases leads to the selection of a large number of cue phrases in the case of MI and OR. In other words, the ranking approaches assume that the relevance of a set of phrases is equal to the sum of the individual relevance of each phrase, which leads to redundant selection.
The general conclusion that can be drawn from this stage of experiments is that the ranking approaches are not able to maintain a tradeoff between the two subparts of the fitness function. They tend to optimize one subpart at the expense of the other.
VLGA approach: Case 1: The results of this case of experiments provide empirical evidence of the efficiency of the VLGA approach for the selection of useful features from high dimensional data. In more detail, a comparative look at the results in Tables 3-5 shows that the VLGA approach outperforms the ranking approaches for cue phrase selection. This is evident from the differences between the fitness values, F(P), of the VLGA and ranking approaches. The informativeness values, Info(P), of the VLGA approach are higher than the corresponding values of the ranking approaches. With regard to the complexity of the selected cue phrases, L(P)/N, the VLGA approach tends to select a smaller number of phrases than the MI and OR ranking approaches, yet more than IG and χ². To confirm the difference in fitness, a paired t-test on the F(P) values of the MI ranking approach and the VLGA approach was performed at significance level p < 0.05 with 5 degrees of freedom. The obtained t value (t = 3.1123) shows that the difference is statistically significant.
The above findings are direct results of the ability of the VLGA approach to maintain a trade-off between informativeness and complexity, which can be attributed to two factors. First, in the VLGA approach, both the evaluation and the selection processes are based on the fitness measure, which depends on the subsequent use of the selected phrases. In contrast, in the ranking approaches the evaluation of the phrases is based on the ranking metric, which evaluates each phrase on its intrinsic properties, whereas only the selection depends on the fitness measure. The second factor is the ability of the VLGA approach to account for the correlation between the selected cue phrases. Unlike the ranking approaches, the VLGA approach evaluates the selected cue phrases as a whole, rather than evaluating each phrase individually and then assuming that the relevancy of the set is equal to the sum of the individual relevancy of each phrase, which leads to redundant selection.
VLGA approach: Case 2: It appears from the results in Table 6 that there is an improvement in the fitness values, F(P), of the selected cues for each DA, which can be attributed to the improvement of the corresponding Relev(P) values due to the use of the cue's positional information. In terms of complexity, L(P)/N, there is a slight decrease in its values for some DAs; however, there are cases where the L(P)/N values are similar or even better. For instance, for the positive-answer DA there is an obvious improvement in both values, Relev(P) and L(P)/N. This is empirical evidence of the ability of the genetic-based approach and of the role of the positional information.
The analysis of the statistical significance of the difference between the fitness values, F(P), of Table 6 and Table 5, using a paired t-test at significance level p < 0.05 with 5 degrees of freedom, confirms this conclusion. The obtained t value (t = 4.0410) shows that the difference is statistically significant.
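The significance checks reported above can be reproduced in outline with a paired t-test such as the sketch below. The function name `paired_t` is hypothetical and the numbers in the usage note are not the paper's data. The constant 2.571 is the standard two-tailed critical value of Student's t at the 0.05 level for 5 degrees of freedom, which corresponds to six paired observations; whether the paper used a one-tailed or two-tailed test is not stated.

```python
import math
import statistics

# Two-tailed critical value of Student's t for alpha = 0.05 and 5 degrees of freedom.
T_CRIT_DF5 = 2.571


def paired_t(xs, ys):
    """Return (t statistic, degrees of freedom) for paired samples xs and ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1
```

Both reported statistics, t = 3.1123 and t = 4.0410, exceed this critical value, which is consistent with the conclusion that the differences are significant at the 0.05 level.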
Validation experiments: It is clear that the difference between the performance of the genetic-based approach and that of MI in cue phrase selection affects the accuracy of the DBNs for DAR, on the basis that the better the cue selection approach, the higher the recognition accuracy. To understand the influence of the cue phrase selection approaches on the recognition accuracy, it should be borne in mind that the construction of the DBN models of DAR is based on the binary representation of the datasets that result from the extraction of the random variables from the utterances. In this representation, utterances that belong to a certain DA should have a distinct pattern, which in the ideal case is composed of n - 1 bits with value 0 and a single bit with value 1 (n is the number of random variables) that corresponds to the random variable of this DA. The quality of the representation therefore depends on the relevancy of the selected cue phrases that form the random variables. In other words, the better the cue selection approach, the better the data representation and, consequently, the better the constructed DBN models.

CONCLUSION
In this study, an adapted GA approach for feature selection in high dimensional data is introduced. The proposed approach is a variable length GA with specialized genetic operators developed specifically for this task. Several stages of experiments were conducted and the obtained results suggest a number of important conclusions. Firstly, the results confirm that the ranking approaches are not the optimal approaches for cue phrase selection in DAR and similar high dimensional domains. The selection in these approaches is independent of the subsequent use and they are not able to account for the correlation between the selected features. Secondly, the results show that the ability of the proposed genetic-based approach to account for the correlation between the selected cues enables it to select a minimal number of relevant phrases, as is apparent from the large reduction in the number of selected cues. Thirdly, in contrast to the ranking approaches, the proposed genetic-based approach shows its ability to exploit the negative phrases to increase the relevancy of the selected cue phrases. Fourthly, the results confirm that the cue's positional information is useful for improving the relevancy of the selected cue phrases. In general, the proposed genetic-based approach has shown its efficiency for the selection of useful cue phrases for DAR. Finally, although the genetic-based approach was applied to cue phrase selection, it can be applied to feature selection in any similar high dimensional domain.