Mining Sports Articles using Cuckoo Search and Tabu Search with SMOTE Preprocessing Technique

Sentiment analysis is one of the most popular domains for natural language text classification and is crucial for improving information extraction. However, the massive volume of available data is one of the biggest obstacles to accurate opinion mining. Selecting highly discriminative features from an opinion mining dataset therefore remains an open research topic. This study presents a two-stage heuristic feature selection method for classifying sports articles that combines Tabu search with Cuckoo search via Lévy flight, where the Lévy flight prevents the solution from being trapped at local optima. Comparative results on a benchmark dataset show that our method improves the overall accuracy from 82.6% to 89.5%.


Introduction
The Internet is a rich source of diverse points of view and an increasing number of individuals use the Web as a medium for sharing their opinions and attitudes in text. This includes online product or service reviews, travel advice, social media discussions and blogs, customer recommendations, movie and book reviews and stock market predictions (Zhou et al., 2013). This motivates the development of tools that automatically extract and analyze public opinion for business marketing or social studies and that help understand consumers' preferences (Liu, 2010). Sentiment analysis, which involves classifying sentences as objective or subjective, is challenging because interpreting subjectivity in natural language requires deeper investigation (Pang and Lee, 2008). Moreover, subjectivity analysis must account for the fact that expressions and sentence phrases may express varying intensities depending on the context in which they occur. Furthermore, text articles need not be classified as entirely subjective or entirely objective. Hence, subjectivity can be expressed in different ways, as proposed in (Liu, 2012), and it is considered highly domain-dependent since it is affected by the sentiments of individual words.
There is a great need for an automated solution to differentiate objective and subjective articles (Pang and Lee, 2008). Consequently, many features have been proposed for subjectivity detection, ranging from lexical and syntactic features to semantic features, including phrase patterns, N-grams, character- and word-level lexical features and phrase-level sentiment scores (Liu, 2012). As a result, the large scale of such feature datasets is a significant challenge.
Feature selection aims to significantly reduce computational overhead and consequently enhance overall classification performance by eliminating irrelevant and insignificant features from datasets before model construction (Pang et al., 2002; Turney, 2002), which is an essential requirement in text-based sentiment analysis problems.
The Tabu search and Cuckoo search algorithms have gained significant attention from researchers (Chen et al., 2021; Srivastava et al., 2012). The motivation of the proposed work is to design a two-stage bio-inspired hybrid algorithm based on Tabu search and Cuckoo search via Lévy flight to select the optimal features in a sports-related text dataset for subjectivity analysis. This study combines the Tabu algorithm's ability to converge to a solution with the Cuckoo mechanism of backtracking from local optima via Lévy flight (Glover, 1989). Cuckoo search has been widely used in adaptive search strategies for constructing computational models (Yang and Deb, 2009). One of its most desirable properties is that it is computationally efficient and easy to implement, with a small number of parameters (Yang and Deb, 2009). Tabu Search (TS) is added to reduce the number of iterations and the execution time of the algorithm, thus reducing the overall complexity (Chen et al., 2021). Among machine learning algorithms, the Random Forest method (RandF) has received increased attention in several classification problems (Glover, 1989; Yang and Deb, 2009). RandF is an ensemble machine learning technique developed by (Breiman, 2001). This classifier has been well utilized in many classification problems but is relatively uncommon in sentiment analysis.
The main objective is to improve prediction accuracy using a resampling scheme to overcome the dataset imbalance issue, as there is a trade-off between accuracy and the size of the generated feature subsets. In addition to the methods mentioned above, the MLP, SimpleLogistic, k-NN, RandF and C4.5 classifiers are used to evaluate the performance of our proposed feature selection technique in terms of precision, ROC and Cohen's kappa coefficient on the dataset used by (Hajj et al., 2019).
The main contributions of this article are as follows:
- Formulating the sentiment analysis problem as a two-stage process
- Applying Tabu search and Cuckoo search via Lévy flight to feature selection
- Applying the SMOTE technique to balance the training data in the classification stage
- Applying several classification models in the classification stage

The remainder of this study is organized as follows. Section 2 explains the theoretical background of the feature selection methods and the proposed technique. The evaluation procedure, the dataset and the experimental results are presented in Section 3. Finally, the conclusions are summarized in Section 4.

Methodology
One motivating goal in research on improving classification performance is to apply hybrid learning approaches instead of individual ones. In our method, we first select features with the Cuckoo search algorithm and then apply Tabu search to construct a new feature subset.

Tabu Search Technique
Tabu search is an iterative memory-based algorithm proposed by Glover in 1986 to solve combinatorial optimization problems (Glover, 1989; 1990). Since then, Tabu search has been successfully applied to several multiclass classification problems (Hajj et al., 2019). It combines a local search mechanism with a Tabu memory mechanism.
Tabu search starts with an initial solution $X \in \Omega$, where $\Omega$ is the set of feasible solutions. The algorithm then searches and evaluates all possible neighboring solutions $N(X)$ to obtain a new solution with an improved objective value. A candidate solution $X' \in N(X)$ can be reached from $X$ if $X'$ is not registered in the Tabu list or if it satisfies the aspiration criterion (Tahir et al., 2004b). If the candidate solution $X'$ is better than $X_{best}$, the best solution is updated; otherwise, Tabu search will move uphill to escape local minima.
Tabu search avoids cycling by forbidding revisits to previously visited solutions for a certain number of iterations, which markedly improves the performance of the local search. The neighborhood search then resumes from the new solution $X'$ until the stopping criterion is met (Korycinski et al., 2003; Sait and Youssef, 1999).
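To make the procedure concrete, the following is a minimal sketch of Tabu search over binary feature-selection masks. The `evaluate` function, the flip-one-bit neighborhood, the tenure of 7 and the iteration budget are illustrative assumptions, not the exact configuration used in this study.

```python
import random

def tabu_search(n_features, evaluate, n_iter=100, tenure=7, seed=42):
    """Tabu search over binary feature-selection masks (illustrative sketch).

    evaluate(mask) is assumed to return a fitness value, e.g. the
    cross-validated precision of a classifier trained on the selected features.
    """
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_features)]  # initial solution X
    best, best_score = current[:], evaluate(current)
    tabu_until = {}  # feature index -> iteration until which flipping it is tabu

    for it in range(n_iter):
        admissible = []
        for j in range(n_features):
            neighbour = current[:]
            neighbour[j] ^= 1  # flip one bit: a neighbour X' in N(X)
            score = evaluate(neighbour)
            # Aspiration criterion: a tabu move is allowed if it beats X_best.
            if tabu_until.get(j, -1) < it or score > best_score:
                admissible.append((score, j, neighbour))
        if not admissible:
            continue
        score, j, current = max(admissible)  # best admissible move (may go uphill)
        tabu_until[j] = it + tenure          # forbid reversing this move for a while
        if score > best_score:
            best, best_score = current[:], score
    return best, best_score
```

The short-term memory (`tabu_until`) is what prevents the cycling described above, while the aspiration criterion keeps the memory from blocking genuinely better solutions.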

Cuckoo Search Algorithm
The Cuckoo search algorithm is derived from the peculiar reproductive behavior of certain cuckoo species. These species lay their eggs in randomly chosen nests of other host birds, and the eggs closely match the patterns of the hosts' own eggs, which reduces the chance of discovery (Yang and Deb, 2014). The cuckoos rely on the host birds to incubate their eggs. When a host bird recognizes an unfamiliar egg, it usually rejects the egg or abandons its nest.
In the cuckoo algorithm, each egg in a nest represents a possible solution and a foreign cuckoo egg represents a new solution. The goal is to use potentially better solutions (cuckoos) to replace the solutions in the nests (Civicioglu and Besdok, 2013). The Cuckoo search algorithm generates a new candidate solution (nest) $x_i^{(t+1)}$ for a cuckoo $i$ (Kaveh and Bakhshpoori, 2013; Rodrigues et al., 2013):

$$x_i^{(t+1)} = x_i^{(t)} + \alpha \, s \oplus \text{Lévy}(\lambda)$$

where $s$ is the step size and $\alpha > 0$ is the step-size scaling factor, which depends on the problem of interest; in most cases, $\alpha$ is set to 1. The symbol $\oplus$ denotes entry-wise multiplication, similar to that used in the PSO algorithm. Cuckoo search relies on Lévy flights to avoid local optima (Korycinski et al., 2003). A Lévy flight explores the solution space by performing a random walk whose step lengths $s$ are drawn from a heavy-tailed Lévy distribution that occasionally produces large steps (Tahir et al., 2004a), given by:

$$\text{Lévy}(\lambda) \sim u = t^{-\lambda}, \quad 1 < \lambda \leq 3$$
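As an illustration, the sketch below draws Lévy-distributed step lengths with Mantegna's algorithm, a standard way to simulate Lévy flights, and applies one position update. The exponent `beta` and scaling `alpha` values are assumptions, not parameters reported in this study.

```python
import math
import numpy as np

def levy_steps(beta=1.5, size=1, rng=None):
    """Draw Lévy-flight step lengths via Mantegna's algorithm."""
    rng = rng or np.random.default_rng()
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size)   # heavy-tailed numerator
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)   # many small steps, occasional large ones

def cuckoo_update(x, alpha=1.0, rng=None):
    """One Cuckoo-search update: x_(t+1) = x_t + alpha * s, with s ~ Lévy."""
    s = levy_steps(size=x.shape[0], rng=rng)
    return x + alpha * s

# Example: perturb a 5-dimensional candidate solution (nest).
new_nest = cuckoo_update(np.zeros(5), alpha=1.0)
```

The occasional large steps are what let the search jump out of a local optimum instead of refining it indefinitely.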

Synthetic Minority Over-Sampling (SMOTE) Technique
During classification, particularly on imbalanced datasets, there is a great challenge in that classifiers tend to ignore minority classes. The SMOTE technique was proposed by Chawla et al. in 2002 as a solution that over-samples the minority class to improve classification sensitivity for minority class instances. SMOTE adopts an over-sampling approach that operates at the feature level to balance the number of instances per class (Chawla et al., 2002).
The SMOTE technique resamples the minority class by randomly interpolating new synthetic instances between each minority instance and its nearest minority neighbors, controlled by an over-sampling rate β% and the number K of nearest minority class neighbors. Depending on the required rate β%, SMOTE randomly adds new instances until the dataset is balanced. For illustration, for a minority sample $x_o$, if the required rate β% is 200% and K is 3, a neighbor among the three nearest is randomly selected twice, once per synthetic instance. A line is drawn from $x_o$ to the randomly chosen neighbor and a random point on that line becomes the synthetic instance. Thus, a new synthetic instance $x_s$ is created by:

$$x_s = x_o + \delta \left( x_o^{(t)} - x_o \right)$$

where $x_s$ denotes the new synthetic instance, $x_o^{(t)}$ is the t-th selected nearest neighbor of $x_o$ in the minority class and $\delta$ is a random number ($\delta \in [0, 1]$).
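A minimal sketch of this interpolation, assuming `X_min` is the matrix of minority class instances; neighbor search here is a brute-force Euclidean scan rather than an optimized k-NN implementation.

```python
import numpy as np

def smote(X_min, rate=2, k=3, seed=42):
    """Generate `rate` synthetic samples per minority instance (rate=2 ~ 200%)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x_o in X_min:
        dist = np.linalg.norm(X_min - x_o, axis=1)   # distances to all minority samples
        neighbours = np.argsort(dist)[1:k + 1]       # k nearest, skipping x_o itself
        for _ in range(rate):
            t = rng.choice(neighbours)               # pick a random neighbour x_o^(t)
            delta = rng.random()                     # delta drawn uniformly from [0, 1)
            synthetic.append(x_o + delta * (X_min[t] - x_o))  # x_s on the joining line
    return np.vstack(synthetic)

# Example: 4 minority samples in 2-D yield 8 synthetic ones at a 200% rate.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote(X_min)
```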

Random Forest Classifier (RandF)
RandF is a promising tree-based ensemble classifier proposed by Breiman, built from a combination of tree predictors. It consists of individual base classifiers in which each tree is grown using a random vector sampled independently from the classification input vector, enabling much faster tree construction. For classification, the votes of all trees are combined using a rule-based approach or an iterative error-minimization technique that reduces the weights of correctly classified samples.

The building of an ensemble of classifiers in RandF can be summarized as follows (Breiman, 1996); a configuration sketch follows the list:

- The RandF training algorithm starts by constructing multiple trees. In this study, we use random trees with no pruning to build the RandF classifier, which makes it light from a computational perspective.
- The training set for each tree is formed by randomly sampling the training dataset using a bootstrapping technique with replacement. This step is called bagging (Breiman, 2001). The selected samples are called in-bag samples and the rest are set aside as out-of-bag samples.
- For each generated training set, approximately one-third of the in-bag data are duplicates (sampling with replacement) and are used for building the tree. The remaining training samples (the out-of-bag samples) are used to test the tree's classification performance. Figure 1 illustrates the data sampling procedure. Each tree is constructed from a different bootstrap sample.
- RandF increases tree diversity by using a random subset of features (four features in this study) to construct the nodes and leaves of each random tree. According to (Breiman, 2001), this step minimizes the correlation among features, decreases sensitivity to noise in the data and increases classification accuracy at the same time.
- Building a random tree begins at the top of the tree with the in-bag dataset. The first step selects a feature at the root node and splits the training data into subsets, creating a branch for each possible value of that feature. Tree design requires a suitable attribute selection measure for splitting, and the root node is chosen to maximize the dissimilarity between classes.
- If the information gain is positive, the node is split. Otherwise, the node becomes a leaf that predicts the most common target class in the training subset.
- The partitioning procedure is repeated recursively at each branch node using the subset that reaches the branch and the remaining attributes, until all attributes have been used. At each step, the remaining attribute with the highest information gain is selected next. Eventually, the most frequent target class among the training samples reaching a node is assigned as that node's classification decision.
- The procedure is repeated to build all trees.
- After all trees are built, the out-of-bag dataset is used to test the individual trees and the entire forest. The resulting average misclassification error can be used to adjust the weight of each tree's vote; in this study, the RandF implementation gives every tree the same weight.
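The sketch below expresses this configuration with scikit-learn's RandomForestClassifier; the synthetic data, the 100-tree forest size and the random seed are assumptions made only to keep the example self-contained and runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with the same shape as the sports-article set: 1,000 rows,
# 52 features, roughly 65.8% majority class.
X, y = make_classification(n_samples=1000, n_features=52, weights=[0.658],
                           random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,  # number of unpruned random trees (assumed count)
    max_features=4,    # four random features per split, as described above
    bootstrap=True,    # bagging: in-bag samples drawn with replacement
    oob_score=True,    # test each tree on its out-of-bag samples
    random_state=42,
)
forest.fit(X, y)
print("out-of-bag accuracy estimate:", forest.oob_score_)
```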

Dataset
In this study, we used the dataset from a previous study (Hajj et al., 2019) for the assessment. The dataset comprises 52 features extracted from a corpus of 1,000 sports articles collected from over 50 interactive websites, including NBA.com, Fox Sports and Eurosport UK. The first 48 features capture syntactic information about the corpus, while the last 4 are semantic features.
The feature set is built on the concept of measuring the frequency counts of objective and subjective words in the text. First, the positive, negative and objective scores of each word are summed and the sum is used to normalize that word's scores. Then, the subjective and objective word counters are updated by comparing the normalized scores against a threshold: the subjective word counter is incremented if a word has a positive or negative score greater than the threshold; otherwise, the objective word counter is incremented.
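A sketch of this counting scheme, assuming each word comes with (positive, negative, objective) scores in the style of SentiWordNet; the 0.5 threshold is illustrative, as its value is not stated here.

```python
def count_subjectivity(word_scores, threshold=0.5):
    """word_scores: dict mapping word -> (positive, negative, objective) scores."""
    subjective = objective = 0
    for word, (pos, neg, obj) in word_scores.items():
        total = pos + neg + obj          # sum used to normalise the word's scores
        if total == 0:
            continue
        pos_n, neg_n = pos / total, neg / total
        if pos_n > threshold or neg_n > threshold:
            subjective += 1              # strongly polar word
        else:
            objective += 1
    return subjective, objective

# Example: one clearly polar word and one neutral word.
counts = count_subjectivity({"great": (0.75, 0.0, 0.25), "table": (0.0, 0.0, 1.0)})
```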
The list of features and their descriptions is shown in Table 1. A detailed explanation of the feature list can be found in a previous study (Rizk and Awad, 2012). Table 2 describes the distribution over the two classes, which clearly shows that the dataset is imbalanced (65.8% of instances are labeled as objective statements).

Evaluation Metrics
The proposed model's performance is measured using precision, the Receiver Operating Characteristic (ROC) and Cohen's kappa coefficient. Based on the confusion matrix in Table 3, precision is defined as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The ROC curve is a graphical plot for evaluating two-class decision problems and a standard metric for analyzing classifier performance over a range of trade-offs between the True Positive (TP) and False Positive (FP) rates (Dietterich, 2000; Smialowski et al., 2012). The area under the ROC curve ranges from 0.5 for a completely random classifier to 1.0 for a perfect classifier.
Kappa error, or Cohen's kappa coefficient, is a useful measure for comparing the performance of different classifiers and the quality of selected features; it ranges from 1 to -1 (Ben-David, 2008). A kappa value approaching 1 indicates strong agreement, while a value approaching -1 indicates poor agreement. The kappa measure is calculated as:

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the total agreement probability and P(E) is the theoretical probability of chance agreement.
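All three metrics are available off the shelf; the sketch below computes them with scikit-learn on hypothetical labels, predictions and positive-class scores.

```python
from sklearn.metrics import cohen_kappa_score, precision_score, roc_auc_score

# Hypothetical ground truth, hard predictions and positive-class probabilities.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]

print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))      # area under the ROC curve
print("kappa:    ", cohen_kappa_score(y_true, y_pred))  # (P(A) - P(E)) / (1 - P(E))
```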

Results
Feature selection is carried out in two main steps. In the first step, we construct a reduced feature space using the Cuckoo search technique, decreasing the original feature dimension from a to b. In the second step, the feature count is further reduced from b to c, forming the new feature space.
The classification results with and without feature selection are reported. Table 4 reports the precision, ROC and Cohen's kappa coefficient of the MLP, SimpleLogistic, k-NN, RandF and C4.5 classifiers, all evaluated with a tenfold cross-validation procedure (Schumacher et al., 1997), to provide a baseline for the suggested feature selection techniques. Before feature selection, the SimpleLogistic classifier (82.6%) slightly outperformed the RandF (82.1%), MLP (80.7%) and k-NN (75.6%) classifiers.
In the first phase, we investigated the effect of several feature subset construction methods on classification performance, using four nature-inspired or swarm intelligence techniques: the Ant, Bat, PSO and Cuckoo algorithms. The features selected by these techniques are summarized in Table 5. The dimensionality of the sports article features was remarkably reduced, from 52 attributes to only 17 to 23 attributes. For example, PSO reduced the feature count by 67%.
RandF achieved a precision rate of 82.6% when the Cuckoo algorithm selected only 34% of the features. The other classifiers, MLP, SimpleLogistic, k-NN and C4.5, achieved precision rates of 79.5, 82.4, 73.3 and 78%, respectively. Comparing classifiers, RandF achieved the most significant improvement, particularly when combined with feature reduction. However, this accuracy is still lower than the best accuracy achieved on this database, which underlines the importance of feature reduction for eliminating superfluous features for all classifiers. Although the Ant and Cuckoo algorithms produced comparable results, we proceeded with Cuckoo search based on Lévy flights in the next stage because it converges quickly and efficiently, is less complex and is easier to implement with fewer parameters than the PSO, Ant and Bat algorithms (Beheshti and Shamsuddin, 2013; Kamat and Karegowda, 2014).

Figure 2 shows the agreement between the feature selection techniques. The Venn diagram shows that the Ant, Bat and PSO approaches share 12 features: the frequencies of foreign words, modal auxiliaries, singular common nouns, pre-determiners, comparative adverbs, superlative adverbs, particles, base form verbs, WH-determiners, first-person pronouns, second-person pronouns and objective words.
Next, the best reduced feature set is passed to the proposed Tabu search approach, which further reduces the data dimensionality and finds an optimal set of features. At the end of this step, a subset of features is chosen for the next round. The optimal features found by the Tabu search technique are shown in Table 7.
The number of features was remarkably reduced, to only 11 to 17 attributes, so less storage space is required to execute the classification algorithms. After applying Tabu search in the second phase, the RandF classifier again outperformed the other classifiers, achieving a precision rate of 83.1%, which validates the features selected by the proposed reduction technique.
The Cuckoo search feature selection technique enhanced performance in most cases. Table 8 reports the comparative classification performance of the second phase, in which the Tabu search algorithm detects the most significant features. The RandF classifier achieved the highest precision rate (83.1% with 11 features). Tabu search thus reduced the feature dimension while improving classification performance.

Table 8 also shows the final classification results of the proposed technique for mining sports article data. The SMOTE technique was applied to the reduced dataset to increase the number of minority class samples: the training set was resampled with SMOTE at an over-sampling rate of 200% to balance the number of instances in the two classes, making the dataset more diverse and balanced. The highest precision rate was achieved by RandF with the suggested feature selection technique (89.5% with an 80% reduction in features), outperforming the classification results obtained with all features. These results demonstrate that the selected features are sufficient to represent the dataset's class information.
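As a rough sketch of this final stage, the pipeline below combines SMOTE over-sampling with a RandF classifier under tenfold cross-validation using imbalanced-learn and scikit-learn. `X_reduced` stands in for the 11 Cuckoo-Tabu-selected features and is synthesized here, so the printed number will not reproduce the 89.5% reported above; note also that imbalanced-learn's pipeline applies SMOTE to the training folds only.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the reduced dataset: 1,000 rows, 11 selected features,
# roughly 65.8% majority class as in Table 2.
X_reduced, y = make_classification(n_samples=1000, n_features=11,
                                   weights=[0.658], random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=3, random_state=42)),   # balance the two classes
    ("forest", RandomForestClassifier(max_features=4, random_state=42)),
])
scores = cross_val_score(pipe, X_reduced, y, cv=10, scoring="precision")
print("mean tenfold precision:", scores.mean())
```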

Discussion
Cuckoo and Tabu search helped improve classification performance with a limited number of features. In terms of precision, ROC and Cohen's kappa coefficient, the proposed technique with SMOTE significantly improved the classification accuracy of the minority class while keeping the classification accuracy of the majority class high. The nine features common to the Cuckoo-Tabu search and the three other techniques were the frequencies of foreign words, modal auxiliaries, singular common nouns, pre-determiners, comparative adverbs, base form verbs, WH-determiners, first-person pronouns and second-person pronouns (Fig. 3). Table 9 shows the effect of classifying with only the nine features common to the Ant, Bat, PSO and Cuckoo-Tabu search techniques; these results are not better than those of the first phase. The results of the suggested two-stage attribute selection show better performance than both the unpreprocessed dataset and each attribute selection technique used independently. Moreover, the results surpass those reported on the same dataset with a modified Cortical Algorithm (CA) approach (Hajj et al., 2019), which achieved an accuracy of 85.6% with a 40% reduction in features.

Conclusion
In this study, we investigated the impact of a two-stage heuristic feature selection method using Tabu search and Cuckoo search with Lévy flight for classifying sports articles. The experiments showed that applying the Tabu search and Cuckoo search techniques remarkably reduced the number of features. The suggested model enhanced precision performance and achieved promising results. Furthermore, altering the original data with the SMOTE technique enlarged the region of the minority class, which ultimately helped in handling the imbalanced data.