Application of Data Mining Classifiers on Sunflower Edible Oil Bleaching Process: A Comprehensive Comparative Analysis

Sunflower oil is widely used as an edible oil. It is commonly extracted from sunflower seed by the solvent extraction method, yielding crude sunflower oil. Crude sunflower oil contains undesirable impurities and dark colors that must be removed. The bleaching process, in which bleaching earth is used during refining, removes this color. Specifications of the crude oil such as impurity, free fatty acid ratio, wax and color index, together with the temperature of the process, the vacuum of the process and the amount of bleaching earth used, affect the bleaching output color value. In this study, machine learning algorithms are used to predict the bleaching output color. For this purpose, the Waikato Environment for Knowledge Analysis (WEKA), an open-source data mining workbench, is used. Fifteen well-known machine learning classifier algorithms suitable for our data, such as k-nearest neighbors, multilayer perceptron and random forest, are applied. Each algorithm is tested on a real dataset by the 10-fold cross-validation method. The correlation coefficient, mean absolute error and root mean squared error are calculated for each algorithm and benchmarked. Results show that the Random Forest classifier is the most effective classifier for our data. Additionally, the Wilcoxon Signed-Rank statistical test is conducted to determine whether the Random Forest classifier remains the most effective classifier for other k-fold cross-validation settings.


Introduction
Today, rather than a shortage of data, there is the problem of extracting meaningful information from large volumes of data. Data mining techniques help to transform large volumes of data into meaningful information so that data can be classified, grouped, used for past and future predictions, or utilized to design effective business strategies for an enterprise (Arora et al., 2020; Sharma, 2020a). Data mining involves the use of complex data analysis tools to detect previously unknown, valid patterns and relationships in large data sets (Karasozen et al., 2006). These tools can include mathematical algorithms, statistical models and machine learning methods. One of the machine learning techniques is classification, which is used to forecast group membership for data instances (Kumar et al., 2014). Machine learning is defined by Stanford University as a science that enables computers to carry out some intelligent activities based on actual data without being explicitly programmed (Sharma et al., 2019). Machine learning algorithms are used in different research domains such as healthcare (Gupta et al., 2013; Vijiyarani and Sudha, 2013; Sharma et al., 2017; Kaur and Sharma, 2019; Meng and Saddeh, 2019; Sharma, 2020b), stock management (Khedr and Yaseen, 2017; Sharma et al., 2018; Zhong and Enke, 2019) and software (Gilal et al., 2018; Sharma, 2017; Dias Canedo and Cordeiro Mendes, 2020). Additionally, benchmarking of machine learning algorithms is used in many real-life applications: to recognize handwritten digits (Bottou et al., 1994), to classify clinical samples (Sampson et al., 2011), to predict heart diseases (Austin et al., 2013; Abdar et al., 2015; Pouriyeh et al., 2017; Tougui et al., 2020), to detect software defects (Aleem et al., 2015; Abdou and Darwish, 2018; Alsaeedi and Khan, 2019; Aquil and Ishak, 2020), to classify diabetes mellitus (Maniruzzaman et al., 2017; Rodríguez-Rodríguez et al., 2021) and to predict congenital heart defects (Luo et al., 2017).
Sunflower oil is one of the indispensable sources of vegetable oil in Turkey. It is extracted from sunflower seed by the solvent extraction method, and crude oil is obtained after extraction. Crude sunflower oil is refined to make it edible. Both physical and chemical refining processes are used for these oils. These processes are, in order: degumming, neutralization, dewaxing and winterization, bleaching and deodorization.
This study considers the bleaching process, which is applied in order to remove the color. Crude sunflower oil has some impurities and dark colors. The bleaching earth is used to remove the undesirable dark color. Impurity, free fatty acid ratio, wax, color index, the temperature of the process, the vacuum of the process and the amount of bleaching earth used affect the bleaching output color value.
The ability to predict forthcoming changes is of great importance for proper decision-making (Goli et al., 2018). In this study, 15 well-known machine learning classifier algorithms are used to predict the bleaching output color of sunflower oil from the aforementioned specifications. In addition, these 15 algorithms are compared by calculating the correlation coefficient, mean absolute error and root mean squared error. Finally, the Wilcoxon Signed-Rank statistical test is used to see whether the best-performing algorithm stays the same for several k-fold cross-validation settings.
The contributions of this study can be listed as follows: (i) machine learning techniques are used for a real-life application, (ii) machine learning techniques are applied for the first time to the bleaching process, (iii) the results obtained are evaluated for a possible decision support system for the input parameters of the bleaching process. That is, with the best algorithm obtained, the output color can be predicted against changes that may occur in the input parameters. This provides the opportunity to adjust the input parameters in advance to achieve the desired output color.
The rest of the paper is organized as follows. In the following part, the framework of the study is presented and each stage (data description, data pre-processing, k-fold cross-validation, classifier algorithms) is explained step by step. Results are then discussed and a statistical test is conducted. Finally, the study is concluded.

Study Framework
The study framework steps are shown in Fig. 1.

Data Description
In this study, a real data set is used to predict the bleaching output color. Impurity, free fatty acid ratio, wax, color index, the temperature of the process, the vacuum of the process, the amount of bleaching earth are the essential specifications that affect the bleaching output color value. The real 79-day data set obtained from the bleaching process is presented in Table 1.

Data Pre-processing
Pre-processing of the bleaching process data is carried out using min-max normalization. The min-max normalization formulation is given in Eq. (1).
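A minimal sketch of min-max normalization as in Eq. (1), where each value is rescaled to the [0, 1] range; the example temperature values are hypothetical:

```python
# Min-max normalization: x' = (x - min) / (max - min), rescaling each
# feature of the data set to the [0, 1] range.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical process temperatures (degrees Celsius):
temps = [95.0, 100.0, 105.0, 110.0]
print(min_max_normalize(temps))  # [0.0, 0.333..., 0.666..., 1.0]
```

The same rescaling is applied independently to each of the seven input specifications before the classifiers are trained.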

K-fold Cross Validation
K-fold cross-validation, one of the commonly used methods, is conducted to test the proposed algorithms. In k-fold cross-validation, the data is divided into k subsets. The method is repeated k times; each time, one of the k subsets is used as the validation set and the other k-1 subsets together form the training set (Kenger and Özceylan, 2020). In this study, each algorithm is evaluated on the dataset by the 10-fold cross-validation method.
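The splitting scheme described above can be sketched as follows; this is a simplified index-based illustration, not the WEKA implementation:

```python
# K-fold cross-validation splitting: each of the k subsets serves once
# as the validation set while the remaining k-1 subsets form the
# training set.
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        validation = indices[start:end]
        training = indices[:start] + indices[end:]
        folds.append((training, validation))
    return folds

# 79 samples (as in this study's data set) split into 10 folds:
for train_idx, val_idx in k_fold_splits(79, 10):
    pass  # fit the classifier on train_idx, evaluate on val_idx
```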

Classifier Algorithms
In this study, a real-life application is conducted. Fifteen machine learning classifier algorithms are applied: K-Nearest Neighbors classifier (KNN), Simple Linear Regression (SLR), Gaussian Processes (GP), KStar classifier (KS), Decision Table classifier, Decision Stump, ZeroR, Random Tree, M5Rules, REPTree, Locally Weighted Learning, M5 model trees, M5P, Random Forest and Multilayer Perceptron. Each algorithm is briefly described below.
The K-nearest neighbors classifier algorithm is one of the most popular classification techniques in data mining (Gazalba and Reza, 2017). The algorithm classifies an instance according to a majority vote of its k most similar instances (Aha et al., 1991). It can select a suitable value of k based on cross-validation and can also apply distance weighting.
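A minimal sketch of the k-nearest neighbors idea for a numeric target such as the output color: the prediction is the average of the k closest training instances by Euclidean distance. All data values below are hypothetical.

```python
import math

# Predict a numeric target as the mean of the k nearest training
# instances (Euclidean distance, no distance weighting).
def knn_predict(train_X, train_y, query, k=3):
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    neighbors = [y for _, y in dists[:k]]
    return sum(neighbors) / k

# Hypothetical normalized inputs and output color values:
X = [[0.1, 0.2], [0.2, 0.2], [0.9, 0.8], [0.8, 0.9]]
y = [1.0, 1.1, 3.0, 3.2]
print(knn_predict(X, y, [0.15, 0.2], k=2))  # 1.05, mean of the two nearest targets
```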
The steps of the K-nearest neighbors classifier algorithm are given in (Damarta et al., 2021).
The simple linear regression model is used to demonstrate or forecast the relationship between two variables or factors (Gupta, 2015). Equation (2) represents the formulation of the simple linear regression:

Y = α + βX (2)

X and Y are the two factors included in a simple linear regression analysis, and the regression model defines how Y is related to X. In the equation, α represents the y-intercept of the regression line and β represents the slope. A regression line can demonstrate a negative linear relationship, a positive linear relationship, or no relationship (Duda and Hart, 1973).
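The intercept and slope of Eq. (2) can be fitted by ordinary least squares; a minimal sketch with hypothetical data points:

```python
# Ordinary least squares fit of Y = alpha + beta*X:
# beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X).
def fit_simple_linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points lying exactly on y = 1 + 2x:
alpha, beta = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # 1.0 2.0
```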
Gaussian processes classifier provides a powerful unifying model for approximating and reasoning about datasets. Gaussian processes supply the 'glue' that allows us to conduct active mining on spatial clusters (Ramakrishnan et al., 2005).
KStar is an instance-based classifier and a type of lazy learning. KStar holds a training set and then performs small processing; finally looks for a test set. Then the test set is classified according to the similarity of the stored training set (Sultana et al., 2016).
The decision table classifier chooses the most discerning attributes from the training sample set to form a search table, which is then used to classify new cases. Dissimilar subsets of attributes are assessed by using a performance estimation method (Sinha and Zhao, 2008).
Decision stumps are one level decision trees (Iba and Lankley, 1992). In classification problems, each node in a decision stump represents a feature in an instance to be classified and each branch represents a value that the node can take (Kotsiantis et al., 2006).
The ZeroR classifier is a simple algorithm that always predicts the majority class. Although ZeroR has no forecasting ability, it is useful as a benchmark for other classification methods (Nasa and Suman, 2012).
The random tree algorithm builds a tree considering K randomly chosen attributes at each node. It does not use a pruning method (Nithya and Santhi, 2015).
M5rules works as follows: A tree learner (in this case model trees) is applied to the full training dataset and a pruned tree is learned. Next, the best leaf (according to some heuristic) is made into a rule and the tree is discarded. All instances covered by the rule are removed from the dataset. The process is applied recursively to the remaining instances and terminates when all instances are covered by one or more rules (Holmes et al., 1999).
RepTree generates multiple trees in different iterations by using the regression tree logic. Then, it chooses the best one from all generated trees (Kalmegh, 2015).
The locally weighted learning approach is a training data selection method built on the idea of applying a naive Bayes model to the neighborhood of the test instance, instead of to the whole training data (Jiang et al., 2013).
M5 model trees classifier algorithm is modified from the original M5 tree algorithm by (Wang and Witten, 1997).
The M5P tree algorithm processes enumerated attributes and attribute missing values. All enumerated attributes are turned into binary variables before tree construction (Zhan et al., 2011).
The random forest classifier is a machine learning method and classifies data by using decision trees. The basic principle is to construct a multitude of independent trees built from an initial sample. The forest construction uses two random processes. Firstly, every tree of the forest is constructed from a random sample picked with replacement. Then, a decision tree is built as a binary tree from this sample (Paul et al., 2017).
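The two random processes can be illustrated with a deliberately simplified sketch in which each "tree" is reduced to a mean predictor, so only the bootstrap-and-average mechanics are visible; a real forest grows full decision trees over random attribute subsets:

```python
import random

# Draw a bootstrap sample: the same size as the data, with replacement.
def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]

# Train the "forest": each tree is reduced here to the mean of its
# bootstrap sample (a stand-in for a fully grown decision tree).
def train_forest(targets, n_trees=50, seed=0):
    rng = random.Random(seed)
    return [sum(s) / len(s)
            for s in (bootstrap_sample(targets, rng)
                      for _ in range(n_trees))]

# Prediction: average the individual trees' predictions.
def forest_predict(forest):
    return sum(forest) / len(forest)

# Hypothetical output color values:
forest = train_forest([2.0, 2.2, 1.9, 2.1, 2.0], n_trees=200)
print(forest_predict(forest))  # close to the training mean
```

Averaging over many bootstrap-trained predictors is what reduces the variance of a single tree; the same principle applies when the base learner is a full decision tree.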
A multilayer perceptron is a feedforward artificial neural network model that maps the input data set onto the appropriate output set. It derives from the standard linear perceptron and uses three or more layers of neurons with nonlinear activation functions. The complexity of the multilayer perceptron network can be altered by changing the number of layers and the number of units in each layer (Khalil Alsmadi et al., 2009).
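A minimal sketch of the feedforward computation: inputs pass through a hidden layer with a nonlinear activation (tanh here) to a single output. The weights below are illustrative, not trained values.

```python
import math

# One forward pass of a tiny multilayer perceptron with a single
# hidden layer and one numeric output.
def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    # hidden layer: weighted sum followed by a nonlinear activation
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    # output layer: linear combination of the hidden activations
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out

# Two inputs, two hidden units, one output (hypothetical weights):
out = mlp_forward([0.5, 0.2],
                  w_hidden=[[1.0, 0.0], [0.0, 1.0]],
                  b_hidden=[0.0, 0.0],
                  w_out=[1.0, 1.0], b_out=0.0)
print(out)  # tanh(0.5) + tanh(0.2)
```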
The machine learning algorithms briefly described above have been applied as a case study in the sunflower oil industry. These algorithms are used to predict bleaching output color and results of each algorithm are compared with each other.

Evaluation
In this study, machine learning algorithms are used to predict the bleaching output color. Waikato Environment for Knowledge Analysis (WEKA) is used to run the 15 well-known machine learning classifier algorithms. For each algorithm, the correlation coefficient, mean absolute error and root mean squared error are calculated and benchmarked. The strength of the linear relationship between two variables is measured by the correlation coefficient (Ratner, 2009), which is calculated by Eq. (3) (Goli et al., 2019). Mean absolute error is calculated by summing the absolute values of the errors to obtain the 'total error' and then dividing the total error by the sample size (Willmott and Matsuura, 2005); the formulation is presented in Eq. (4) (Goli et al., 2019).
Finally, the root mean square error measures the mean magnitude of the error (Saigal and Mehrotra, 2012); it is calculated by Eq. (5) (Goli et al., 2021). The results are presented in Table 2. According to Table 2, the random forest classifier is the best classifier algorithm and the ZeroR classifier is the worst classifier algorithm for our data. Locally weighted learning is the second-best algorithm, followed by linear regression and the REPTree classifier. With the best algorithm obtained, the output color can be predicted against changes that may occur in the input parameters. This provides the opportunity to change the input parameters in advance, so the desired output color can be achieved.
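The three benchmark metrics of Eqs. (3)-(5) can be sketched as follows; the actual and predicted values are hypothetical:

```python
import math

# Correlation coefficient (Eq. 3): covariance of actual and predicted
# values divided by the product of their standard deviations.
def correlation(actual, predicted):
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sa * sp)

# Mean absolute error (Eq. 4): total absolute error over sample size.
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Root mean squared error (Eq. 5): square root of the mean squared error.
def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical output color values and predictions:
actual = [2.1, 2.4, 2.0, 2.6]
predicted = [2.0, 2.5, 2.1, 2.5]
print(correlation(actual, predicted),
      mae(actual, predicted),
      rmse(actual, predicted))
```

A higher correlation coefficient and lower MAE/RMSE indicate a better classifier, which is the basis of the ranking in Table 2.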

Statistical Test
As can be seen in Table 2, the Random Forest classifier is better than the other competing classifiers for 10-fold cross-validation. The Wilcoxon Signed-Rank statistical test is then used to see whether the Random Forest classifier remains the most effective classifier when 5-fold, 15-fold, 20-fold, 25-fold, 30-fold and 50-fold cross-validation are implemented instead of 10-fold. We propose two hypotheses to test this situation:

H0: The Random Forest classifier is not better than the other classifiers for all k-fold cross-validation settings.
Ha: The Random Forest classifier is better than the other classifiers for all k-fold cross-validation settings.

Table 3 shows the effectiveness of all classifiers according to 5-fold, 10-fold, 15-fold, 20-fold, 25-fold, 30-fold and 50-fold cross-validation. The Wilcoxon signed-rank test is conducted based on Table 3, and its results are shown in Table 4. The '+' in Table 4 refers to the sum of the rank values for which the Random Forest classifier is better than the other classifier. According to the Wilcoxon signed-rank test, if the test value is less than the critical value, the null hypothesis (H0) is rejected. Rejecting H0 means there is enough evidence at the 5% level of significance to support the claim that the Random Forest classifier is better than the other classifier for all k-fold cross-validation settings.
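A minimal sketch of how the signed-rank statistic behind Table 4 is obtained from paired performance values. The numbers below are hypothetical, and ties on the absolute difference are ranked in encounter order rather than given average ranks, for simplicity:

```python
# Wilcoxon signed-rank statistic for paired samples: drop zero
# differences, rank the rest by absolute value, sum the ranks of
# positive and negative differences; the test statistic is the
# smaller sum, compared against a tabulated critical value.
def wilcoxon_w(sample_a, sample_b):
    diffs = sorted((a - b for a, b in zip(sample_a, sample_b) if a != b),
                   key=abs)
    w_plus = sum(rank for rank, d in enumerate(diffs, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(diffs, start=1) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical correlation coefficients of two classifiers over the
# seven fold settings (5, 10, 15, 20, 25, 30, 50):
rf = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90]
other = [0.85, 0.88, 0.86, 0.87, 0.84, 0.86, 0.85]
print(wilcoxon_w(rf, other))  # 0: every difference favors rf
```

A statistic of 0, below the critical value for n = 7 pairs at the 5% significance level, would lead to rejecting H0 for this pair of classifiers.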
Wilcoxon Signed-Rank Test results are given in Appendix A as a paired comparison.

Conclusion
Bleaching is an important process for obtaining the proper output color of sunflower oil. Multiple factors affect the bleaching output color, such as impurity, free fatty acid ratio, wax, color index, the temperature of the process, the vacuum of the process and the amount of bleaching earth. We obtain a real 79-day data set from the bleaching process and normalize the data using min-max normalization. We select 15 well-known classifier algorithms suitable for the bleaching process data and conduct 10-fold cross-validation to test the selected algorithms. Finally, the correlation coefficient, mean absolute error and root mean squared error are calculated to benchmark the classifier algorithms. According to the results obtained, for our data set, the best and worst classifier algorithms are the random forest classifier and the ZeroR classifier, respectively.
In addition, the Wilcoxon Signed-Rank statistical test is conducted to test whether the Random Forest classifier remains the most effective classifier when 5-fold, 15-fold, 20-fold, 25-fold, 30-fold and 50-fold cross-validation are implemented. The Wilcoxon Signed-Rank test results also show that the Random Forest classifier is better than the other 14 classifiers.
For future research, (i) a decision support system can be developed by using new classifier algorithms and applied in the sunflower oil company, or other inputs for the bleaching process can be considered; (ii) the effects of inputs on output quality can be analyzed; (iii) the current data set can be tested with new or hybrid algorithms; (iv) the impact (weight) of each input parameter can be calculated.