Supervised Machine Learning Model-Based Approach for Performance Prediction of Students

Predicting students’ performance is one of the crucial issues in learning contexts, since it helps to develop alternative recommendation systems for academically weak students. Thus, several methods and practices have been applied for educational improvement. However, most of the existing methods do not determine the performance of the students. In this study, we evaluate six machine learning models (Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, AdaBoost, Stochastic Gradient Descent) to analyze and assess students’ achievements. Performance is evaluated in terms of accuracy, precision, sensitivity and F-measure. Among the selected models, the results indicate that Stochastic Gradient Descent is the most efficient when trained on a small dataset. In addition, it produces higher accuracy than the other models. This contribution aims to identify the best model for drawing conclusions about students' academic achievement.


Introduction
In recent years there has been a revival of interest among universities in adopting and innovating on global educational standards and requirements, as they face stiff competition to attract talented students and to retain them (Razaque et al., 2019). Thus, timely decisions on transforming the field of education and on modulating educational systems are in high demand (Varvarigos, 2020). The current system for evaluating students' knowledge barely shows their degree of preparation, especially for those who may fail the final examination. Due to the large flow of students and time constraints, professors are not always able to identify a "risky" student precisely. In addition, universities are increasingly introducing e-learning systems (Barteit et al., 2020). An automated student performance prediction method will allow teachers to track down weak students and thereby help them complete the current training course (Ivanović et al., 2018). The prediction of student performance is a fundamental task that is being researched using Educational Data Mining (EDM) (Injadat et al., 2020). This process predicts the value of an unknown variable that describes the students' outcomes (Pass/Fail), grades, marks, etc. A study in the field of EDM was presented in (Peña-Ayala, 2014). Later, a more detailed theoretical study of EDM was conducted in (Angeli et al., 2017).
The educational data mining technique was introduced to improve graduate students' performance in research (Ashraf et al., 2020). The purpose of their work was to solve the problem of students' low grades (Satı, 2018). The problem of predicting student performance was also addressed in (Amrieh et al., 2016). The results revealed a relationship between students' behavior and academic performance. Tomasevic et al. (2020) considered the use of the decision tree classifier as the main method for EDM research. Asif et al. (2017) aimed to justify the capabilities of data mining techniques in the context of higher education. The classification methods of data mining were applied. Yahya and Osman (2019) analyzed data mining algorithms (a rule learner, the nearest neighbor classifier and neural network) with application in graduate students' performance. An empirical analysis of data mining algorithms for educational purposes was discussed in (Zhang et al., 2018).
In this study, we evaluate six machine learning algorithms: Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, AdaBoost and Stochastic Gradient Descent. Sensitivity, precision, accuracy and F-measure are used as performance evaluation metrics. The purpose of the prediction is to estimate an unknown value, the final student rating; a model is built for each algorithm and all models are trained on the same dataset to identify the most effective algorithm. The key points of this study are as follows:
- Classification of students' performance based on several attributes
- Preprocessing techniques to convert raw data into a suitable form which can be used by mining algorithms
- Comparative analysis of the performance of learning models based on sensitivity, precision, accuracy and F-measure
The remainder of the paper is organized as follows: Section 2 discusses the problem and its significance. Section 3 presents the related work. Section 4 describes the materials and background of the supervised models used in this study. Section 5 presents the proposed methodology. Section 6 discusses the results. Section 7 gives the conclusion and future work.

Problem Identification and Significance
The main asset of a higher educational institution is its students. The performance of students reflects the quality of teaching and therefore affects the ranking of the university. Universities should deeply analyze their activities in the educational market to identify their uniqueness and build on it in their further development. In this regard, university management should focus more on the educational performance of learners. They should also analyze the characteristics of students at the admission stage to select the most promising ones. Moreover, academic performance plays an important role in the career development of students after graduation, since it is one of the factors that employers pay attention to in the labor market.
However, the current knowledge control system, based on tests, assesses students' performance only after the fact and does not allow potentially weak students to be identified during the educational process. As a result, students of educational institutions lose attractiveness for potential employers and universities go down in the ranking.

Related Work
In this section, the salient features of existing approaches are extensively discussed. Even though the use of mining techniques in predicting the academic performance of students is a relatively young field, many methods have been introduced on this subject. The reason for this is the huge potential for educational institutions to improve their performance.
There are various aspects that may affect the performance of students during their studies. These factors could be social, demographic and academic interests. Academically speaking, grades in previous classes are of great importance because they help to measure students' abilities and play a very significant role in predicting their final grades (Roy and Garg, 2017).
Hussain et al. (2019) performed a study that used machine learning techniques to predict the difficulties that students will encounter in a subsequent digital design course session. The study analyzed the data logged by a technology-enhanced learning (TEL) system called Digital Electronics Education and Design Suite (DEEDS) using machine learning algorithms; the system allows students to solve digital design exercises with different levels of difficulty while logging input data. The researchers observed that ANNs and SVMs can be integrated into the TEL system easily and achieve higher accuracy than the other algorithms.
Kumar and Singh (2019) combined the K-Means clustering algorithm with Artificial Neural Network and Support Vector Machine classification algorithms to assess student performance and reduce human effort. The dataset used in the study was taken from EDFacts and covers the results of state assessments in reading/language arts and mathematics for the 2010-11 session. Using a Bayesian classification method on 34 different attributes, they determined that some school attributes and some attributes related to subject marks were highly correlated with student academic performance. Hlosta et al. (2017) introduced a self-learning system that uses machine learning to find students at risk in a new course without any previous history data. Their approach uses information about already submitted assessments, which raises the problem of unbalanced data for training and testing models. The study validated that XGBoost achieved the best performance.
Czibula et al. (2019) proposed a new classification model, Students Performance prediction using Relational Association Rules (SPRAR), for predicting students' results in different academic disciplines by applying relational association rules (RARs). RARs extend classical association rules to express different relationships between data attributes. Tran et al. (2017) predicted students' performance in academic environments by analyzing and comparing a regression-based model and a recommendation-based model. In the regression-based approach, several skills related to the course were added with the aim of enhancing the overall predictive performance. Ahmad et al. (2020) highlighted the importance of Educational Data Mining. Their study was conducted to predict students' performance by investigating their demographic and academic factors using machine learning methods. The authors recommended predicting the academic performance of students by taking other circumstances into account, such as the personality, cognitive, psychological and economic domains.
The existing approaches focus either on determining outliers or on improving educational performance, whereas our study focuses particularly on accuracy, sensitivity, precision and F-measure using state-of-the-art algorithms. Based on the results, the best algorithm is recommended for improving the performance of educational institutions.

Materials and Background
This work uses several machine learning algorithms to solve classification problems. Figure 1 illustrates the steps of the system architecture. First, the data go through a preprocessing phase that converts raw data into a suitable form which can be used by mining algorithms. Once the data are cleaned, we split the dataset into training and testing sets in a ratio of 70% to 30%. The training set is partitioned into smaller splits and the model is trained using k-fold cross validation, while the testing set is used to validate the trained model and measure the overall performance.

A. Data Preparation
The dataset used in this study was taken from the University of California Irvine (UCI) machine learning repository. The name of the dataset is "Student Performance Data Set". The data describe student performance in Portugal and are built from the records of 395 students (Cortez (n.d.)).

B. Data Selection and Transformation
The data were collected from the students' database and from surveys. The predictors used to form the proposed model, which include grades, demographic data, social status and education-related features, are shown in Table 1.

C. Supervised Machine Learning Methods
In this section, we discuss the machine learning algorithms and describe the mechanism and hyperparameters of each. All the supervised machine learning algorithms mentioned below were implemented in Python using the scikit-learn module. This work used six different machine learning algorithms to solve the classification problem.
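As a rough illustration of how the six classifiers can be set up side by side in scikit-learn, the sketch below trains each one with default hyperparameters on synthetic data. The synthetic data, the dictionary keys and the default settings are assumptions made for the example; they are not the dataset or the configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 395 rows to mirror the dataset size,
# 10 made-up numeric features and a binary target.
X, y = make_classification(n_samples=395, n_features=10, random_state=0)

# Default hyperparameters throughout (assumption); tune them for real data.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "AB": AdaBoostClassifier(random_state=0),
    "SGD": SGDClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The numbers printed for synthetic data say nothing about the ranking of the models on the student dataset; the sketch only shows the shared fit-and-score interface.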

Decision tree
The decision tree method is a way of representing rules in a hierarchical, sequential structure, where each object corresponds to a single node that provides a solution, as illustrated in Algorithm 1. The field of application of decision trees is currently wide, and the tasks solved by this method can be grouped into classes such as the following:
- Data description. Decision trees allow information about the data to be stored in a compact form
- Classification. Decision trees do an excellent job with classification tasks, i.e., assigning objects to one of the previously known classes
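To make the idea of a node that "provides a solution" more concrete, the following sketch searches for the single threshold that best separates two classes on one feature, using Gini impurity. Gini impurity is one common splitting criterion and is assumed here for illustration; the example grades and outcomes are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, gini(labels))
    n = len(values)
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (t, impurity)
    return best


# Hypothetical example: previous grades and a pass/fail outcome.
grades = [5, 8, 9, 11, 14, 16]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]
threshold, impurity = best_threshold(grades, outcome)
print(threshold, impurity)
```

A full decision tree applies this search recursively to the left and right subsets until the nodes are pure or a depth limit is reached.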

Random Forest
Random Forest is a composition (ensemble) of many decision trees, which helps to reduce the problem of overfitting and improves accuracy in comparison with a single tree. The forecast is obtained by aggregating the responses of multiple trees. The trees are trained independently of each other (on different subsets), which not only avoids building identical trees on the same dataset, but also makes the method very convenient for use in distributed computing systems. The Random Forest algorithm is illustrated in Algorithm 2.
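The two ensemble steps described above, training each tree on a different subset and aggregating the responses, can be sketched in isolation as bootstrap sampling and majority voting. The helper names, the row indices and the three votes below are hypothetical and only illustrate the mechanism.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, giving each tree its own subset."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate the class predicted by each tree into one forecast."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # hypothetical row indices of a training set
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one sample per tree

# Hypothetical votes from three trees for a single student.
forecast = majority_vote(["pass", "fail", "pass"])
print(forecast)
```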

Support Vector Machine
This method was originally a binary classifier, although there are ways to make it work for multi-class classification tasks. The fundamental function of the algorithm is to find the best separating line, or hyperplane, which divides the data into two classes, as illustrated in Algorithm 3.
Algorithm 3. Support Vector Machine
1. Create $H$, where $H_{ij} = y_i y_j \, x_i \cdot x_j$.
2. Choose how significantly misclassifications should be treated, by selecting a suitable value for the parameter $C$.
3. Find $\alpha$ so that $\sum_{i} \alpha_i - \frac{1}{2} \alpha^{T} H \alpha$ is maximized, subject to the constraints $0 \le \alpha_i \le C$ for all $i$ and $\sum_{i} \alpha_i y_i = 0$.
4. Determine the set of Support Vectors $S$ by finding the indices such that $0 < \alpha_i \le C$.
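The quantities named in Algorithm 3 can be computed directly for a tiny example. The sketch below builds the matrix H, evaluates the standard soft-margin dual objective (the sum of the multipliers minus half of the quadratic form in H) and selects support vector indices. The function names, the two one-dimensional points and the chosen multipliers are hypothetical; a real solver would search for the maximizing multipliers instead of taking them as given.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_h(xs, ys):
    """H[i][j] = y_i * y_j * (x_i . x_j) for labels y in {-1, +1}."""
    n = len(xs)
    return [[ys[i] * ys[j] * dot(xs[i], xs[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, h):
    """sum_i alpha_i - 1/2 * alpha^T H alpha, the quantity maximized in step 3."""
    n = len(alpha)
    quad = sum(alpha[i] * h[i][j] * alpha[j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def support_vectors(alpha, c, tol=1e-9):
    """Indices whose multiplier satisfies 0 < alpha_i <= C."""
    return [i for i, a in enumerate(alpha) if tol < a <= c]


# Two hypothetical one-dimensional points, one per class.
xs = [[1.0], [-1.0]]
ys = [1, -1]
h = build_h(xs, ys)
alpha = [0.5, 0.5]  # satisfies sum_i alpha_i * y_i = 0
value = dual_objective(alpha, h)
sv = support_vectors(alpha, c=1.0)
print(h, value, sv)
```

For this symmetric example, setting both multipliers to a common value a gives the objective 2a - 2a^2, which peaks at a = 0.5, so the chosen multipliers are in fact the maximizer.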

Logistic Regression
Logistic regression is a form of multiple regression, the general purpose of which is to analyze the relationship between several independent variables (also called regressors or predictors) and the dependent variable. Binary logistic regression, as the name implies, applies when the dependent variable is binary (that is, it can take on only two values). In other words, with the help of logistic regression, it is possible to estimate the probability that an event will occur for a particular test subject (sick/healthy, loan repayment/default, etc.). To solve the problem, the regression problem can be formulated differently: instead of predicting a binary variable, we predict a continuous variable with values on the interval [0,1] for any values of the independent variables. This is achieved by applying the following regression equation (logit transform): $P = \frac{1}{1 + e^{-y}}$, where $P$ is the probability that an event of interest will occur; $e$ is the base of natural logarithms, 2.71...; and $y$ is the standard regression equation.
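A minimal sketch of this transform is given below: a linear regression output is computed first and then mapped to a probability. The coefficients and the two feature values are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math


def logistic(y):
    """Map a linear regression output y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def linear_predictor(features, weights, bias):
    """The standard regression equation y = b + sum_k w_k * x_k."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical coefficients and one student's two feature values.
y = linear_predictor([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
p = logistic(y)
print(y, p)
```

Here the linear output is zero, so the predicted probability is exactly one half; positive outputs push the probability toward 1 and negative outputs toward 0.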

AdaBoost
AdaBoost (short for Adaptive Boosting) is a machine learning algorithm proposed by Yoav Freund and Robert Schapire. This algorithm can be used in combination with several classification algorithms to improve their efficiency. The algorithm strengthens the classifiers by combining them into a "committee". AdaBoost is adaptive in the sense that each subsequent classifier is built with emphasis on the objects that were incorrectly classified by the previous classifiers. AdaBoost is sensitive to data noise and outliers. However, it is less prone to overfitting than other machine learning algorithms. The AdaBoost algorithm is shown in Algorithm 5.
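The adaptive step can be illustrated with a single round of the classical binary (discrete) AdaBoost update: the weighted error of the current classifier determines its vote, and the weights of misclassified objects are increased before the next classifier is trained. The function name and the four-sample example are hypothetical.

```python
import math


def adaboost_round(weights, predictions, labels):
    """One boosting round: weighted error, classifier vote, updated sample weights."""
    error = sum(w for w, p, t in zip(weights, predictions, labels) if p != t)
    alpha = 0.5 * math.log((1.0 - error) / error)  # assumes 0 < error < 1
    updated = [
        w * math.exp(alpha if p != t else -alpha)
        for w, p, t in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return error, alpha, [w / total for w in updated]


# Four equally weighted samples; only the first one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
predictions = [1, 1, -1, -1]
labels = [-1, 1, -1, -1]
error, alpha, new_weights = adaboost_round(weights, predictions, labels)
print(error, alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next classifier is pushed to get it right.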

Stochastic Gradient Descent
Stochastic gradient descent is a very efficient method for training linear classifiers and regressors. It simply replaces the actual gradient, calculated from the whole dataset, with an approximation calculated from a randomly selected subset. In stochastic (or "online") gradient descent, the gradient is approximated by the gradient of the cost function calculated on only one training example. The parameters are then changed in proportion to this approximate gradient, so the model parameters are updated after each training example. For large datasets, stochastic gradient descent can provide a significant advantage in speed compared to standard gradient descent. The SGD algorithm is presented in Algorithm 6.
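The per-example update can be sketched on the simplest possible model, a line with one weight and one bias fitted under squared loss. The learning rate, epoch count, seed and data below are assumptions chosen so that the example converges quickly; they are not settings from this study.

```python
import random


def sgd_linear(xs, ys, lr=0.02, epochs=2000, seed=0):
    """Fit y ~ w * x + b by updating the parameters after every single example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 computed on this one example only.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Noise-free hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear(xs, ys)
print(round(w, 4), round(b, 4))
```

Standard (batch) gradient descent would instead sum the gradient over all five examples before making a single update.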

Proposed Methodology
This section presents the dataset used in this study, the preprocessing of the data and the classification models leveraged for predicting students' academic performance.

A. Data Collection
It is very important to collect data properly, since the quality and quantity of the data affect how good the model is and how accurate the desired result will be. Irrelevant data can reduce accuracy and therefore need to be cleaned. Irrelevant data include missing values, extraneous data and other data that fall outside the common set. There are two options for handling them: either the data are removed from the sample, or they are replaced with dummy values. For example, missing values can be replaced with zeros. For the algorithm, approximate values are preferable to missing ones, so a trade-off has to be considered when preparing the dataset: the accuracy of the forecast versus the time spent on preparing the data. The data in this study contain predictors such as student scores and demographic and social data, which were collected through questionnaires.
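The two options for handling missing values can be sketched as follows. The records and field names are hypothetical, and a missing value is represented by None.

```python
def drop_incomplete(rows):
    """Option 1: remove every record that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fill_with_zero(rows):
    """Option 2: keep every record and replace missing values with zero."""
    return [{k: (0 if v is None else v) for k, v in row.items()} for row in rows]


# Hypothetical survey records; None marks an unanswered question.
rows = [
    {"absences": 4, "studytime": 2},
    {"absences": None, "studytime": 3},
]
dropped = drop_incomplete(rows)
filled = fill_with_zero(rows)
print(len(dropped), filled[1])
```

Dropping keeps only trustworthy values but shrinks a small sample further, while filling preserves the sample size at the cost of introducing artificial values.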
Students were classified into three categories, "good", "fair" and "poor", based on their results. In addition, to analyze what influences student performance, factors were added such as whether the student is in a relationship, alcohol use, the educational level of the parents, etc.
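A three-way labelling of this kind can be sketched as a simple thresholding function. The cut-off values below are hypothetical and assume a final grade on a 0 to 20 scale; the thresholds actually used in this study are not stated here.

```python
def grade_category(final_grade, good_cutoff=15, fair_cutoff=10):
    """Map a numeric final grade to 'good', 'fair' or 'poor' (hypothetical cut-offs)."""
    if final_grade >= good_cutoff:
        return "good"
    if final_grade >= fair_cutoff:
        return "fair"
    return "poor"


labels = [grade_category(g) for g in (17, 12, 4)]
print(labels)
```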

B. Data Preparation
After the data are collected, the order of the rows is randomized, since the original ordering might affect the learning process. In addition, the data are divided into two parts, one for training and the other for evaluating the model, using a 70/30 ratio.
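The shuffle-then-split step can be sketched with the standard library alone. The function name and the fixed seed are assumptions for the example; 395 is the number of student records mentioned for the dataset.

```python
import random


def shuffle_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows, then cut them into a training part and a testing part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Row indices stand in for the 395 student records.
train, test = shuffle_split(range(395))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same partition.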

C. Modeling Method
Several parameters were considered when selecting our model:
- Accuracy: It is not always necessary to obtain results with maximum accuracy. In some cases, approximate values are sufficient, which can significantly reduce the processing time
- Training time: The amount of time required to train a model depends on the chosen algorithm. As a rule, the lower the accuracy, the shorter the training time. Also, some algorithms depend more on the amount of data than others, and this can significantly affect the choice of algorithm, especially when time is limited
- Linearity: Many machine learning algorithms are linear, that is, they assume that the data classes can be separated by a straight line. For some problems linear algorithms work fine, but in other cases this assumption can significantly reduce the accuracy
- Number of parameters: The number of parameters directly affects the behavior of the algorithm, such as the number of iterations or the sensitivity to errors. Usually, the greater the number of parameters, the more trial and error is needed to find the best combination. On the other hand, a large number of parameters gives greater flexibility and can yield higher accuracy
Based on the characteristics listed above, six different models were assessed: DT, RF, SVM, AB, LR and SGD. The goal of this study is to choose the model with the highest prediction performance as the primary model.

D. Evaluation
While training a model is a key step, how the model generalizes on unseen data is an equally important aspect that should be considered in every machine learning pipeline. Model evaluation aims to estimate the generalization performance of a model on future (unseen/out-of-sample) data. Indeed, it's important to use new data when evaluating the model to prevent the likelihood of overfitting to the training set.

A. Research Model and Hypothesis
From the dataset described earlier, we can interpret the data by using hypothesis testing to answer the following questions:
- Do relationships affect student performance?
- Does drinking alcohol affect student performance?
- Does parental education affect the academic performance of their child?
- How does the frequency of hanging out with friends affect academic performance?
- Does a student's residence affect his or her academic performance?
Based on the analysis, relationships can affect students' performance in school and the grades they earn. Relationships play an essential role in understanding student achievement. The relational process is regarded as an inherent aspect of educational life and the foundation for encouraging performance, as shown in Fig. 2. Figure 3 shows the impact of drinking alcohol on students' performance, where students rate how often they drink alcohol on a scale from 1 to 5. The data show that increased alcohol consumption has a significant correlation with the final grade, resulting in lower academic performance. Figure 4 reveals the association between parents' qualifications and the academic performance of their children. Perhaps the most noticeable explanation of the relationship between parents' education and the academic achievement of their children is based on the assumption that parents learn something during schooling that affects the ways in which they interact with their children around learning activities in the home. Parents with higher education ensure that their children are exposed to many educational opportunities in their communities. We also estimate the influence of spending time with friends on educational attainment. More specifically, we rate the frequency from rarely to almost always, as illustrated in Fig. 5. We find that spending more time with friends has a significant correlation with the final grade.
We have also examined rural and urban differences in students' achievement. Many of the students who live in rural areas have lower academic achievement than the students who live in urban areas, as depicted in Fig. 6. The stronger performance of urban students is attributed to the better quality of their education and to the availability of information from various sources.
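The strength of a linear relationship such as the one between alcohol consumption and final grade is commonly summarized with the Pearson correlation coefficient. The sketch below computes it from scratch on made-up values; it illustrates the measure only and is not the statistical procedure or the data used in this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: alcohol level (1 to 5) and a final grade that falls with it.
alcohol_level = [1, 2, 3, 4, 5]
final_grade = [16, 14, 12, 10, 8]
r = pearson_r(alcohol_level, final_grade)
print(r)
```

A coefficient near -1 indicates that higher values of one variable go with lower values of the other; whether a coefficient is statistically significant requires an additional test that takes the sample size into account.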

B. Model Validation
Validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data. K-fold cross validation is one of the common methods used for evaluating ML models, where the data are divided into k subsets (known as folds). A model is trained using k−1 of the folds as training data and the resulting model is validated on the remaining fold, which is used as a test set to estimate a performance measure such as accuracy, sensitivity, precision or F-measure.
The performance measure reported by k-fold cross validation is the average of the values computed in the loop.
Our model is passed through 5-fold cross validation (K = 5), where the dataset is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train the model. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set. The K-fold algorithm used in our prediction model is shown in Algorithm 7.
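The fold rotation described above can be sketched as an index generator plus an averaging loop. The function names are hypothetical, and the scoring callback is a placeholder for training a model on the training indices and measuring it on the testing indices.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) so that each fold is the test set once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, score_fold, k=5):
    """Average the score returned by score_fold(train, test) over the k folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, k=5))
# Placeholder score: the fraction of samples held out in each fold.
mean_score = cross_validate(10, lambda train, test: len(test) / 10, k=5)
print(folds[0], mean_score)
```

In practice the rows are shuffled before the folds are formed, so that each fold is a representative sample of the data.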

C. Performance Evaluation
All the supervised machine learning algorithms mentioned in this study were deployed in Python (Appendix 1).
The overall performance metric is the average of all five separate evaluations. The confusion matrix in Table 2, together with Tables 3 and 4, shows the classification accuracy and other performance measures obtained from conducting five processes of applying the supervised ML algorithms; the corresponding charts are shown in Fig. 7.
We compare the results of six supervised machine learning algorithms: DT, RF, AB, LR, SVM and SGD. According to the results in Table 4, the highest accuracy exceeds 99% of all cases classified correctly, which is the best performance on our dataset. On the other hand, the AdaBoost model has the lowest accuracy, with 79% of the entire dataset classified correctly. In Fig. 7b, the whole dataset is partitioned into two splits: 70% for training the model and 30% for testing it. The highest accuracy is achieved by Stochastic Gradient Descent, while the lowest is obtained by Support Vector Machine.
The overall performance of processes three, four and five is shown in Fig. 7c to 7e, respectively. The highest accuracy score is obtained by the SGD classifier, which classifies most cases in our dataset correctly, while the Decision Tree classifier performs the worst and correctly classifies only half of our dataset.
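The four evaluation measures can be derived from the cells of a confusion matrix. The sketch below uses the binary case for brevity, with hypothetical counts; for the three student categories the same formulas are applied per class and then averaged.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity and F-measure from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Hypothetical counts: true/false positives and negatives.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```

The F-measure is the harmonic mean of precision and sensitivity, so it is high only when both are high.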

Conclusion and Future Work
In this study, we evaluated six machine learning algorithms, namely Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, AdaBoost and Stochastic Gradient Descent, for predicting student academic performance. The hypothesis testing results show that student academic performance may be affected by relationships, alcohol use, parental educational level and residential area. The methods were further evaluated in terms of sensitivity, precision, accuracy and F-measure obtained from five processes. The overall results on the observed dataset showed that the Stochastic Gradient Descent classification algorithm was the most accurate classifier and outperformed the other supervised machine learning classification algorithms in various cases. Future work involves developing intelligent machine learning algorithms that can handle a large collection of real data and achieve a higher accuracy percentage. More attributes may enhance the performance and give more precise predictions.

Author's Contributions
Abdul Razaque: Conceived of the presented idea, verified the analytical methods, supervised the findings of this work and discussed the results and contributed to the final manuscript.
Abrar M. Alajlan: Developed the theory and performed the computations, discussed the results and contributed to the final manuscript and wrote the manuscript with support from Dr. Abdul Razaque.

Ethics
This manuscript is original and has not been published elsewhere. The corresponding author confirms that the other author has read and approved the manuscript and that no ethical issues are involved.

Appendix:
Appendix 1: Software and tools used for conducting the experiment
Tool: Description
Python: Python is a high-level, interpreted, object-oriented programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic binding and dynamic typing, make it very good for Rapid Application Development (What is Python? Executive Summary. (n.d.)).
Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to share and create documents that have live equations, code, narrative text and visualizations (Real Python, 2020).
numpy: NumPy is the fundamental package for scientific computing with Python (ECOSYSTEM. (n.d.)).
pandas: Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language (Pandas, Python Data Analysis Library. (n.d.)).
matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers and four graphical user interface toolkits (Visualization with Python, Matplotlib. (n.d.)).
seaborn: Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures (An introduction to seaborn. (n.d.)).
statsmodels.api: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration (Welcome to Statsmodels's Documentation. (n.d.)).
scikit-learn: scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license (Scikit-Learn (n.d.)).