Preliminary Analysis of Malware Detection in Opcode Sequences within IoT Environment

With the development of technology and means of communication, the Internet of Things (IoT) has taken on an essential role in providing many services in daily life through millions of heterogeneous but interconnected devices and nodes. This development opens up many security and privacy challenges that can cause complete network breakdown, bypassed access control or the loss of critical data. This paper provides a preliminary analysis of malware detection within data generated by IoT-based devices and services in the form of operational code (Opcode) sequences. Three machine learning algorithms are evaluated and compared for accuracy, precision, recall and F-measure. The results showed that Random Forest (RF) achieved the best accuracy of 98%, followed by SVM and k-NN, both with 91%. The results are further analyzed based on the Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve to illustrate the difference in performance of all three algorithms when dealing with IoT data.


Introduction
Today, the Internet of Things (IoT) offers many services through the interconnection of a huge number of sensor devices, embedded systems and services (Mosenia and Jha, 2016; Azmoodeh et al., 2018). IoT has become a driving technology in many domains such as smart cities, intelligent transportation, and health and energy systems (D'Orazio et al., 2016; Patel et al., 2012). The massive expansion of IoT applications has resulted in a surge of data, which opens up many security and privacy challenges such as malware attacks (Tankard, 2015; D'Orazio et al., 2016; Watson and Dehghantanha, 2016). The core reason for these challenges is that any network is subject to threats and penetration from the devices that are connected to it (Li et al., 2019; Wazid et al., 2019).
Malware is the collective name for different types of malicious software, including viruses, ransomware and spyware. The main issue with malware detection lies in the ineffective methods used for signing and monitoring the suspected code for known security changes. This has led to many investigations into new methods and techniques that can overcome different attack vectors (Burguera et al., 2011).
Machine learning is a popular method for detecting attacks and malware, because a model that learns by extracting features from training data is able to recognize those features in data it has not been trained on (Rehman et al., 2018). In 2016, feature selection methods were investigated in anomaly detection systems using Principal Component Analysis (PCA) and the Guttman-Kaiser criterion (Kakavand et al., 2016). The study was not limited to reducing the dimensionality of the features but also aimed to preserve the information that is important in classifying the anomalies. The results showed a high intrusion detection rate of 97% with a false positive rate of 1.2%.
Milosevic et al. (2017) studied malware detection targeting Android systems. Their research used permissions and source code analysis with a bag-of-words representation model, and the features were implemented in a privacy and security protection application for Android devices called OWASP Seraphimdroid. The classification accuracy achieved was 89%, which increased to 95% with source code analysis. Subsequent research by Kakavand et al. (2018) applied two machine learning algorithms, Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), in a supervised learning process in order to classify applications as malware or benign. This research focused on Android application data and reported accuracies of 79.1% and 80.5% for SVM and k-NN, respectively.
A more recent work presented a new method that combined machine learning methods and blockchain technology to improve the performance of malware detection models on Android devices. The proposal was implemented using a sequencing approach that combined clustering and classification with blockchain technology, extracting information about malware and storing it back in the blockchain. The main purpose was to develop a malware database so that previously unseen malicious software can be detected more easily in the future.
Deep learning approaches have also been explored to classify data based on the dynamic approach to malware detection. A new method has been introduced to extract features in order to analyze dynamic behavior and to build a recurrent neural network model that extracts abstract features (Xiaofeng et al., 2018). This research also studied several sequence data processing steps to remove redundant data. The combined approach reached 99.3%, showing that classification performance is higher when machine learning and deep learning methods are merged than when the models are used separately.
In general, malware detection is an important and fundamental matter in providing security for IoT-based applications such as smart devices. According to Kaspersky Lab, in 2016 most Internet devices were unsafe, and most of them had a default password or security flaws that had not been fixed, which led to easy penetration of these devices (Kolias et al., 2017; HaddadPajouh et al., 2018; Goyal et al., 2019). Security experts have warned of the dangers that the Internet of Things (IoT) can pose, specifically malware, due to the widespread dependence on devices connected to the Internet. Organizations need a mechanism that is able to discover malware and suspicious bugs when their devices and services are connected to the Internet (Mahindru and Singh, 2017; Meidan et al., 2017).
In detecting malware within the IoT environment, Bragen (2015) investigated both supervised and unsupervised machine learning approaches to detect attacks on IoT-generated data such as spoofing, eavesdropping and jamming. HaddadPajouh et al. (2018) used three different configurations of Long Short Term Memory (LSTM), a type of Recurrent Neural Network (RNN) architecture. The results showed that the second configuration, with two-layer neurons, achieved the highest accuracy of 98.2%. Although various machine learning and deep learning approaches have been used in malware detection, the literature has shown that the domain has evolved from email to mobile devices and, most recently, to IoT devices (Lu et al., 2003).
In order to address the gap in providing adequate protection systems for IoT-based applications and smart devices, this research provides a preliminary analysis of malware detection for operation code (Opcode) sequences within the IoT environment as a performance benchmark for future work. The remainder of this paper is organized as follows. Section 2 presents the materials and methods along with the validation methods and algorithms. Section 3 presents the results, Section 4 discusses the results and finally Section 5 concludes with future work.

Materials and Methods
In detecting malware within IoT-based applications, a classification methodology is adopted to predict categorical class labels (malware vs. benign) from the operational code (Opcode) sequence dataset. The classification experiments are carried out on training and testing datasets in order to classify newly available data (Allahyari et al., 2017). The classification methodology is shown in Figure 1. The following sub-sections detail the dataset, pre-processing, model validation, algorithms and evaluation metrics.

Dataset
This research focuses on malware detection within data generated by IoT-based applications. With regard to the Raspberry Pi II, it is worth noting that ARM processors have been widely used in cloud edge devices, hence qualifying the Raspberry Pi II as an IoT cloud edge device. The dataset used in this research was sourced from the Linux Debian package repositories (https://pkgs.org/).
The dataset is based on 32-bit ARM-based malware within the VirusTotal Threat Intelligence platform as of 30 September 2017, in the form of Executable and Linkable Format (ELF) files. ELF is used because it defines the structure of binaries, libraries and core files, and plays a role in program linking and execution. Since ELF features are considered static features, higher accuracy in malware detection is expected. Analyzing ELF is also important as it gives a generic understanding of how an operating system works during software development.
Following HaddadPajouh et al. (2018), a Linux bash script was written to extract the sequence of Opcodes in each sample of the dataset. After extracting the ELF files using the Debian bundle, the dataset provided 280 malware and 270 benign program samples. Next, the Object-Dump tool was used to disassemble all samples and extract the Opcode sequence in each sample. Fig. 2 and Fig. 3 show excerpts of malware and benign samples.
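The extraction step can be illustrated with a short Python sketch. It is not the bash script used in this research; it assumes the disassembly text follows the usual tab-separated layout of objdump output (address, raw bytes, mnemonic, operands), and the function name `extract_opcodes` and the sample listing are hypothetical.

```python
from typing import List


def extract_opcodes(disassembly: str) -> List[str]:
    """Return the mnemonic of every instruction line in objdump-style text."""
    opcodes = []
    for line in disassembly.splitlines():
        # Assumed layout: "   10400:<TAB>e92d4800 <TAB>push<TAB>{fp, lr}"
        parts = line.split("\t")
        if len(parts) < 3 or not parts[0].strip().endswith(":"):
            continue  # skip section headers, symbol labels and blank lines
        fields = parts[2].split()
        # Skip assembler directives such as ".word", which mark embedded data.
        if fields and not fields[0].startswith("."):
            opcodes.append(fields[0])
    return opcodes


# Hypothetical ARM listing used only to exercise the parser.
sample = (
    "00010400 <main>:\n"
    "   10400:\te92d4800 \tpush\t{fp, lr}\n"
    "   10404:\te28db004 \tadd\tfp, sp, #4\n"
    "   10408:\te3a00000 \tmov\tr0, #0\n"
    "   1040c:\te8bd8800 \tpop\t{fp, pc}\n"
    "   10410:\t00010000 \t.word\t0x00010000\n"
)
print(extract_opcodes(sample))
```

Only the mnemonic is kept, so operands and addresses do not influence the resulting sequence.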

Pre-Processing
After disassembling, the extracted opcode sequences are pre-processed through several steps, which include normalizing, centering and scaling. Python code is used to convert the opcodes into an Excel file with rows of opcodes and columns of features before they are split into training and testing sets. Normalizing has several meanings and is used informally in statistics; here it refers to removing the unit of measurement from the data, which makes data from different sources easier to compare.
For many types of data, centering and scaling are intertwined. Centering corresponds to a subtraction of a reference vector (often represented by the mean values of the variables or the settings of the setpoint). Scaling corresponds to a multiplication by a vector. The choice of scaling vector is crucial (Bro and Smilde, 2003).
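As a minimal sketch of centering and scaling, the example below subtracts the column mean (the reference vector) and divides by the column standard deviation (the scaling vector). The choice of the standard deviation as the scaling vector, the function name `center_and_scale` and the small count matrix are assumptions made for illustration; the paper does not state which scaling vector was used.

```python
from statistics import mean, pstdev


def center_and_scale(rows):
    """Centre each column on its mean and scale it by its standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead to avoid
    # a division by zero (e.g. an opcode that never occurs in any sample).
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical opcode counts: one row per sample, one column per opcode.
counts = [[4, 0, 2], [2, 0, 6], [0, 0, 4]]
scaled = center_and_scale(counts)
for row in scaled:
    print([round(value, 3) for value in row])
```

After this transformation every non-constant column has zero mean and unit variance, so no single frequently used opcode dominates a distance-based classifier.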

Model Validation
The anomaly detection or malware classification experiments were carried out using the k-fold validation method for training and testing as shown in Fig. 4. In the ten-fold validation setting, eight folds were used for training, one for validation and the rest for testing. Following Davis and Goadrich (2006), a confusion matrix summarizing all 10 experiments was derived, analyzed and reported. As shown in Fig. 4, the k-fold cross-validation method divides n samples into k groups of n/k samples each. When one group is chosen for testing, the other k-1 groups are used for training, after which the test group is switched in every cycle. In this way, the performance of the classifier can be determined by calculating the average error over the k cycles (Varoquaux, 2018; Zhang et al., 2016).
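The fold rotation can be sketched as follows. This is an illustrative helper, not the code used in the experiments: the name `k_fold_indices` is hypothetical, shuffling and class stratification are omitted for brevity, and the sample count of 550 simply reflects the 280 malware and 270 benign samples described earlier.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=550, k=10))
for cycle, (train, test) in enumerate(folds, start=1):
    print(cycle, len(train), len(test))
```

Averaging the error measured on the test indices of each cycle gives the cross-validated estimate described above.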

Algorithms
Three machine learning algorithms are used in the preliminary experiments: k-Nearest Neighbor (k-NN), Support Vector Machines (SVM) and Random Forest. All the algorithms were implemented using Anaconda Navigator, TensorFlow, the scikit-learn machine learning library, Jupyter Notebook and other Python tools. All three algorithms have served as benchmark machine learning algorithms in many malware or anomaly detection problems in Internet-of-Things (IoT) devices (Hasan et al., 2019; Nakhodchi et al., 2020; Darabian et al., 2020), networks (Kumar and Lim, 2019) and services (Ham et al., 2014; Sethi et al., 2017; Tien et al., 2020). The other reason is that these algorithms are more efficient with a small dataset than methods such as deep learning, which need big data, and the dataset of this research is considered small (Gislason et al., 2003; Noi and Kappas, 2018; Wang et al., 2018a).

k-Nearest Neighbor (k-NN)
k-NN is an algorithm that determines the class of a sample by finding the most frequent class among its k nearest training samples in the feature space (Gupta and Mittal, 2018; Wang et al., 2018b). Given a set of features and classes $(x_1, y_1), \ldots, (x_n, y_n)$, where the features $x_i \in \mathbb{R}^d$ and the classes $y_i \in Y$ for a given i, k-NN ranks the neighbors of a test vector among the training samples and uses the class labels of the nearest neighbors to predict the class of the test vector (Allahyari et al., 2017). In other words, k-NN takes the k closest points and assigns the class that receives the majority of votes among them. This algorithm uses the Euclidean distance to measure the resemblance between two vectors (Aburomman and Reaz, 2016). The distance formula used by k-NN is shown in Equation 1:

$d(x, x') = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}$ (1)
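The voting procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments; the function names `euclidean` and `knn_predict`, the toy two-dimensional vectors and the choice k = 3 are assumptions introduced here.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    # Rank the training samples by their distance to the query vector.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    # Majority vote among the labels of the k nearest neighbours.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors, for illustration only.
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
train_y = ["benign", "benign", "benign", "malware", "malware", "malware"]
print(knn_predict(train_x, train_y, [0.1, 0.1]))
print(knn_predict(train_x, train_y, [1.0, 1.0]))
```

An odd value of k avoids tied votes in a two-class problem such as malware versus benign.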

Support Vector Machines (SVM)
SVM is a family of classifier models that is considered one of the most effective methods, with high generalization ability in practice (Huang et al., 2018). In contrast to statistical methods that are based on minimizing empirical risk, SVM is based on minimizing structural risk, which indicates the ability of this algorithm to avoid overfitting. The algorithm works by creating decision hyperplanes that divide the data into two categories with the maximal margin, as shown in Fig. 6. The hyperplane is defined in Equation 2, where w is the weight vector, x is the input feature vector and b is the bias:

$w \cdot x + b = 0$ (2)

The objective of SVM is to find a decision boundary between two classes that allows labels to be predicted from one or more features, in a way that separates the data and maximizes the margin $1/\lVert w \rVert_2$. The data points from each category that lie closest to the boundary are called the closest points (Apostolidis-Afentoulis and Lioufi, 2015).
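To make the role of w and b concrete, the sketch below evaluates a linear decision function of the usual form w · x + b and the corresponding margin. It assumes a linear SVM under the canonical scaling in which the closest points satisfy |w · x + b| = 1, so that their distance to the hyperplane is 1/||w||. The weights are hypothetical and are not fitted to the dataset of this research; training an SVM (finding w and b) is not shown.

```python
import math


def decision_value(w, x, b):
    """Signed score w.x + b; it is zero for points on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Assign +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin(w):
    """Distance from the hyperplane to the closest points, 1 / ||w||."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical weight vector and bias, for illustration only.
w, b = [3.0, 4.0], -5.0
print(classify(w, [2.0, 1.0], b), classify(w, [0.0, 0.0], b), margin(w))
```

Maximizing the margin is equivalent to minimizing the norm of w, which is why a smaller weight vector corresponds to a wider separation between the classes.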

Random Forests (RF)
RF belongs to the group of classification algorithms that are built on decision trees. The algorithm draws different subsets of training data from the original dataset using bootstrap sampling, then creates k decision trees by training on these subsets, and finally builds a random forest from the decision trees as shown in Fig. 7 (Chen et al., 2016). RF has the least error in classifying data when compared with other traditional tree-based methods. Its parameters are the number of trees, the minimum node size and the number of features used to split each node. It has several advantages: once the forest has been built it can be reused in the future, and RF is able to overcome overfitting (Farnaaz and Jabbar, 2016). An RF algorithm can be formalized as in Equation 3, where x is the sample, y is the feature variable of s, n is the number of samples, m is the feature variable for each sample, i = 1,2,…, n and j = 1,2,…, m.
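Two of the steps described above, bootstrap sampling and combining the trees' outputs, can be sketched as follows. The training of the individual decision trees is deliberately left out, and majority voting is assumed as the way the forest combines its trees; the function names, the fixed random seed and the toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(samples, labels, rng):
    """Draw len(samples) items with replacement from the training set."""
    picks = [rng.randrange(len(samples)) for _ in samples]
    return [samples[i] for i in picks], [labels[i] for i in picks]


def majority_vote(predictions):
    """Combine the per-tree predictions for one sample into a single label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [[1, 0], [0, 1], [1, 1], [0, 0]]
labels = ["malware", "benign", "malware", "benign"]

# Each tree in the forest would be trained on its own bootstrap sample.
boot_x, boot_y = bootstrap_sample(samples, labels, rng)

# Hypothetical outputs of three trees for one test sample.
forest_output = majority_vote(["malware", "benign", "malware"])
print(len(boot_x), forest_output)
```

Because each tree sees a slightly different resampling of the data, their individual errors tend to differ, and voting averages them out.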

Evaluation Metrics
Following Nikam (2015), the evaluation metrics used in the experiments are accuracy, precision, recall and F-measure. The percentages are calculated from the confusion matrix, where each row represents instances of the actual class and each column represents instances of the predicted class. The confusion matrix is laid out as illustrated in Table 1. It breaks down the number of correct and incorrect predictions for each category out of the total number of predictions made after classification (Powers, 2011).
Based on Table 1, a TP means an instance originally labeled as benign is correctly predicted as benign. A TN means an instance originally labeled as malware is correctly predicted as malware. An FP means an instance originally labeled as malware is incorrectly predicted as benign. Finally, an FN means an instance originally labeled as benign is incorrectly predicted as malware. Equations 4-7 show the formulas for calculating the evaluation metrics.
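Assuming the four metrics take their standard definitions, they can be written in terms of the confusion matrix entries as follows. The ordering shown here (accuracy, precision, recall, F-measure) is an assumption and may differ from the original equation numbering.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F-measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```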

F-Measure
F-Measure is the harmonic mean of precision and recall, which is a very useful measure of prediction success when the classes are imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned (Sabharwal and Sedghi, 2017). Precision, recall and F-measure are measured because accuracy alone can be misleading. The confusion matrix serves as a way of describing the breakdown of errors in predictions for an unseen dataset. Precision gives the exactness of a model while recall gives its completeness. Finally, the F-measure or F1 score gives the balance between the two.
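The point that accuracy alone can be misleading is easy to see with a small calculation. The sketch below computes the four metrics from confusion matrix counts using their standard definitions; the function name `classification_metrics` and the counts are hypothetical and are not results from this research.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical imbalanced case: 5 positives, 95 negatives, and a model that
# predicts the negative class for every instance.
always_negative = classification_metrics(tp=0, tn=95, fp=0, fn=5)
print(always_negative)
```

In this hypothetical case the accuracy is 0.95 even though the model never identifies a single positive instance, which the recall and F-measure of 0 make visible.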

Results
The purpose of the experiments is to compare the performance of three algorithms: k-Nearest Neighbor (k-NN), Support Vector Machines (SVM) and Random Forest (RF). The full results for accuracy, precision, recall and F-measure are shown in Table 2. Next, the results in Table 2 are analyzed based on the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Both types of curves play a fundamental role in understanding the behavior of various systems in the presence of uncertainty. These curves have been used in several areas, such as radiology and electrical engineering, to evaluate the performance of a binary prediction system as a function of a control parameter. By varying the control parameter, it is possible to increase the accuracy and reduce the false positive rate of the system at the cost of lower recall, which is the true positive rate or sensitivity (Pavlick et al., 2015; Ekelund, 2017).
The Area Under Curve (AUC) will be used as a summary of the model skill. The model skill will be compared against a no-skill classifier, which is the one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. A model with no-skill is represented at the point (0.5, 0.5). A model with no-skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5. Table 3 summarizes the results of AUC for both ROC and PR curves across all three algorithms.

Receiver Operating Characteristic (ROC) Curve
A Receiver Operating Characteristic (ROC) curve summarizes the trade-off between TP rate and FP rate for a predictive model using different probability thresholds. It has two dimensions where the x-axis indicates the False Positive (FP) rate and the y-axis indicates the True Positive (TP) rate (Grau et al., 2015). Fig. 8 shows the ROC curve for k-Nearest Neighbor (k-NN). The ROC AUC is 0.959 with no-skill AUC at 0.500. Fig. 9 shows the ROC curve for Support Vector Machines (SVM). The ROC AUC is 0.888 with no-skill AUC at 0.500. Meanwhile, Fig. 10 shows the ROC curve for Random Forest (RF). The ROC AUC is 0.981 with no-skill AUC at 0.500.
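The construction of a ROC curve and its AUC can be sketched as follows: the threshold is swept over the predicted scores, one (FP rate, TP rate) point is recorded per threshold, and the area is obtained with the trapezoidal rule. The scores and labels are invented for illustration, with 1 denoting the positive class, and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs from sweeping the threshold over all scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(round(auc(curve), 3))
print(auc([(0.0, 0.0), (1.0, 1.0)]))  # the no-skill diagonal
```

The diagonal from (0, 0) to (1, 1) yields an area of 0.5, which is the no-skill reference value quoted alongside each ROC AUC.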

Precision-Recall Curve
A Precision-Recall (PR) curve summarizes the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a predictive model using different probability thresholds. It plots precision on the y-axis against recall on the x-axis. Fig. 11 shows the PR curve for k-Nearest Neighbor (k-NN). The F-measure is 0.864 and the PR AUC is 0.960. Fig. 12 shows the PR curve for Support Vector Machines (SVM). The F-measure is 0.802 and the PR AUC is 0.885. Finally, Fig. 13 shows the PR curve for Random Forest (RF). The F-measure is 0.925 and the PR AUC is 0.983.
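A PR curve is built in the same threshold-sweeping manner as a ROC curve, except that each point records recall and precision. The sketch below uses invented scores and labels (1 denotes the positive class) and a hypothetical helper named `pr_points`; it is meant only to show how the points arise.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs for every distinct score threshold."""
    positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        # tp + fp is at least 1 because the threshold is itself one of the scores.
        points.append((tp / positives, tp / (tp + fp)))
    return points


# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
for recall, precision in pr_points(scores, labels):
    print(round(recall, 3), round(precision, 3))
```

At the lowest threshold every instance is predicted positive, so recall reaches 1 and precision falls to the share of positive instances in the data, which is the level a no-skill classifier would achieve.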

Discussion
The preliminary analysis was carried out based on the Area Under Curve (AUC) of two curves: the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. AUC is well suited as a summary measure because an ideal classifier, one that separates the data into two classes without any errors in prediction and therefore without any false positives, attains the maximum AUC. Another benefit of using the ROC and PR curves together is that points that are close or shared between them can be identified, giving the best evaluation of the models used in this research as shown in Table 3.
In both the ROC and PR curves, a threshold defines the cut-off point in predicted probability between the positive and negative classes; by default, the threshold for any classifier is set to 0.5, the midpoint of the output range between 0 and 1. A classifier that is unable to distinguish between the positive and negative classes is represented by the diagonal line from the point (0, 0), where the false positive rate and the true positive rate are both 0 because every instance is predicted as negative, to the point (1, 1), where both rates are 1 because every instance is predicted as positive. Points on this line therefore indicate no skill in distinguishing between the positive and negative classes. A perfect classifier, by contrast, corresponds to the point (0.0, 1.0), with a false positive rate of 0 and a true positive rate of 1.
The performance of malware detection on the dataset depends on choosing an appropriate model for the dataset. For the k-Nearest Neighbor model, the ROC AUC on the dataset is about 0.903, which is much better than a no-skill classifier with a score of about 0.500. For SVM, the ROC AUC on the dataset is about 0.903, which is much better than a no-skill classifier with a score of about 0.500. Finally, for RF, the ROC AUC on the dataset is about 0.981, which is much better than a no-skill classifier with a score of about 0.500.
The results showed that the operational code (Opcode) sequence dataset generated from IoT sensors is highly useful in developing a malware detection model within the Internet of Things environment. The accuracy rates are considerably high, which indicates the possibility of developing and using machine learning methods with real data from the Internet of Things. The challenge facing the Opcode sequence dataset is that not every sample contains all of the opcodes in its feature vector, hence some features may have a zero value. Therefore, using word embedding techniques to convert each sample into a numerical sequence representation may be required (Puthal et al., 2016).

Conclusion
This paper presented a preliminary analysis of malware detection models within the scope of Internet-of-Things (IoT) applications. The dataset used is in the form of operational code (Opcode) sequences generated from IoT-based devices (HaddadPajouh et al., 2018). Three machine learning algorithms were constructed and compared: k-Nearest Neighbor (k-NN), Support Vector Machines (SVM) and Random Forest (RF). The experimental results showed that RF outperformed both k-NN and SVM with 98% detection accuracy as compared to 91% for both k-NN and SVM. These results are supported by analysis of the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve, which showed that the best method used in this study is Random Forest, with the highest accuracy of 0.98, a ROC AUC of 0.981 and a PR AUC of 0.983.
The results from this preliminary analysis will be used as benchmark results for exploring deep learning methods with the same or similar dataset from IoT environment. It is hoped that these detection models will be embedded in the IoT application in order to secure the systems from malware attacks.