Malware Detection Based on Hybrid Signature Behaviour Application Programming Interface Call Graph

Problem statement: Malware is software that has malicious intent. Malware authors now apply sophisticated techniques such as packing and obfuscation to avoid detection, which makes zero-day attacks and false positives the most challenging problems in the malware detection field. Approach: In this study, the static and dynamic analysis techniques used in malware detection are surveyed. Static analysis techniques, dynamic analysis techniques and their combinations, including Signature-Based and Behaviour-Based techniques, are discussed. Results: In addition, a new malware detection framework is proposed. Conclusion: The proposed framework combines Signature-Based with Behaviour-Based techniques using an API call graph system. The goal of the proposed framework is to improve the accuracy and the scan time of malware detection.


INTRODUCTION
Malware stands for malicious software: software that is designed with harmful intent. It comes in many forms, such as Viruses, Worms, Trojan horses, Backdoors, Spyware, Rootkits and Botnets, in addition to other types of software with unwanted behavior (Wang, 2006).
The following are brief descriptions of each of the above-mentioned malware types (Wang, 2006):
• Viruses are programs that self-replicate within a host by attaching themselves to programs and/or documents that become carriers of the malicious code
• Worms are programs that self-replicate across a network
• Trojan horses masquerade as useful programs, but contain malicious code to attack the system or leak data
• Backdoors open the system to external entities by subverting the local security policies to allow remote access and control over a network
• Spyware is a useful software package that also transmits private user data to an external entity
• Rootkits are collections of tools often used by an attacker after gaining administrative privileges on a host
• Botnets are remotely controlled software that comprises a collection of autonomous software tools
A malware detector is a system that attempts to identify malware using signatures and other heuristic techniques; an antivirus scanner is an example of a malware detector (Wang, 2006). The malware writer (hacker), on the other hand, applies sophisticated techniques to evade detection by modifying or morphing malware using packing techniques and/or program obfuscation. Two common obfuscation techniques are Polymorphism and Metamorphism (You and Yim, 2010).
The malware detector attempts to help protect the system by detecting malicious behavior. It may or may not reside on the same system it is trying to protect. Malware detectors take two inputs:
• Knowledge of the malware signature or behavior (learning)
• The program under inspection
Once the malware detector has the knowledge of what is considered malware behavior (abnormal behavior) and the program under inspection, it can employ its detection technique to decide whether the program is malware or benign.

MATERIALS AND METHODS
Malware analysis can be categorized into two main categories:
• Static analysis: analysis of the infected file without executing it. In this approach, we extract low-level information such as Control Flow Graphs (CFGs), Data-Flow Graphs (DFGs) and system call information. This information can be gathered by disassembling or decompiling the infected file using tools like IDA Pro (Riesen and Bunke, 2009). It is sometimes better to analyze the infected file in a different environment to avoid automatic execution of the malware. Static analysis is fast and safe, produces few false positives and traces all execution paths, which yields a large amount of information to analyze. On the other hand, static analysis may fail on unknown malware that uses code obfuscation techniques (Egele et al., 2011)
• Dynamic analysis: analysis of the infected file during its execution. Dynamic analysis executes the infected file in a simulated environment (a debugger, a virtual machine or an emulator) to analyze its malicious functions. The analysis environment must be invisible to the malware, because malware writers use anti-virtual-machine and anti-emulation tools to hide the malicious functions when the malware detects that it is under analysis. Dynamic analysis fails to detect activities of interest if the target changes its behavior depending on trigger conditions, such as the existence of a specific file or a specific day, because only a single execution path is examined in each run (Egele et al., 2011)
Techniques: There are mainly two techniques for malware detection: Signature-Based and Behavior-Based techniques (Table 1-2). In signature-based techniques, a sequence of instructions unique to a malware is used to generate a malware signature, which is captured by researchers in a laboratory environment (Goertzel, 2009). A signature should be able to identify any malware exhibiting the malicious behavior specified by the signature. Most antivirus scanners are signature-based.
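As a minimal sketch of signature-based scanning, assume that a signature is simply a byte sequence searched for in the file contents. The signature names and byte patterns below are hypothetical placeholders, not real malware signatures, and real scanners use considerably richer signature formats.

```python
# Hypothetical signature database: name -> byte pattern (placeholders only).
SIGNATURES = {
    "Example.Family.A": bytes.fromhex("deadbeef00ff"),
    "Example.Family.B": b"\x90\x90\xeb\xfe",
}


def scan(data, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


clean = b"MZ" + b"\x00" * 32
infected = clean + bytes.fromhex("deadbeef00ff") + b"\x01\x02"
print(scan(clean))     # []
print(scan(infected))  # ['Example.Family.A']
```

The sketch also shows why packing and obfuscation defeat this technique: any change to the matched bytes makes the pattern disappear.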
Behavior-based detection techniques focus on analyzing the behavior of known and suspected malicious code. Such behaviors include factors such as the source and destination addresses of the malware, the attachment types in which they are embedded and statistical anomalies in malware infected systems (Goertzel, 2009). One example of a behavior-based detection approach is the histogram-based malicious code detection technology patented by Symantec.
Related work: Malware detection is divided into two methods, Signature-Based and Behavior-Based techniques, and each technique can be applied using static analysis, dynamic analysis or hybrid analysis (Idika and Mathur, 2007). Fig. 1-3 show the organization of malware detection.
Signature-based detection without executing the suspected file (static analysis) was the first approach to malware detection. Researchers applied different techniques to improve the detection rate. Some applied Objective-Oriented Association (OOA) mining based classification (Ye et al., 2008). Their model consisted of three major modules: a PE parser, an OOA rule generator and a rule-based classifier. They later extended this work with post-processing techniques for associative classification based on the analysis of Application Programming Interface (API) execution calls (Ye et al., 2010). Other researchers combined the signature-based technique with a genetic algorithm, but their study focused on three types of malware: viruses, worms and Trojan horses (Zolkipli and Jantan, 2010). Signature-based detection has also been applied during execution of the suspected file (dynamic analysis), where the researchers trace API calls and then build a signature for the suspected file (Nair et al., 2010). These researchers generated a signature for an entire malware class rather than for individual malware samples. Once a base signature for a particular metamorphic generator has been generated, all the metamorphic viruses created with that tool are easily detected.
Most existing work relies on behavior-based detection, with some researchers applying static analysis and others dynamic analysis. Some work focuses on kernel memory mapping to develop a malware behavior monitor that uses a temporal view of kernel objects in the analysis of kernel execution traces (Rhee et al., 2010). Other work focuses on avoiding false positives by considering behaviors that malware usually does not exhibit but that installers and uninstallers do (Fukushima et al., 2010). Park et al. (2010) propose a new malware classification method based on maximal common subgraph detection.

Fig. 3: Proposed framework
More recent work combines static analysis with dynamic analysis to overcome the limitations of each method. Guo et al. (2010) proposed a framework that combines static and dynamic binary translation features to detect malware and prevent its execution. They build a behavior Control Flow Graph (CFG) and then generate a critical API graph from the CFG, on which sub-graph matching is performed. Other researchers build a signature CFG and match graphs using edit distance.
As the previous paragraph shows, some malware researchers now focus on graphs (control flow graphs, call graphs, code graphs). They build their graphs in different ways and analyze and compare them using different methods. Most researchers represent graph nodes as system calls. For example, Lee et al. (2010) transform a PE file into a call graph whose nodes are system calls and whose edges follow the system call sequence. The call graph is then reduced to a code graph to speed up analysis and comparison. Park et al. (2010) use a similar representation based on 4-tuples, in which nodes correspond to system calls, edges to the dependency between two system calls, and labels are attached to nodes and edges. Other researchers define graph nodes as kernel objects rather than system calls (Park and Reeves, 2011). Kostakis et al. (2011), on the other hand, build the graph with subroutines as nodes and their call references as edges. Kim and Moon (2010) use a dependency graph in which each vertex represents a line of the semantic code and the dependency between two lines is represented by a directed edge. Bai et al. (2009) and Guo et al. (2010) extract a Critical API Graph (CAG) from the Control Flow Graph (CFG) of each malware sample to define its behavior.

RESULTS AND DISCUSSION
The above works compare graphs using different graph matching techniques. Some use the maximal common subgraph (Kim and Moon, 2010;Park and Reeves, 2011), some use Weighted Common Behavioral Graph generation based on an approximate algorithm and others build a formula from the intersection and the union of the graphs (Lee et al., 2010). All of them, however, are costly in time and space because of the NP-completeness of the problem.
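To illustrate a similarity formula built from the intersection and the union of two graphs, the sketch below assumes a Jaccard-style ratio over edge sets. The exact formula of Lee et al. (2010) is not reproduced in this text, so this form is an assumption, and the system call names are illustrative only.

```python
def graph_similarity(edges_a, edges_b):
    """Ratio of shared edges to all distinct edges of two graphs (0.0 to 1.0)."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


# Edges are (caller, callee) pairs of illustrative system call names.
g1 = {("NtOpenFile", "NtReadFile"), ("NtReadFile", "NtClose")}
g2 = {("NtOpenFile", "NtReadFile"), ("NtReadFile", "NtWriteFile")}

# One shared edge out of three distinct edges.
print(graph_similarity(g1, g2))
```

A set-based comparison like this only applies directly when nodes carry comparable labels, as system call names do.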

Proposed framework:
Since each technique has advantages and disadvantages, we believe that combining them can strengthen the advantages and reduce the disadvantages. Static techniques give fast and safe results; by also applying unpacking tools to deal with packing and analyzing the file with both signature-based and behavior-based methods, we can get better and more efficient results. Where static techniques fail, dynamic techniques can be used for further analysis of the file.
Framework component: Execute the PE file and collect API calls: Execute the suspected file in a safe (apply rootkit) and controlled environment and use kernel hooking to extract the API calls, after unpacking the file if it is packed. We will trace different paths in the file.
Construct the hybrid call graph: We build our graph from the API calls collected during the execution of the file. Our graph differs from other researchers' graphs in that its nodes are built from the API calls and the operating system resources used by those API calls, and its edges represent the references between the nodes. Our nodes have two attributes, the API call and the operating system resource, and a node is labeled with the API call itself or the operating system resource.
The construction of the API call graph for a program that uses no operating system resources is very simple. In programs that do use operating system resources, a reference to an operating system resource may represent invocations of several other distinct API calls. To resolve all the questions that arise from such references in an API call, we need to know all the other API calls associated with that operating system resource (Ryder, 1979).
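The construction step might be sketched as follows, assuming the execution trace is available as a list of (API call, resource) pairs. The trace format, the function names and the call-sequence edges between consecutive API calls are assumptions made for illustration; the framework description above only fixes that nodes are API calls and operating system resources and that edges are references between them.

```python
from collections import defaultdict


def build_hybrid_graph(trace):
    """Build a graph from (api_call, resource) events in execution order.

    Returns (nodes, edges): nodes maps a label to "api" or "resource";
    edges maps a (source, target) pair to the number of times it was seen.
    """
    nodes = {}
    edges = defaultdict(int)
    previous_api = None
    for api, resource in trace:
        nodes[api] = "api"
        if resource is not None:
            nodes[resource] = "resource"
            edges[(api, resource)] += 1      # API call references a resource
        if previous_api is not None:
            edges[(previous_api, api)] += 1  # assumed call-sequence edge
        previous_api = api
    return nodes, dict(edges)


def apis_for_resource(edges, nodes, resource):
    """All API calls that reference the given operating system resource."""
    return sorted(src for (src, dst) in edges
                  if dst == resource and nodes.get(src) == "api")


# Hypothetical trace; the API names and paths are illustrative only.
trace = [
    ("CreateFileW", r"C:\temp\a.tmp"),
    ("WriteFile", r"C:\temp\a.tmp"),
    ("RegSetValueExW", r"HKCU\Software\Example"),
    ("CloseHandle", None),
]
nodes, edges = build_hybrid_graph(trace)
print(apis_for_resource(edges, nodes, r"C:\temp\a.tmp"))  # ['CreateFileW', 'WriteFile']
```

The helper `apis_for_resource` answers the question raised above: given one operating system resource, which API calls are associated with it.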
Reduce the constructed graph: The graph generated in the previous step contains a huge number of nodes and edges and needs to be minimized. This is done by removing unused instructions (junk code, computation) and focusing on the popular API calls used by the majority of malware.
We can use the information stored in the nodes (API call, operating system resources) to build our API call graph database.
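One way to picture the reduction step is a filter that keeps only API nodes from a list of popular calls, together with the resources they reference. The input format, the function name and the API list below are hypothetical, and the sketch does not reconnect call sequences that passed through removed nodes.

```python
def reduce_graph(nodes, edges, popular_apis):
    """Keep popular API nodes, the resources they reference, and edges between kept nodes."""
    kept = {label for label, kind in nodes.items()
            if kind == "api" and label in popular_apis}
    kept |= {dst for (src, dst) in edges
             if src in kept and nodes.get(dst) == "resource"}
    small_nodes = {label: nodes[label] for label in kept}
    small_edges = {(src, dst): count for (src, dst), count in edges.items()
                   if src in kept and dst in kept}
    return small_nodes, small_edges


# Hypothetical input graph and list of popular API calls.
nodes = {
    "CreateFileW": "api", "WriteFile": "api", "GetTickCount": "api",
    r"C:\temp\a.tmp": "resource",
}
edges = {
    ("CreateFileW", r"C:\temp\a.tmp"): 1,
    ("CreateFileW", "GetTickCount"): 1,
    ("GetTickCount", "WriteFile"): 1,
    ("WriteFile", r"C:\temp\a.tmp"): 1,
}
popular = {"CreateFileW", "WriteFile"}
small_nodes, small_edges = reduce_graph(nodes, edges, popular)
print(sorted(small_nodes))
```

Here `GetTickCount` stands in for a call that is not on the popular list and is dropped along with its edges.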

Finding matching graphs: Graph Edit Distance (GED) is considered one of the best algorithms for inexact graph matching (Gao et al., 2010;Riesen et al., 2010), but its complexity makes it slow (Riesen and Bunke, 2009). To speed up GED, we need to find an assignment between the nodes of the two compared graphs. For this assignment problem, we build a cost matrix over the API call and operating system resource nodes of the two compared API graphs and then apply an assignment algorithm (Munkres' algorithm) (Munkres, 1957;Riesen and Bunke, 2009) to assign the nodes of one graph to the nodes of the other at minimal cost. One difficulty with computing GED through an assignment algorithm is that it relies on a minimum cost matrix for the API call and operating system resource nodes and edges in which the costs are assumed to be fixed values (He and Singh, 2006); to minimize the cost, nodes and edges with more similar neighbors should be matched (Hu et al., 2009). Hu et al. (2009) developed a modified Hungarian algorithm based on neighbor matching. Because our call graph is based on both structure and attributes, we will minimize the cost matrix by partitioning the data graph into sub-graphs based on structural connectivity and attribute connectivity (Zhu et al., 2011).
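The assignment formulation can be sketched as follows for node costs only. Several simplifications are assumptions of this sketch and not part of the proposed framework: the cost values are hypothetical, edge costs are ignored, the smaller graph is padded with missing nodes, and the optimal assignment is found by exhaustive search over permutations in place of Munkres' algorithm, which returns the same optimum in polynomial time.

```python
from itertools import permutations

# Hypothetical cost of deleting or inserting a node; the text fixes no values.
DELETE_OR_INSERT = 1.0


def node_cost(a, b):
    """Cost of mapping node a to node b; None stands for a missing node."""
    if a is None and b is None:
        return 0.0
    if a is None or b is None:
        return DELETE_OR_INSERT
    api_a, res_a = a
    api_b, res_b = b
    # Half the cost for a differing API call, half for a differing resource.
    return 0.5 * (api_a != api_b) + 0.5 * (res_a != res_b)


def approx_ged(nodes_a, nodes_b):
    """Minimum-cost node assignment between two graphs (edges ignored)."""
    size = max(len(nodes_a), len(nodes_b))
    a = list(nodes_a) + [None] * (size - len(nodes_a))
    b = list(nodes_b) + [None] * (size - len(nodes_b))
    cost = [[node_cost(x, y) for y in b] for x in a]
    best_cost, best_perm = float("inf"), None
    # Exhaustive search; only feasible for very small graphs.
    for perm in permutations(range(size)):
        total = sum(cost[i][perm[i]] for i in range(size))
        if total < best_cost:
            best_cost, best_perm = total, perm
    mapping = [(a[i], b[best_perm[i]]) for i in range(size)]
    return best_cost, mapping


# Nodes are (API call, operating system resource type) pairs, illustrative only.
g1 = [("CreateFileW", "file"), ("WriteFile", "file"), ("RegSetValueExW", "registry")]
g2 = [("CreateFileW", "file"), ("WriteFile", "network")]
distance, mapping = approx_ged(g1, g2)
print(distance)  # 1.5
```

In this example one node matches exactly, one differs only in its resource (cost 0.5) and one has no counterpart (cost 1.0), giving a distance of 1.5.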

CONCLUSION
In this study we have shown that signature-based and behavior-based techniques can be combined to build a system that detects polymorphic malware better and scans in less time. We have proposed a new framework that uses an API call graph system to implement this combination, and we have built the system using the dynamic analysis method.

ACKNOWLEDGMENT
The researchers would like to thank University Technology Malaysia for its unlimited support. The significant role of the Sudanese Research Community in UTM is also highly appreciated.