Nahid Farhady

Accenture Cybersecurity Tech Labs

An Effective Framework for Malware Detection and Classification using Feature Prioritization (pptx, video)

Signature-based malware detection is widely used in commercial antivirus products. However, this method fails to detect zero-day malware. Antivirus engines are therefore moving towards finding the shared features and similar behaviors of malware families so that they can detect new samples as well, which in turn means adopting Machine Learning techniques. These techniques have focused on static features for a while. To classify the malware, however, malware engineers still need to go through an extensive process of dynamic analysis, which entails executing the malware in a sandbox and extracting its features. In this research, we propose an end-to-end framework for malware detection and classification using machine learning techniques. In this framework, we use DNN models to separate malware from benign files and to produce an uncertainty score for each classification. Using the uncertainty score, we build another classifier that considers more static features in order to categorize the files with higher accuracy. The purpose of building two models is to accelerate the process: the first uses a small set of features, and the second uses a more extensive set that includes the important strings, functions, and import headers. In the next step, we propose a classification model that divides the malware into cyber crime and cyber espionage based on entropy. Each of these categories can then be classified into up to 10 subcategories using further dynamic analysis. Since there are more than 100 dynamic features and extracting them can be cumbersome, we also built a model to prioritize those features. We use the PCA (Principal Component Analysis) technique to prioritize the dynamic features to be explored for each subcategory. This method accelerates labeling and classification for the malware engineers, which in turn lets them identify quarantine techniques much earlier in the process.
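The two-model idea can be illustrated with a minimal sketch. The abstract does not say how the uncertainty score is computed or how it is used to hand files to the second classifier, so everything here is an assumption: the score is taken to be the binary entropy of the first model's predicted malware probability, and the names `predictive_entropy`, `route`, and the `threshold` value are hypothetical.

```python
import math


def predictive_entropy(p_malware):
    """Binary entropy (in bits) of a predicted malware probability.

    ASSUMPTION: entropy stands in for the uncertainty score; the source
    does not define it. 0.0 means fully confident, 1.0 means a coin flip.
    """
    if p_malware <= 0.0 or p_malware >= 1.0:
        return 0.0
    q = 1.0 - p_malware
    return -(p_malware * math.log2(p_malware) + q * math.log2(q))


def route(p_malware, threshold=0.5):
    """Accept the fast model's verdict, or defer to the richer model.

    ASSUMPTION: the threshold of 0.5 bits is an arbitrary illustrative value.
    """
    if predictive_entropy(p_malware) > threshold:
        # Too uncertain: hand the file to the classifier that uses the
        # larger static feature set (strings, functions, import headers).
        return "second_stage"
    return "malware" if p_malware >= 0.5 else "benign"
```

Under these assumptions, `route(0.99)` keeps the first model's verdict, while `route(0.7)` defers to the second stage because its entropy exceeds the threshold. In practice the threshold would be tuned on validation data to balance speed against accuracy.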
Our research proposes the top 5 dynamic features to be analyzed for each subcategory of malware. To expose our model to several types of malware, we have partnered internally with iDefense, a threat intelligence company that owns a database of 270M malware binaries. What differentiates this work from the previous literature is the type of malware included in our training and testing datasets, on top of simple feature selection to accelerate the detection process. Using the proposed DNN model and only 6 static features, we achieve an FNR of less than 1% with a TPR of over 96%. The prioritization and feature selection effort has shown that the accuracy of malware classification can be boosted by using the appropriate features for each subcategory.
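A PCA-based ranking of features can be sketched as follows. The abstract does not describe how the principal components are turned into a priority list, so this is one simple assumed approach rather than the authors' method: standardize the features, find the first principal component by power iteration, and rank features by the absolute value of their loading on it. The function names and the feature names used with them are hypothetical.

```python
import math
import random


def standardize(rows):
    """Center each column and scale it to unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant column has zero deviation; divide by 1.0 so it stays at zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def first_component(cov, iters=500):
    """Leading eigenvector of the covariance matrix via power iteration."""
    d = len(cov)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def top_features(rows, names, k=5):
    """Rank features by absolute loading on the first principal component.

    ASSUMPTION: using only the first component is a simplification.
    """
    loadings = first_component(covariance(standardize(rows)))
    ranked = sorted(zip(names, loadings), key=lambda p: abs(p[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Given a matrix of dynamic-feature measurements for the samples of one subcategory, `top_features(rows, names, k=5)` returns five feature names to extract first. A fuller version would weight loadings across several components by their explained variance.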