Representation Learning for Malware Classification

Jeffrey Johns

Representation learning algorithms automate the manual and often tedious task of designing features for a machine learning problem. These algorithms have achieved state-of-the-art results on challenging problems in the image, speech, and text domains, but they require access to a large, labeled dataset. Malware classification is one area where the information security industry has access to large volumes of labeled data. We demonstrate that representation learning algorithms, specifically convolutional neural networks (CNNs) operating directly on raw bytes, can achieve results on par with traditional machine learning models for detecting malware. We also evaluate the incremental benefit of including hand-engineered expert features in a CNN model. Lastly, we show a few examples of features learned in the lowest convolutional layer of the model.
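
To make the phrase "CNNs operating directly on raw bytes" concrete, the following is a minimal sketch of a forward pass in plain Python. It is not the architecture evaluated in this work. The embedding size, filter count, filter width, and the function names (init_params, conv_max_pool, predict_malicious) are all assumptions chosen for illustration, and the weights are random rather than trained. Each byte is mapped to a small vector, a bank of 1-D filters slides over the sequence, the strongest response of each filter is kept, and any hand-engineered expert features are concatenated before a logistic output.

```python
import math
import random


def init_params(embed_dim=4, num_filters=3, width=5, num_expert=0, seed=0):
    """Randomly initialise a toy model (all sizes are illustrative assumptions)."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    return {
        # One vector per possible byte value (0..255).
        "embed": [[rand() for _ in range(embed_dim)] for _ in range(256)],
        # Each filter spans `width` consecutive byte embeddings.
        "filters": [
            [[rand() for _ in range(embed_dim)] for _ in range(width)]
            for _ in range(num_filters)
        ],
        "bias": [0.0] * num_filters,
        # Output weights over pooled filter responses plus expert features.
        "out_w": [rand() for _ in range(num_filters + num_expert)],
        "out_b": 0.0,
    }


def conv_max_pool(embedded, filters, bias):
    """1-D convolution with ReLU, then global max pooling over positions."""
    width = len(filters[0])
    pooled = []
    for filt, b in zip(filters, bias):
        best = 0.0  # ReLU output is never below zero
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            act = b + sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def predict_malicious(data, params, expert_features=()):
    """Return a score in (0, 1) for a raw byte string (untrained weights)."""
    embedded = [params["embed"][b] for b in data]  # iterating bytes yields ints
    learned = conv_max_pool(embedded, params["filters"], params["bias"])
    features = learned + list(expert_features)
    if len(features) != len(params["out_w"]):
        raise ValueError("expert feature count does not match the model")
    logit = params["out_b"] + sum(
        w * f for w, f in zip(params["out_w"], features)
    )
    return 1.0 / (1.0 + math.exp(-logit))
```

As a usage example, predict_malicious(b"MZ\x90\x00" * 8, init_params(num_expert=2), expert_features=[0.3, 1.0]) scores a short byte string together with two hypothetical expert features. Passing no expert features with num_expert=0 gives the raw-bytes-only variant, which is the comparison the abstract refers to as the incremental benefit of expert features.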