Ethan Rudd, David Krisiloff, Daniel Olszewski, Ed Raff, and James Holt

Efficient Malware Analysis Using Metric Embeddings (pdf, video)

Machine learning-based malware classification has become a key component of modern defense-in-depth strategies, with focus placed on the binary classification task of malware detection. These detection models are typically combined with other toolchains, which provide additional context necessary for triage and remediation, including detection names, capability, and type information. The resulting systems are often complex and interconnected, incurring significant technical debt, infrastructure costs, and inevitable errors.

In this paper, we examine the feasibility of using machine learning to streamline malware analysis pipelines in a manner that minimizes potential risks and costs while preserving flexibility and functionality. To this end, we explore the use of metric learning to embed malicious and benign samples in a low-dimensional vector space with enriched capability information for downstream use in a variety of applications, including detection, family classification, and malware attribute classification.
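The core idea of such an embedding can be sketched as a learned mapping from high-dimensional static PE features down to a compact vector whose geometry reflects similarity. The sketch below is illustrative only: the layer sizes, the single-layer network, and the random weights are stand-ins, not the architecture used in the paper (the 2381-dimensional input merely mirrors the EMBER feature vector size).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions: an EMBER-style PE feature vector (2381 features)
# projected into a small metric-embedding space. Sizes are illustrative.
feat_dim, embed_dim = 2381, 32
W = rng.normal(0.0, 0.02, (feat_dim, embed_dim))  # stand-in for trained weights

def embed(x):
    """Map a raw feature vector to a unit-norm low-dimensional embedding,
    so that distances in the embedding space can encode learned similarity."""
    z = np.maximum(x @ W, 0.0)              # toy one-layer network with ReLU
    return z / (np.linalg.norm(z) + 1e-12)  # L2-normalize the embedding

x = rng.random(feat_dim)  # stand-in for one sample's static features
z = embed(x)              # 32-dimensional representation for downstream tasks
```

In practice the embedder would be a deeper network trained with a metric-learning objective; the point here is only the shape of the interface: high-dimensional features in, a small reusable vector out.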

Specifically, we enrich labeling on malicious and benign PE files from the EMBER dataset using Mandiant’s CAPA tool, an open-source toolchain that uses disassembly and subject-matter-expert (SME) derived rules and heuristics to determine malicious capabilities. Using these CAPA labels, we derive several different types of metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation on malware similarity, and combinations thereof.
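A standard pairwise contrastive loss (in the style of Hadsell et al.) illustrates the first of these objectives: pairs of samples whose capability labels agree are pulled together, and disagreeing pairs are pushed at least a margin apart. This is a minimal sketch, not the paper's exact formulation; the margin value and the binary same/different labels derived from CAPA output are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(za, zb, same_label, margin=1.0):
    """Pairwise contrastive loss: squared distance for similar pairs,
    squared hinge on (margin - distance) for dissimilar pairs.
    `same_label` is 1.0 where a pair shares capability labels, else 0.0."""
    d = np.linalg.norm(za - zb, axis=1)                       # per-pair distance
    pos = same_label * d**2                                   # pull similar pairs together
    neg = (1.0 - same_label) * np.maximum(0.0, margin - d)**2 # push dissimilar pairs apart
    return float(np.mean(pos + neg))

# Toy pairs of 4-dimensional embeddings
za = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
zb = np.array([[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
same = np.array([1.0, 0.0])  # first pair shares capabilities, second does not
loss = contrastive_loss(za, zb, same)
```

Gradients of this loss with respect to the embeddings would be backpropagated through the embedding network during training; the Spearman-rank-correlation variant mentioned above instead supervises the ordering of pairwise similarities rather than individual pairs.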

We then examine performance on a variety of transfer tasks over the EMBER and SOREL datasets. We show that these tasks can be performed using relatively low-dimensional metric embeddings with little decay in performance, which makes retraining fast. The low-dimensional representations also offer the potential to significantly reduce training and storage overhead when retraining or transferring to additional downstream tasks.
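The economics of such transfer can be seen in a toy example: once fixed low-dimensional embeddings exist, a downstream task may need only a very cheap model on top of them. The nearest-centroid classifier below is a deliberately minimal stand-in for the paper's transfer models, and the cluster centroids are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32  # illustrative embedding dimensionality

# Synthetic stand-ins for class clusters in the learned embedding space
centroid_ben = rng.normal(-1.0, 0.1, dim)  # "benign" cluster center
centroid_mal = rng.normal(1.0, 0.1, dim)   # "malware family" cluster center

def nearest_centroid(z, centroids):
    """Classify an embedding by its closest class centroid -- a minimal
    example of a lightweight downstream model over fixed metric embeddings."""
    dists = [np.linalg.norm(z - c) for c in centroids]
    return int(np.argmin(dists))

z_query = centroid_mal + rng.normal(0.0, 0.05, dim)  # sample near the malware cluster
pred = nearest_centroid(z_query, [centroid_ben, centroid_mal])  # 0 = benign, 1 = malware
```

Because each sample is stored as a 32-float vector rather than its full feature representation, both retraining and storage scale with the embedding dimension, which is the overhead reduction the abstract refers to.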