Sunil Vasisht, Philip Tully, and Jay Gibble (Mandiant)

Annotating Malware Disassembly Functions Using Neural Machine Translation

Basic static and dynamic analysis techniques can be used to draw preliminary conclusions during initial malware assessment, but more in-depth analysis is sometimes required to get a comprehensive picture of a binary’s functionality. Malware analysts and reverse engineers try to get closer to malware authors’ original source code by using tools like the NSA’s Ghidra or Hex-Rays’ IDA Pro, which generate low-level assembly language and higher-level, C-like pseudocode. Disassemblers also perform helpful operations like function recognition and auto-naming that can reduce analyst effort during time-sensitive investigations. To recognize and name disassembled functions, IDA utilizes FLIRT code signatures and other plug-ins that enrich known or previously identified function names, allowing them to be shared across time and across users. However, these brittle signatures typically only account for library code added by a compiler. Furthermore, while function names added previously by human analysts are highly precise, their correspondingly low recall means that most functions within a malware sample freshly pulled up in IDA lack semantically meaningful names.

How can we increase function name coverage within binary disassembly in order to accelerate malware triage? By representing disassembly as a structured sequence of input tokens and corresponding ground truth function names as a sequence of target label tokens, we can frame this problem as a neural machine translation (NMT) task. Seq2seq and large language modeling approaches have previously been applied to generating natural language from source code and vice versa, including for use cases such as code summarization, code documentation, variable name prediction, and even auto-completion, as exemplified by recent work like OpenAI’s Codex model [1]. However, these approaches mostly operate on higher-level programming languages like Python and Java, which are shorter in length, easier to read, more linearly ordered, and syntactically richer than machine code.
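
For illustration, the minimal sketch below shows this seq2seq framing with disassembly-derived tokens as the source sequence and function-name subtokens as the target; the plain LSTM architecture, vocabulary sizes, and dimensions are placeholder assumptions rather than the model described here.

import torch
import torch.nn as nn

# Minimal seq2seq sketch: disassembly tokens in, annotation subtokens out.
# All sizes below are illustrative placeholders.
class Seq2SeqAnnotator(nn.Module):
    def __init__(self, src_vocab=20000, tgt_vocab=5000, dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the disassembly token sequence into a context state.
        _, state = self.encoder(self.src_embed(src_tokens))
        # Decode name subtokens conditioned on that state (teacher forcing;
        # a real model would also attend over the encoder outputs).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), state)
        return self.out(dec_out)  # logits over the annotation vocabulary

# Example shapes: a batch of 2 functions, 50 input tokens, 6 name subtokens.
model = Seq2SeqAnnotator()
logits = model(torch.randint(0, 20000, (2, 50)), torch.randint(0, 5000, (2, 6)))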

To transform disassembly into inputs for our NMT model, we instead draw inspiration from previous work that generated sequences from structured representations of machine code. Output from IDA’s decompiler is exposed to analysts via an abstract syntax tree (AST), where AST leaves encode user-defined identifiers and names from the code, and internal AST nodes encode structures like loops, expressions, and variable declarations. As in code2seq [2], we represent ASTs as random paths compressed to fixed-length vectors using a BiLSTM, and concatenate these path embeddings with AST leaf token embeddings during encoding; the model then attends to relevant AST paths during decoding to generate a sequence of annotation predictions. We also consider control flow graph (CFG) output from IDA as a separate input representation, where nodes represent a function’s basic blocks and edges represent control flow instructions. As in Nero [3], we obtain CFGs from disassembled functions, reconstruct and augment call site graphs for each call instruction, and learn sequences of call sites using several competing models, including a graph convolutional neural network.
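
A simplified sketch of the code2seq-style path encoder is shown below: each sampled AST path is compressed by a BiLSTM into a fixed-length vector and concatenated with the embeddings of its two leaf tokens, yielding per-path vectors for the decoder to attend over. The vocabularies, dimensions, and combination layer are illustrative assumptions adapted from [2], not our exact implementation.

import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    def __init__(self, node_vocab=500, leaf_vocab=20000, dim=128):
        super().__init__()
        self.node_embed = nn.Embedding(node_vocab, dim)
        self.leaf_embed = nn.Embedding(leaf_vocab, dim)
        # The BiLSTM compresses a variable-length AST path to a fixed vector.
        self.path_rnn = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        # Combine [start leaf ; path ; end leaf] into one path embedding.
        self.combine = nn.Linear(3 * dim, dim)

    def forward(self, start_leaves, path_nodes, end_leaves):
        # start_leaves, end_leaves: (batch, num_paths)
        # path_nodes: (batch, num_paths, path_len)
        b, p, l = path_nodes.shape
        _, (h, _) = self.path_rnn(self.node_embed(path_nodes.view(b * p, l)))
        path_vec = h.transpose(0, 1).reshape(b, p, -1)  # concat fwd/bwd states
        combined = torch.cat(
            [self.leaf_embed(start_leaves), path_vec, self.leaf_embed(end_leaves)],
            dim=-1,
        )
        return torch.tanh(self.combine(combined))  # (batch, num_paths, dim)

# Example: one function with 200 sampled AST paths of length 9.
enc = PathEncoder()
vecs = enc(torch.randint(0, 20000, (1, 200)),
           torch.randint(0, 500, (1, 200, 9)),
           torch.randint(0, 20000, (1, 200)))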

Our input dataset consists of over 360k disassembly functions and corresponding annotations extracted from 4.3k malicious PE files. Annotations come from a combination of auto-generated IDA function names and a proprietary database of stored metadata representing about a decade’s worth of descriptive function names authored by various industry reverse engineers. Raw annotation strings are tokenized into individual words, and care is taken to normalize and merge tokens to account for the variability in annotation quality between analysts. Our approach builds upon and refines the code2seq and Nero models in several ways. Since IDA ASTs, unlike the ASTs used by code2seq, include data types for leaf values and mappings between AST nodes and decompilation offsets, we consider embedding this information and concatenating it alongside the leaf token embeddings and AST path embeddings [4]. In contrast to Nero, a large majority of our annotations are hand-labeled by SMEs rather than auto-generated by IDA. Additionally, the input files for our models are Windows PE malware, which are more directly applicable in security settings than the Java and C# files used to train code2seq and the benign ELF executables used to train Nero. We also augment our annotations with capabilities detected by the open-source tool capa [5] run over our malware dataset, and consider a host of different input representation configurations and model architectures to optimize validation metrics.
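
As a rough illustration of the annotation cleanup step, the hypothetical helper below splits raw names on underscores and camelCase boundaries, lowercases the pieces, and merges a few common abbreviations; the regex and synonym table are placeholders rather than the exact normalization rules applied to our dataset.

import re

# Illustrative abbreviation merges only; real analyst vocabularies are larger.
SYNONYMS = {"str": "string", "cfg": "config", "init": "initialize"}

def tokenize_annotation(name: str) -> list[str]:
    # Split on underscores and camelCase boundaries, e.g.
    # "DecryptCfgBuffer" -> ["decrypt", "config", "buffer"].
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name.replace("_", " ")).split()
    tokens = [p.lower() for p in parts if p]
    return [SYNONYMS.get(t, t) for t in tokens]

print(tokenize_annotation("DecryptCfgBuffer"))  # ['decrypt', 'config', 'buffer']
print(tokenize_annotation("str_decode_loop"))   # ['string', 'decode', 'loop']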

We quantitatively evaluate our models by computing F1 scores on holdout splits, and perform qualitative evaluation by soliciting feedback on prediction quality directly from reverse engineers. Reverse engineering is an extremely difficult skill to master: instructions for an entire disassembled program can number in the thousands or even millions, and even expert-level reversers can spend hours poring over disassembly to piece together code functionality for more complex malware samples. ASTs and CFGs help normalize differences at the source code level so that commonalities emerge between functions despite variation in individual variables and control structures. Our results indicate that leveraging this syntactic structure using code-to-sequence models allows us to predict meaningful natural language annotations and dramatically reduce the effort surrounding an essential reverse engineering workflow. We envision these kinds of “machine language processing” NMT models being useful as standalone IDA Pro plug-ins or within scalable malware analysis pipelines.
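
For reference, one common way to score a predicted annotation against its ground truth is token-level precision, recall, and F1 over name subtokens, in the spirit of the code2seq evaluation; the helper below is an illustrative sketch rather than our exact scoring code.

def token_f1(predicted: list[str], reference: list[str]) -> float:
    # Score a single prediction by subtoken overlap with the reference name.
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)

# e.g. prediction ["decrypt", "buffer"] vs. reference ["decrypt", "config", "buffer"]
print(round(token_f1(["decrypt", "buffer"], ["decrypt", "config", "buffer"]), 2))  # 0.8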

References
[1] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).
[2] Alon, Uri, et al. "code2seq: Generating sequences from structured representations of code." International Conference on Learning Representations, ICLR (2019). arXiv:1808.01400.
[3] David, Yaniv, Uri Alon, and Eran Yahav. "Neural reverse engineering of stripped binaries using augmented control flow graphs." Proceedings of the ACM on Programming Languages 4.OOPSLA (2020): 1-28. arXiv:1902.09122.
[4] Spirin, Egor, et al. "PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code." 18th IEEE/ACM International Conference on Mining Software Repositories, MSR (2021): 13-17. arXiv:2103.12778.
[5] capa. https://github.com/fireeye/capa