Doug Sibley

Learning to Embed Byte Sequences with Convolutional Autoencoders (pdf, video)

We propose a self-supervised approach to generating features for arbitrary byte sequences by training a convolutional autoencoder directly on raw bytes. Because the vocabulary for this task is small (256 possible byte values), training on sequences of 1 MB or more is viable. We evaluate this approach to byte-level feature engineering in two steps. First, we examine how accurately the autoencoder reconstructs a variety of datasets. Second, we apply it to SOREL malware samples, extract the learned features, and compare them against the EMBER V2 features on the task of malware tagging. Our results suggest that the features learned by the convolutional autoencoder rival the human-engineered set without requiring domain-specific preprocessing of the portable executable file.
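
To make the input side of this idea concrete, the following is a minimal sketch of how raw bytes could be turned into token IDs and how stacked strided convolutions might shrink a 1 MB sequence. It is not the architecture from the talk: the padding ID, the helper names (`bytes_to_ids`, `conv_out_len`, `encoded_len`), and the kernel and stride values are all assumptions chosen for illustration. Only the 256-value vocabulary and the 1 MB sequence length come from the abstract.

```python
from typing import List

VOCAB_SIZE = 256  # one token per possible byte value, as stated in the abstract
PAD_ID = 256      # hypothetical extra ID used for padding; not from the source


def bytes_to_ids(data: bytes, seq_len: int) -> List[int]:
    """Map raw bytes to integer token IDs, truncated or padded to seq_len."""
    ids = list(data[:seq_len])  # iterating over bytes yields ints in 0..255
    ids.extend([PAD_ID] * (seq_len - len(ids)))
    return ids


def conv_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a 1D convolution (standard formula, no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def encoded_len(length: int, layers: int, kernel: int = 4, stride: int = 4) -> int:
    """Sequence length after `layers` identical strided convolutions.

    The kernel and stride defaults are illustrative assumptions only.
    """
    for _ in range(layers):
        length = conv_out_len(length, kernel, stride)
    return length


if __name__ == "__main__":
    # The first bytes of a PE file start with the "MZ" signature.
    print(bytes_to_ids(b"MZ\x90", 5))
    # How far five stride-4 layers would compress a 1 MB input.
    print(encoded_len(1 << 20, layers=5))
```

Under these assumed settings, five stride-4 layers reduce a 1 MB input by a factor of 1024, which hints at why a small byte vocabulary and aggressive downsampling could keep long sequences tractable. A decoder would mirror the encoder with transposed convolutions to reconstruct the original bytes.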