Brian Murphy

ReliaQuest

An Information Security Approach to Feature Engineering (pdf, video)

Feature engineering in data science is central to obtaining satisfactory results from deep learning models. When creating features for InfoSec purposes, it is important to consider the context of each feature and its underlying meaning. Common data science techniques such as feature hashing and one-hot encoding, while effective for certain tasks, often fall short when creating features for security-related models. This is because they often lose locality: values that are close in the original domain, such as two IP addresses in the same subnet, are not mapped to nearby encodings.
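To make the locality point concrete, here is a minimal sketch in Python. It is not the encoder presented in the talk: `hashed_bucket` and `octet_encode` are hypothetical helper names, and the per-octet scaling is only one simple assumption about what a locality-preserving IP encoding might look like.

```python
import hashlib
import ipaddress


def hashed_bucket(value: str, n_buckets: int = 1024) -> int:
    """Feature hashing: map a string to one of n_buckets indices."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def octet_encode(ip: str) -> list[float]:
    """Hypothetical encoder (an assumption, not the talk's algorithm):
    one feature per IPv4 octet, scaled to the range [0, 1]."""
    packed = ipaddress.IPv4Address(ip).packed
    return [byte / 255.0 for byte in packed]


def l1_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hosts in the same subnet, and one host on an unrelated network.
near = l1_distance(octet_encode("10.1.2.3"), octet_encode("10.1.2.4"))
far = l1_distance(octet_encode("10.1.2.3"), octet_encode("203.0.113.7"))

# The octet encoding keeps neighbouring addresses close together.
print(near < far)

# Hash bucket indices are arbitrary; their numeric difference says
# nothing about how close the two addresses are on the network.
print(hashed_bucket("10.1.2.3"), hashed_bucket("10.1.2.4"))
```

With the octet encoding, a model can learn that addresses in the same subnet behave alike. With hashing or one-hot encoding, every address is an unrelated category.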

To address this, we built a set of feature encoders and scalers designed specifically for the data types common to information security. In particular, we have found that advanced security-focused encoders for IP addresses, usernames, URLs, domain names, and geographic information yield dramatically better results than the naïve encoders commonly employed by data scientists.
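As an illustration of what a domain-aware encoder for geographic information could look like, the sketch below maps latitude and longitude onto a 3D unit sphere. This is a standard geometric transformation chosen here as an assumption; the abstract does not state which geographic encoding the talk uses, and `geo_encode` is a hypothetical name.

```python
import math


def geo_encode(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere.

    Assumed illustrative encoding: nearby places get nearby vectors,
    including places on either side of the +/-180 degree meridian.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two points on the equator, 0.2 degrees apart across the date line.
west = geo_encode(0.0, 179.9)
east = geo_encode(0.0, -179.9)

# Raw longitudes differ by 359.8, which wrongly suggests "far apart".
raw_gap = abs(179.9 - (-179.9))

# The encoded vectors are almost identical, reflecting true proximity.
encoded_gap = math.dist(west, east)

print(raw_gap, encoded_gap)
```

Feeding raw longitude to a model, or bucketing it into one-hot categories, hides the fact that these two points are neighbours. The spherical encoding keeps that relationship in the feature values themselves.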

This talk expands upon the rationale used to arrive at these methods of encoding and goes into detail on the algorithms used to build these new encoders.

The improvement in prediction results is clear in a binary classifier trained on labeled data to separate DNS traffic into clean and malicious requests. Accuracy rises from approximately 65% with basic encoders to over 90% with the new security-focused encoders.

Attendees of this presentation will come away with a new approach to encoding InfoSec features for machine learning that should increase the fidelity of their deep learning models.