Building a file processing pipeline is increasingly part of many data scientists' jobs. However, this ever-expanding role doesn't have to consume a large part of our time. Serverless architectures, now offered by many large cloud providers, provide just the solution data scientists are looking for. At CarbonBlack Threat Research, we were able to quickly stand up a scalable system for our binary analysis needs. This system let us focus on the data and on designing features rather than on the maintenance and configuration of systems and services. This talk will walk through, with code examples, how we built a scalable serverless system on AWS to produce a feature-rich dataset for various types of file analysis.
Three main topics will be covered:
* Cloud design patterns for ingesting and pre-processing binaries to prepare them for analysis,
* deploying serverless Docker containers for custom analysis, and finally,
* how data is stored and accessed.
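To make the ingestion pattern concrete, here is a minimal sketch of the kind of AWS Lambda handler such a pipeline might start with: an S3 `ObjectCreated` event triggers the function, which extracts the object reference and prepares a message for downstream analysis. This is an illustrative assumption, not the talk's actual code; the handler name, message shape, and the omitted SQS/SNS publish step are all hypothetical.

```python
import hashlib

def handler(event, context=None):
    """Hypothetical Lambda entry point for binary ingestion.

    Triggered by an S3 ObjectCreated event, it collects the bucket/key
    reference for each uploaded binary and derives a stable message ID.
    """
    records = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Stable ID derived from the object reference, useful for
        # deduplication and for keying results in the data store.
        msg_id = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
        records.append({"id": msg_id, "bucket": bucket, "key": key})
    # In a real pipeline, these messages would be published to an
    # SQS queue or SNS topic here to fan out to the analysis containers.
    return {"queued": records}
```

Invoking it with a hand-built S3 event, e.g. `handler({"Records": [{"s3": {"bucket": {"name": "samples"}, "object": {"key": "bin/a.exe"}}}]})`, yields one queued message per uploaded object. Keeping the handler this thin is the point of the pattern: the heavy per-binary work runs later in the container step, not inside the ingest function.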
As part of the analysis step, we will describe the modular approach we took to feature extraction, which allows our researchers to pose questions about binaries and quickly extract features from the entire corpus or a smaller sample set. We will also share some tips for developing these types of systems.
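One common way to realize such a modular approach is a small extractor registry: each feature extractor registers itself under a name, and a sample's feature vector is the merged output of every registered extractor. The sketch below is an assumed illustration of that pattern, not the talk's implementation; the extractor names and the trivial example features are made up.

```python
# Registry mapping extractor names to functions that take raw bytes
# and return a dict of named features.
FEATURE_EXTRACTORS = {}

def extractor(name):
    """Decorator registering a feature-extraction function under a name."""
    def register(fn):
        FEATURE_EXTRACTORS[name] = fn
        return fn
    return register

@extractor("size")
def file_size(data: bytes) -> dict:
    return {"size": len(data)}

@extractor("byte_diversity")
def unique_bytes(data: bytes) -> dict:
    # Cheap stand-in for a real feature: number of distinct byte values.
    return {"unique_bytes": len(set(data))}

def extract_features(data: bytes) -> dict:
    """Run every registered extractor and merge their feature dicts."""
    features = {}
    for name, fn in FEATURE_EXTRACTORS.items():
        features.update(fn(data))
    return features
```

With this shape, posing a new question about the corpus is just writing one decorated function; the existing pipeline picks it up on the next run without any changes to the orchestration code.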