CAMLIS 2018

DAY ONE


Keynote: Do You Know What Your ML Is Doing?

Presentation > | Video >

How can we better control and understand how our machine-learned models are behaving for all inputs?  We'll discuss two strategies: shape constraints, which regularize functions to capture our prior semantic knowledge, and rate constraints, which let us impose policy goals like fairness and low-churn training on our models.

We'll cover ideas, mathematical principles, and open source TensorFlow code, with pointers to published papers and code for more details.
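As a small illustration of a shape constraint, the sketch below fits a non-decreasing sequence to noisy values using the pool-adjacent-violators algorithm. It is a plain-Python stand-in chosen for brevity, not the TensorFlow code referred to in the keynote, and the function name isotonic_fit is hypothetical.

```python
def isotonic_fit(values):
    """Return the closest (least-squares) non-decreasing sequence to `values`."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every point in a block gets the block mean
    return fitted


print(isotonic_fit([1, 3, 2, 4]))
```

The monotonicity constraint encodes prior knowledge such as "more of this input should never lower the score", which is the kind of semantic regularization the abstract describes.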

Maya Gupta

Principal Scientist, Google

  • Speaker: Matthew Berninger

    It is the job of the SOC and IR team to collect, classify, and report malicious cyber activity. Knowing "who" is behind an attack can help an Incident Response team anticipate the adversary’s next move, or understand what the attackers' end goal may be. In order to be useful, this attribution need not always be tied to geography. Simply knowing "this backdoor is used by group X, who also tends to use Y method of lateral movement" can be enough context to help an IR team optimize their investigation. Knowing where group X lives, or what language they speak, may not always be knowable, or necessary.

    Presentation > | Video >

  • Speaker: Awalin Nabila Sopan

    In a security operations center (SOC), security analysts detect and triage time-sensitive security alerts. One big challenge they face is the number of false positive alerts from various data sources. Using machine learning models to classify such alerts can reduce their workload, but for such mission-critical tasks we cannot depend solely on ML, especially since there are always new types of attacks. To aid the analysts, we developed a system that classifies an alert as Malicious or Benign and presents the prediction to them along with an explanation. In this work, we demonstrate an ongoing effort to explain the machine learning model’s alert classification to SOC analysts using a model explanation visualization. While a human-in-the-loop approach can help improve a model, most published work has focused on interpreting and visualizing the model features for data scientists; we focused on the analysts who triage alerts based on the alert data and the model’s prediction. Hence, we created a visualization of a model prediction to help analysts without overwhelming them.

    Our analysts use a web based platform to investigate alerts triggered by some signature or indicator of compromise. They can view the raw data of the alert and pivot around various features before reaching the final decision (whether the alert is malicious or a benign one). Our UI component shows the analysts what our underlying machine learning model thinks of the alert and ‘Why’. It has three components:
    1. The classification made by the model along with the prediction score.
    2. The decision path: what features of the current alert are used by the model.
    3. The main features from all alerts used by the model.
    If an alert is classified as malicious with high confidence, analysts can verify that by looking at the features presented in the UI and comparing them with the overall data set (the visualization of the data distribution for each matched condition). If they disagree with the model’s decision, they can comment explaining the reason; the data scientists use that feedback to improve the model for future alerts and determine outliers. Thus the analysts can provide insight regarding the model without getting into the mathematical details. To keep the model explainable, we used a random forest model, which uses a number of decision trees, and the features presented to the analysts are only the ones that are human-interpretable.
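    The "decision path" component described above can be sketched for a single tree as follows. This is a minimal illustration, not the authors' system: the nested-dict tree layout, the feature names (failed_logins, bytes_out), and the thresholds are all hypothetical, and a real random forest would report paths from many learned trees.

```python
def decision_path(tree, alert):
    """Walk one decision tree and record each condition the alert matched."""
    path = []
    node = tree
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if alert[feature] <= threshold:
            path.append(f"{feature} <= {threshold}")
            node = node["left"]
        else:
            path.append(f"{feature} > {threshold}")
            node = node["right"]
    return node["label"], path


# Hypothetical hand-built tree standing in for one tree of the forest.
TREE = {
    "feature": "failed_logins", "threshold": 5,
    "left": {"label": "Benign"},
    "right": {
        "feature": "bytes_out", "threshold": 1000000,
        "left": {"label": "Benign"},
        "right": {"label": "Malicious"},
    },
}

label, path = decision_path(TREE, {"failed_logins": 12, "bytes_out": 5000000})
print(label, path)
```

Showing the matched conditions as short readable strings is one way to let an analyst check the model's reasoning without seeing any of the underlying mathematics.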

    Presentation > | Video >

  • Speaker: Cody Wild

    Website content classification has several salient characteristics as a machine learning problem, but perhaps the most salient is that it is a multi-class classification problem with nonuniform and asymmetric misclassification costs. Misclassifying a news site as a business site is a much less serious error than misclassifying a pornographic site as children’s entertainment, and we would like our model’s training objective to reflect that. However, because categorical cross-entropy loss, the standard for neural network models, works by simply increasing the log-probability of the true class, rather than directly penalizing incorrect classes, it offers no straightforward mathematical way to incorporate misclassification costs as loss weights.

    This talk will review existing methodology for incorporating misclassification costs into models, and also propose a novel approach called CCAL: Cost Cluster Auxiliary Losses. This method clusters output classes into groups of mutually low misclassification cost, and then trains the model using the cross-entropy loss on the fully granular category classes, as well as cross-entropy loss against the coarser group labels, at multiple levels of granularity. The intuition behind CCAL is that these auxiliary losses implicitly give the model information about which mistakes are worse than others by giving some positive gradient weight to misclassifications that are still in the same supercluster, and do so in a way that is easier to tune because all auxiliary losses are in the form of cross-entropy, rather than a poorly scaled mix of linear and cross-entropy losses, as in Resheff et al.’s bilinear approach. The talk will conclude by discussing cost structures where one would expect CCAL to perform well or poorly, and examine whether it can be effectively used as a form of curriculum learning.
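    The idea of adding cross-entropy terms on coarser cluster labels can be sketched as below. This is an illustration under stated assumptions, not the talk's implementation: it assumes the probability of a cluster is the sum of its member classes' softmax probabilities, and the names ccal_loss, cluster_levels, and aux_weight are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ccal_loss(logits, true_class, cluster_levels, aux_weight=0.5):
    """Fine-grained cross-entropy plus one auxiliary cross-entropy per cluster level.

    cluster_levels: list of lists; cluster_levels[k][c] is the cluster id of
    class c at granularity level k (an assumed encoding).
    """
    probs = softmax(logits)
    loss = -math.log(probs[true_class])
    for class_to_cluster in cluster_levels:
        true_cluster = class_to_cluster[true_class]
        # Assumption: cluster probability = sum of member class probabilities.
        cluster_prob = sum(
            p for c, p in enumerate(probs) if class_to_cluster[c] == true_cluster
        )
        loss += aux_weight * -math.log(cluster_prob)
    return loss


# Hypothetical classes: 0 news, 1 business, 2 children's, 3 adult.
LEVELS = [[0, 0, 1, 2]]  # news and business share a low-cost cluster
near_miss = ccal_loss([1.0, 3.0, 0.0, 0.0], 0, LEVELS)  # news mistaken for business
far_miss = ccal_loss([1.0, 0.0, 0.0, 3.0], 0, LEVELS)   # news mistaken for adult
print(near_miss < far_miss)
```

Both predictions assign the same probability to the true class, so plain cross-entropy cannot tell them apart; the auxiliary term is what makes the within-cluster mistake cheaper.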

    Presentation > | Video >

  • Speaker: Bronwyn Woods

    Strong authentication is a lynchpin of the zero trust security model, and user and entity behavior analytics (UEBA) aids in establishing or refuting trust in authentication requests. Identifying suspicious activity is often the end goal, but many UEBA systems start with anomaly detection relative to models of expected user behavior. This behavior is statistically complex, and a failure to capture that complexity leads to errors in anomaly detection and threat identification.

    We focus specifically on modeling users’ authentication activity, which shows extremely strong temporal cycles as well as complex dependencies between sequences of authentications. Many anomaly detection techniques treat events as independent, ignoring these dependencies. We incorporate temporal dependence using point process models, which also provide statistical groundwork for formally evaluating how well our models capture the structure of normal activity.

    Point processes are a broad class of models that can describe discrete points (or events) distributed across some mathematical space, such as time. They have undergone decades, or perhaps centuries, of statistical development. Recently, point processes have been used in fields such as neuroscience, seismology, and finance to model discrete, temporally dependent events in increasingly large and complex datasets. The methodology for applying these models to modern datasets is an area of active statistical research, but there is a large body of knowledge that we can already apply directly to the security domain.

    In this talk, I will outline the mathematical foundations of inhomogeneous Poisson point process models, and their application to user authentication data. I will highlight the strengths of these models in accounting for temporal patterns and dependencies, as well as the computational and methodological challenges in applying them to production scale multi-dimensional datasets. Attendees will learn enough about this approach to explore its applicability to other types of event sequence data in security.
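    For readers new to the model class, the log-likelihood of an inhomogeneous Poisson process with intensity λ(t) on [0, T] is Σ log λ(t_i) − ∫ λ(t) dt. The sketch below evaluates it for a piecewise-constant hour-of-day intensity. The binning, the rate floor, and the function names are assumptions made for illustration; this simple form captures daily cycles only, not the dependencies between authentications that the talk also addresses.

```python
import math


def fit_hourly_rates(event_hours, n_days, floor=1e-6):
    """Estimate events per hour for each hour of day from timestamps in hours."""
    counts = [0] * 24
    for t in event_hours:
        counts[int(t) % 24] += 1
    # Floor the rate so unseen hours do not give log(0).
    return [max(c / n_days, floor) for c in counts]


def log_likelihood(event_hours, rates, n_days):
    """log L = sum_i log(rate at t_i) - integral of the rate over the window."""
    integral = n_days * sum(rates)  # each bin is one hour wide
    return sum(math.log(rates[int(t) % 24]) for t in event_hours) - integral


# Hypothetical user who authenticates once an hour from 09:00 to 17:00 for 10 days.
train = [d * 24 + h + 0.5 for d in range(10) for h in range(9, 18)]
rates = fit_hourly_rates(train, n_days=10)
usual = log_likelihood([9.5, 10.5, 11.5], rates, n_days=1)
night = log_likelihood([2.5, 3.5, 4.5], rates, n_days=1)
print(usual > night)
```

A much lower log-likelihood for a new day of activity, relative to the user's history, is one possible anomaly signal.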

    Presentation > | Video >

  • Speaker: Frances Zlotnick

    Mass generation of fake accounts for malicious purposes is a problem that faces many online platforms. Identifying and removing such accounts is an increasingly high priority for security and integrity teams in commercial, governmental, and other contexts, as prevalent misrepresentation on a platform degrades user trust, injects uncertainty into performance and business metrics, and presents opportunities for serious security incidents. Malicious users generating such accounts often go to great lengths to make such accounts appear legitimate, by adding plausible names, photos scraped from other websites, and other details to fake account profiles.

    This habit presents an opportunity for automated detection. Names—to a greater or lesser extent depending on cultural context and language—encode demographic attributes such as gender, the distribution of which can be monitored among legitimate users. Bad actors rarely have sufficient knowledge of a platform's user base to accurately mimic these expected distributions. Sharp departures from known distributions can be used to identify bursts of fake account generation for closer inspection. We present empirical examples using data from our work detecting malicious users.
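    One simple way to quantify a "sharp departure" from a known distribution is a chi-square goodness-of-fit statistic over a window of new sign-ups. The sketch below is an assumption-laden illustration, not the authors' method: the category labels, baseline proportions, and counts are invented, and the inferred categories would come from some upstream name classifier that is not shown.

```python
def chi_square_stat(observed_counts, baseline_props):
    """Pearson chi-square statistic of observed counts against baseline proportions."""
    total = sum(observed_counts.values())
    stat = 0.0
    for category, proportion in baseline_props.items():
        expected = total * proportion
        stat += (observed_counts.get(category, 0) - expected) ** 2 / expected
    return stat


def is_suspicious(observed_counts, baseline_props, critical_value=5.991):
    # 5.991 is the 95% chi-square critical value for 2 degrees of freedom,
    # which matches the three categories used in this example.
    return chi_square_stat(observed_counts, baseline_props) > critical_value


# Hypothetical baseline among legitimate users and two sign-up windows.
BASELINE = {"f": 0.45, "m": 0.45, "unknown": 0.10}
typical_window = {"f": 44, "m": 46, "unknown": 10}
burst_window = {"f": 10, "m": 85, "unknown": 5}
print(is_suspicious(typical_window, BASELINE), is_suspicious(burst_window, BASELINE))
```

A flag from a test like this would only mark a window for closer inspection; it says nothing about any individual account.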

    While potentially useful, use of such methodology sits within a minefield of technical and, most importantly, ethical challenges. We discuss a number of these, including the challenges of detecting gender across cultural contexts, and the inherent dangers of using gender-related features to identify potential bad actors. Particularly in contexts where women are already severely underrepresented, false-positives among this cohort might have the effect of further discouraging participation, running counter to goals of increasing diversity, inclusion, and belonging.

    Presentation > | Video >

  • Speaker: Richard Harang

    In practical applications of binary classification, knowing the uncertainty of the prediction can be almost as important as knowing the most likely prediction. In the case of responses given in a 0-1 range, the distance from one extreme or the other is often taken as a proxy for the certainty (or uncertainty) of the classification. While for the specific case of the binary cross-entropy loss under rarely-obtained conditions this estimate of uncertainty is correct in the narrowly defined sense that it asymptotically attains the posterior conditional probability of the label being in the ‘positive’ class, the general approach of using the output score of the classifier does not typically yield a faithful estimate of uncertainty in the above sense. Furthermore, in the finite-data case, and especially with complex modern classifiers that apply complex transformations, partitions, or both to the input space, the score itself is subject to a significant degree of uncertainty that is frequently difficult to characterize precisely. Thus, even if we accept the score as a proxy for uncertainty, we may be uncertain about how accurate this measurement of uncertainty is!


    In simpler classifiers, direct estimation of this uncertainty can be performed by examining the support of a test point within the training data. However, in many areas of security data science, the size of the input space to classifiers can be quite large, and so the curse of dimensionality can make it difficult to identify the support of an example within the training data. Even when this difficulty can be overcome, the complex relationships between these inputs that most modern classifiers can learn and exploit to obtain their high performance mean that areas of high or low support in the input space may not be so well (or poorly) supported within the transformed space within which the classifier is effectively making its prediction. Variational methods have been proposed to estimate uncertainty in deep neural networks regularized via dropout; however, this comes at a significant computational cost. Finally, multi-half-space classifiers for deep neural networks have been proposed that attempt to learn the density of the training data as represented by the final layer of the network; while this approach incurs a relatively modest computational burden, we find empirically that the better a given network does at separating the data in the final pre-classification layer, the worse this method performs at estimating the training data’s distribution.


    In this talk, we examine this problem from the perspective of Bayesian approximation, and show how using deep neural networks as approximating functions for parameters of a hierarchical Bayesian model can lead to uncertainty estimates for models that are robust, do not fail when the model is “too good”, require comparatively little additional computation to obtain, and can in most cases be directly converted into a maximum a posteriori estimate ‘score’ for the network.
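    To make the distinction between a score and its uncertainty concrete, the sketch below assumes, purely for illustration, that a network outputs the two positive parameters of a Beta distribution over the probability of the positive class. The abstract does not specify the hierarchical model, so this choice and the name beta_summary are assumptions, not the speaker's method.

```python
def beta_summary(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, variance


# Two hypothetical outputs with the same point score but different certainty.
print(beta_summary(1.0, 1.0))    # little evidence: wide distribution
print(beta_summary(50.0, 50.0))  # much evidence: narrow distribution
```

Both cases yield a score of 0.5, yet the variance separates "the model has seen nothing like this" from "the model has seen plenty of evidence on both sides".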

    Presentation > | Video >

  • Speaker: Bobby Filar

    Tree-based classifiers like gradient-boosted decision trees (GBDTs) and random forests provide state-of-the-art performance in many information security tasks such as malware detection. Even while adversarial methods for evading deep learning classifiers abound, little research has been carried out on attacking tree-based classifiers because the models are non-differentiable, which significantly increases the cost of attacks. Research has shown attack transferability may be successful at evading tree-based classifiers, but those techniques do little to illuminate where models are brittle or weak.

    We present TreeHuggr, an algorithm designed to analyze split points of each tree in an ensemble classifier to learn where a model might be most susceptible to an evasion attack. By determining where in the feature space there exists insufficient or conflicting evidence for a class label or where a decision boundary is wrinkled, we can not only better understand the attack space, but we can also more intuitively understand a model’s blind spots and increase interpretability. The key differentiator of TreeHuggr is a focus on where the model is most susceptible, not on how to evade it given a starting point (a common tactic in adversarial examples).
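    The first step named above, gathering the split points of every tree in an ensemble, can be sketched as follows. This is not TreeHuggr itself, whose analysis of those split points is not described here; the nested-dict tree layout, feature names, and function names are hypothetical.

```python
from collections import defaultdict


def collect_splits(tree, splits=None):
    """Recursively gather {feature: [thresholds]} from one tree."""
    if splits is None:
        splits = defaultdict(list)
    if "label" in tree:  # leaf node: nothing to record
        return splits
    splits[tree["feature"]].append(tree["threshold"])
    collect_splits(tree["left"], splits)
    collect_splits(tree["right"], splits)
    return splits


def split_summary(ensemble):
    """Merge split points across all trees and sort them per feature."""
    merged = defaultdict(list)
    for tree in ensemble:
        for feature, thresholds in collect_splits(tree).items():
            merged[feature].extend(thresholds)
    return {feature: sorted(ts) for feature, ts in merged.items()}


# Hypothetical two-tree ensemble over made-up malware features.
ENSEMBLE = [
    {"feature": "entropy", "threshold": 7.0,
     "left": {"label": 0},
     "right": {"feature": "imports", "threshold": 10,
               "left": {"label": 1}, "right": {"label": 0}}},
    {"feature": "entropy", "threshold": 6.5,
     "left": {"label": 0}, "right": {"label": 1}},
]
print(split_summary(ENSEMBLE))
```

Sorted per-feature thresholds give a map of where the ensemble draws its boundaries, which is the raw material for asking where evidence is thin or conflicting.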

    This talk will provide an example-driven demonstration of TreeHuggr against the open-source EMBER dataset and malware model. We hope that TreeHuggr will highlight the potential defensive uses of adversarial research against tree-based classifiers and yield more insights into model interpretability and attack susceptibility.

    Presentation > | Video >

  • Speaker: Malachi Jones

    A prominent technique for detecting sophisticated malware consists of monitoring the execution behavior of each binary to identify anomalies and/or malicious intent. Hooking and emulation are two primary mechanisms that are employed to facilitate the monitoring. Although these behavioral monitoring mechanisms are a substantial improvement over classic signature detection, skilled malware authors have developed reliable techniques to defeat them. As an example, sophisticated malware can exploit hooking implementations by either utilizing alternative (e.g. lower level) unhooked API or by removing the hooks at run-time to evade monitoring. In addition, the malware can also perform checks to detect if it is executing in an emulator/VM and modify its behavior accordingly.

    In this talk, we will demonstrate an approach for pairing Memory Forensics with Binary Analysis and Machine Learning to analyze the behavior of binaries on a set of hosts to detect advanced persistent threats (APTs) that may evade detection by hooking and traditional emulation. In particular, we will discuss how an approximate clustering algorithm with linear run-time performance can be leveraged to identify outliers (i.e. potential APTs) among sets of clustered memory artifacts (i.e. processes, shared libraries, drivers, and kernel modules). Note that these memory artifacts are collected from live, networked hosts and clustered in real time in a scalable manner. We will also discuss and demonstrate how dynamic binary analysis can be leveraged with Machine Learning techniques to differentiate between benign anomalous code and malware to improve detection accuracy.


    Presentation > | Video >

  • Speaker: Lindsey Lack

    Defensive monitoring systems have an insatiable demand for ever-better telemetry, as evidenced by the normalization of host-based systems, comprehensive logging platforms, and orchestration frameworks. These demands put pressure on constrained resources, which can result in monitoring architectures that are distributed or segmented in order to reduce the work on the front-end (or edge) and satisfy the conflicting demands of breadth and depth.

    For illustration, picture a malware detection system that does some initial limited triage before deciding whether to send the file on for more comprehensive analysis. The overall system has an efficacy that is measured by both the triage and the later stages, and it has potential additional costs associated with transfer to a centralized site and back-end processing.
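    The triage-then-escalate system pictured above can be sketched as a two-stage classifier with a reject band. This is a minimal illustration under assumptions: the thresholds, the unit escalation cost, and the scoring callables are hypothetical placeholders, and the reject rule is a simple confidence heuristic of the kind the talk seeks to improve on.

```python
def two_stage_classify(samples, triage_score, deep_score,
                       low=0.2, high=0.8, escalation_cost=1.0):
    """Classify with a cheap model; escalate only uncertain samples.

    Returns (verdicts, total_cost), where a verdict of True means malicious.
    """
    verdicts = []
    total_cost = 0.0
    for sample in samples:
        score = triage_score(sample)
        if low < score < high:  # triage is unsure: pay for the deeper analysis
            score = deep_score(sample)
            total_cost += escalation_cost
        verdicts.append(score >= 0.5)
    return verdicts, total_cost


# Hypothetical stand-ins: samples are already triage scores, and the deep
# stage always returns a confident malicious verdict.
verdicts, cost = two_stage_classify(
    [0.05, 0.5, 0.95],
    triage_score=lambda s: s,
    deep_score=lambda s: 0.9,
)
print(verdicts, cost)
```

Widening the reject band raises accuracy at the price of more escalations, which is the system-level trade-off between efficacy and back-end cost.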

    Traditional examples of machine learning present problems in a simplistic and pristine way that assumes full knowledge of inputs and outputs, analogous to physics problems that don't account for friction or air resistance. In reality, there are often complexities and trade-offs in an implementation's design. The topic of sequential or multi-stage classification has been addressed in machine learning literature, though examples have mainly been applied to synthetic and canonical data sets with a particular focus on medical diagnosis. Previous work has shown that optimizing for the whole system delivers distinct improvements over naive or myopic approaches.

    This talk illustrates the application of optimizing multi-stage classification techniques to security data sets and describes attempts to improve multi-stage classifiers in three ways:
    1) Previous work has relied on heuristic measures of confidence in order to make reject decisions. Especially with complex models, these heuristic measures can be suspect. This research looks into the use of Bayesian methods to achieve better estimates of confidence that can be used even in complex models.
    2) Like most modeling, there is an assumption that training distributions are sufficiently similar to those found at test. With the very large data sets and shifting distributions frequently seen in security domains, these assurances can be difficult to provide. For complex models, out-of-distribution samples can act as "natural" adversarial samples. Additionally, out-of-distribution samples can have an especially deleterious effect on multi-stage processes due to the multiplied costs. This research investigates ways to make sequential classification systems resistant to costly out-of-distribution samples.
    3) Initial stages in multi-stage classification systems are especially sensitive to performance considerations. This research looks at the feasibility of combining multiple functions into a single (multi-output neural network) model to streamline performance.

    Presentation > | Video >

DAY TWO


Keynote: Bonware to the Rescue: the Future Autonomous Cyber Defense Agents

Presentation > | Video >

I will begin my talk by pointing out that in a number of important domains, especially mobile ones such as military but also industrial settings, conventional cyber defense paradigms are increasingly inadequate, and one solution might involve host-based autonomous cyber defense agents. For a number of reasons, machine learning is key to the creation and continuing adaptation of such an agent.  I will discuss what this agent might look like and what distinct functional features and advantages it might exhibit.

I will also describe a tentative vision of how such an agent might be architected and where Machine Learning fits into the architecture. I will outline requirements for the learning process, and possible approaches to how the agent can learn to actively parry the actions of the malware; and what apparent limitations of today’s ML must be overcome in order to address such requirements.

Alexander Kott

Chief Scientist, Army Research Laboratory

  • Speaker: Shanchieh (Jay) Yang

    Cyberattacks on enterprise networks have moved into an era where both attackers and security analysts utilize complex strategies to confuse and mislead one another. Critical attacks often take multitudes of reconnaissance, exploitations, and obfuscation techniques to achieve the goal of cyber espionage and/or sabotage. The discovery and detection of new exploits, though needing continuous efforts, is no longer sufficient.

    Imagine a system that automatically extracts the ways the attackers use various techniques to penetrate a network and generates empirical models that can be used for in-depth analysis or even predict next attack actions. What if we can simulate synthetic attack scenarios based on characteristics of the network and adversary behaviors? Will publicly available information on the Internet be viable to forecast cyberattacks before they take place? This talk will discuss advances that enable anticipatory cyber defense and open research questions.

    Specifically, this talk will present a suite of research prototypes: ASSERT integrates Bayesian-based learning with a clustering validity index to generate and refine attack models based on observed malicious activities; CASCADES employs contextual models to reflect how attackers gradually accumulate their knowledge of the network with various preferences and behavior traits; CAPTURE overcomes limitations of imbalanced, insufficient, and insignificant data to forecast cyberattacks before they happen using unconventional signals in the public domain. This ongoing research will provide anticipatory capability for proactive cyber defense.

    Presentation > | Video >

  • Speaker: C. Bayan Bruss

    Every day, a large number of new specimens of malware are detected worldwide. Often these specimens are variants on existing malware or combinations of older functionality. This creates a need to identify if a piece of malware is similar to existing malware and what functionality is known about the new sample. At the same time, the scale and latency requirements leave little room for full reverse engineering of each sample. Current static methods like signature matching have limited efficacy and are easily bypassed with small changes to the source code. As a result, there is a need for a solution that is fast, scalable, and able to compare observed malware binaries in a way that is robust to small changes and obfuscation tactics. We propose the use of a neural embedding model which learns representations of opcodes and operands in disassembled binaries based on their usage patterns across a large corpus of binaries. In this model, each line of a function is treated as a combination of opcodes and operands, each drawn from an embedding space. Lines that appear near one another within a function are assumed to be contextually relevant to one another. The model then seeks to distinguish lines of embedded opcodes and operands that are contextually relevant from those that are not, learning the best representation of these codes to do so. Once trained, these learned embeddings can be used to efficiently generate a statistical representation of an unseen binary, allowing for clustering and classification. This unsupervised method avoids problems with highly heterogeneous or scarce labeled data. It allows for clustering of malware based on functionality in a way that is not affected by small changes in the code. Initial results also indicate that these embeddings can be used to facilitate training of classifiers where labeled data is available.

    Presentation > | Video >

  • Speaker: Rebecca Bilbro

    While data privacy challenges long predate current trends in machine-learning-as-a-service (MLAAS) offerings, predictive APIs do expose significant new attack vectors. To provide users with tailored recommendations, these applications often expose endpoints either to dynamic models or to pre-trained model artifacts, which learn patterns from data to surface insights. Problems arise when training data are collected, stored, and modeled in ways that jeopardize privacy. Even when user data is not exposed directly, private information can often be inferred using a technique called model inversion. In this talk, I discuss current research in black box model inversion and present a machine learning approach to discovering the model families of deployed black box models using only their decision topologies.

    Presentation > | Video >

  • Speaker: Kyle Gwinnup

    Building a file processing pipeline is sometimes a requirement for data scientists. However, this ever-expanding role of the data scientist doesn’t have to take up a large part of our time. Serverless architectures, which many large tech companies are developing, provide just the solution data scientists are looking for. At CarbonBlack Threat Research, we were able to quickly stand up a scalable system for our binary analysis needs. This system enabled us to focus more on the data and thinking of features rather than the maintenance and configuration of systems and services. This talk will walk through, with code examples, how we were able to build a scalable serverless system using AWS to build a feature-rich dataset for various types of file analysis.

    Three main topics will be covered:
    * cloud design patterns for ingesting and preprocessing binaries to prepare for analysis,
    * deploying serverless docker containers for custom analysis, and finally,
    * how data is stored and accessed.

    As part of our analysis step, we will describe the modular approach we took to feature extraction, which allows our researchers to pose questions about binaries and quickly extract features from the corpus or sample set. Additionally, we will share some tips for developing these types of systems.

    Presentation > | Video >

  • Speaker: Ryan Kovar

    Security data can be surprisingly hard to come by when you don't have users generating it for you. So we made or found datasets and then hosted them for the community. This talk will discuss the "Splunk dataset project" and how it can be used by data scientists (new and experienced) to try machine learning hypotheses across a variety of different datasets in a curated environment. From the Endgame Ember malware dataset to Windows Event Logs, the Splunk Datasets Project attempts to give researchers and newbies a place to try new ML techniques using tools like Splunk's Machine Learning Toolkit (MLTK) which is a bundled version of various ML libraries like numpy, scipy, pandas, scikit-learn, and statsmodels.

    Presentation > | Video >

  • Speaker: Brian Genz

    Attackers have a seemingly endless arsenal of tools and techniques at their disposal, while defenders must continuously strive to improve detection capabilities across the full spectrum of possible attack vectors. The MITRE ATT&CK Framework provides a useful collection of attacker tactics and techniques that enables a threat-focused approach to detection.

    This talk will highlight methodologies and key lessons learned from an internal adversary simulation at a Fortune 100 company that evolved into a series of data science experiments designed to improve threat detection.

    In 2017, we performed basic Exploratory Data Analysis (EDA) while working to improve detection engineering activities around post-exploitation attack techniques during adversary simulation exercises. We paused to ask the question, “Isn’t this labeled data we’re generating? The red team just performed this attack, and we can positively identify the observations that resulted from that attack technique.”

    Could we move beyond clustering, we wondered, and into the realm of supervised learning? We had to consider whether we were introducing any biases based on the methodology used in selecting and executing the attack techniques. We were also curious as to whether the inherent attacker tradecraft principle of stealth might translate into imbalanced classes in the data, and to what extent.

    We defined what we wanted to model: “Post-compromise attacker activity.” We focused on an initial technique: “DNS Exfiltration.” We defined the goal as, “Incorporate labeled attack data in training a model to classify DNS requests as ‘malicious’ or ‘benign.’”
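    As a hedged illustration of the feature engineering such a DNS classifier might start from, the sketch below derives a few lexical features from a query name. The feature set and function names are assumptions, not the features used in the talk, and the subdomain split naively treats the last two labels as the registered domain (which is wrong for suffixes such as co.uk).

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(text).values())


def dns_features(qname):
    """Lexical features of a DNS query name (illustrative, hypothetical set)."""
    labels = qname.rstrip(".").split(".")
    subdomain = ".".join(labels[:-2])  # naive: assumes a two-label base domain
    return {
        "qname_length": len(qname),
        "num_labels": len(labels),
        "longest_label": max(len(label) for label in labels),
        "subdomain_entropy": shannon_entropy(subdomain),
    }


# Hypothetical benign-looking and encoded-looking query names.
print(dns_features("www.example.com"))
print(dns_features("4a6f686e20446f652c313233.t1.example.com"))
```

Long, high-entropy subdomains are a commonly cited hint of data being encoded into queries, though labeled red-team data of the kind described above is what would let a model learn how much weight such features deserve.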

    What started as a few questions and resulting brainstorming sessions eventually grew into a security data science practice supporting detection engineering, Digital Forensics and Incident Response (DFIR), Threat Hunting, and Threat Intelligence at the Fortune 100 company. This talk will step through the key aspects of the problem-solving approach used, with an emphasis on model selection and feature engineering.

    Presentation > | Video >

  • Speaker: Scott Coull

    To effectively protect users from the latest malware threats, detection mechanisms must be capable of adapting as quickly as the threats themselves. Traditional machine learning-based antivirus (i.e., next-gen AV) solutions provide this capability by generalizing from previous examples of malware, but often require laborious development of hand-engineered features by domain experts to gain a true advantage. Moreover, these features are often specific to each type of executable file (e.g., Portable Executable, Mach-O, ELF, etc.), further compounding the amount of overhead required. Recently, however, a series of deep neural network models have been proposed that operate directly on the raw bytes of executable files to detect malware - effectively learning the feature representations directly from the data with no information about its syntax or semantics.

    With the success of these approaches, an obvious question arises: what exactly are these neural networks learning? In this talk, we seek to answer this question by providing a deep and broad analysis of activations in a byte-based deep neural network classifier. Unlike previous work, we expand our analysis beyond simply looking at the location of the activation to understand the basic features that are learned and their connection to the semantics of the executable as a reverse engineer would understand them. Furthermore, we perform this analysis using a dataset that is significantly larger than any other considered in the literature to date - containing more than 15M distinct goodware and malware executables.

    Our experiments include an examination of (1) the general trends in activation locations that separate goodware from malware, (2) analysis of the byte embedding space and low-level feature detectors, and (3) end-to-end activation analysis using the SHapley Additive exPlanations (SHAP) framework. Where possible, we bridge the gap between raw-byte activations and the semantics of the executable through automated parsing and disassembly of the activation locations in an effort to obtain human-understandable explanations for the model's predictions. We exploit this capability to perform a unique bi-directional validation process between a reverse engineer and the model, whereby the reverse engineer and model score each other's areas of interest within the executable.

    Overall, the results of these analyses provide novel insight into many aspects of why byte-based malware classifiers work as well as they do. More importantly, they help shape our evolving understanding of the resilience of deep neural network architectures to adversarial examples, as well as the development of new hand-engineered features. Finally, the tools developed here represent an initial step toward providing analysts with the necessary context for understanding malware predictions made by deep learning models.

    Presentation > | Video >

  • Speaker: Hyrum Anderson

    Much of the success of machine learning malware classifiers depends on a meaningful representation of file features. In fact, unlike applications such as machine vision and machine translation in which “featureless” end-to-end deep learning achieves state of the art performance, static malware machine learning is still dominated by hand-crafted features wherein specific discriminative domain knowledge can be codified manually that is not inferred automatically via end-to-end deep learning. However, recent advances in end-to-end deep learning for malware classification offer ever-improving success rates, the hope of parser-less file classification, and glimpses of discovering truly malicious or benign content in a file byte sequence. In either case, representation is key.

    This talk considers the following problem setup. One wishes to learn a feature representation from both labeled and abundant unlabeled Windows portable executable (PE) files (although the techniques presented are broadly applicable to other formats). It is desirable that the features not only enable classification performance approaching that of fully discriminative networks, but also encapsulate semantically meaningful file characteristics by which one can, for example, measure a file's similarity to functionally similar malware samples.

    I will present and compare several novel and yet-unpublished approaches to semi-supervised learning of PE file representations. First, I present unsupervised learning of file features from raw bytes by solving a file chunk reordering problem, akin to solving a jigsaw puzzle (previously applied to images). I will demonstrate, however, that this promising approach is unable to learn the desired invariances. Next, I present a semi-supervised deep learning approach that explicitly learns meaningful invariances and leverages a novel neural architecture I call a softmax forest for performing a self-taught learning task for binary files, and show how this approach is superior to commonly used metric learning frameworks. Finally, I compare these to an unsupervised feature representation approach I term a variational equivalence encoder (VEE) that uses variational principles to learn invariances. This latter framework can be viewed as a modification of a standard variational auto-encoder (VAE). I'll compare these approaches to the baseline EMBER model for classification and EMBER features for similarity search.

    Presentation > | Video >
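
    The chunk reordering task mentioned above can be sketched as a data-generation step: split a file's bytes into equal chunks, shuffle them with one permutation drawn from a fixed set, and use the index of that permutation as the training label. This is only an illustrative sketch under assumed names and parameters (make_permutation_set, make_jigsaw_example, restore, the chunk count, and the set size are all hypothetical), not the speaker's implementation, and the network that would predict the label is omitted.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Permutation = Tuple[int, ...]


def make_permutation_set(num_chunks: int, set_size: int, seed: int = 0) -> List[Permutation]:
    """Pick a fixed set of chunk orderings; an index into this set is the label.

    Enumerates all num_chunks! orderings, so keep num_chunks small.
    """
    rng = random.Random(seed)
    all_perms = list(itertools.permutations(range(num_chunks)))
    return rng.sample(all_perms, set_size)


def make_jigsaw_example(
    data: bytes, num_chunks: int, perms: Sequence[Permutation], rng: random.Random
) -> Tuple[List[bytes], int]:
    """Split data into equal chunks and shuffle them with a randomly chosen ordering.

    Trailing bytes that do not fill a whole chunk are dropped.
    """
    chunk_len = len(data) // num_chunks
    chunks = [data[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]
    label = rng.randrange(len(perms))
    shuffled = [chunks[j] for j in perms[label]]
    return shuffled, label


def restore(shuffled: Sequence[bytes], perm: Permutation) -> bytes:
    """Invert the ordering to recover the original byte sequence."""
    original = [b""] * len(perm)
    for position, source_index in enumerate(perm):
        original[source_index] = shuffled[position]
    return b"".join(original)


if __name__ == "__main__":
    perms = make_permutation_set(num_chunks=4, set_size=6)
    data = bytes(range(16))
    shuffled, label = make_jigsaw_example(data, 4, perms, random.Random(1))
    # A model would be trained to predict `label` from `shuffled`.
    print(restore(shuffled, perms[label]) == data)
```

    Restricting the labels to a fixed subset of orderings keeps the prediction target small; the restore helper is included only to check that the shuffle is reversible.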

  • Speaker: David Krisiloff

    Cybersecurity utilizes crowdsourcing for a variety of tasks, from spam detection to security bug bounties. For antivirus, VirusTotal provides a crowdsourcing platform that aggregates results from more than 70 antivirus (AV) scanners, making it a tempting source of labels to train machine-learning-based AV. However, VirusTotal has multiple unique features compared to other crowdsourcing models. Unlike most crowdsourced data, AV scanners reliably improve over time. New AV engine versions incorporate new malware signatures that, on average, improve detection performance. Furthermore, VirusTotal detections are public, producing a feedback loop where AV scanners can learn from other AV scanners. VirusTotal runs each AV engine against every new file submitted. VirusTotal also allows users to rescan an old file with the latest AV engines, but limits the number of files that can be rescanned per day. This environment raises a variety of questions. How do we assign malware labels from noisy VirusTotal reports? When should a file be rescanned to take advantage of AV updates? How should rescans be prioritized?

    Using a set of historical VirusTotal reports, we examine the temporal dynamics of virus detections and discuss a variety of models for producing labels from the reports. Changes in AV detections over time are generally predictable using machine learning models. This makes it possible to anticipate which files are most likely to change their labels over time, regardless of the function used to combine the crowdsourced detections into labels. We present optimal strategies for rescanning files on VirusTotal to build improved data sets. Ultimately, our models produce more accurate labels faster than passively waiting for AV vendors on VirusTotal to come to a consensus.

    Presentation > | Video >
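
    To make the labeling and rescan questions concrete, the sketch below shows one simple possibility: a vote-count threshold that turns a report into a label, and a greedy selection of files to rescan under a daily limit. The report layout (engine name mapped to a boolean), the threshold of 5, and the change probabilities are all assumptions for illustration; the probabilities stand in for the output of a predictive model that is not shown, and this is neither the VirusTotal API nor the strategies presented in the talk.

```python
from typing import Dict, List


def threshold_label(detections: Dict[str, bool], min_positives: int = 5) -> int:
    """Label a file malicious (1) if at least min_positives engines flag it, else benign (0)."""
    return 1 if sum(detections.values()) >= min_positives else 0


def pick_rescans(change_probability: Dict[str, float], daily_budget: int) -> List[str]:
    """Choose the files most likely to change label, up to the daily rescan budget."""
    ranked = sorted(change_probability, key=change_probability.get, reverse=True)
    return ranked[:daily_budget]


if __name__ == "__main__":
    # Hypothetical report: six of seventy engines flag the file.
    report = {f"engine_{i}": i < 6 for i in range(70)}
    print(threshold_label(report))

    # Hypothetical model output: probability that each file's label will flip.
    probs = {"file_a": 0.10, "file_b": 0.85, "file_c": 0.40, "file_d": 0.05}
    print(pick_rescans(probs, daily_budget=2))
```

    A fixed threshold is only one of many possible combining functions; the rescan selection works the same way regardless of which one is used, since it depends only on the predicted probability of change.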

  • Speaker: Nahid Farhady

    Signature-based malware detection is widely used in commercial antivirus products. However, this method fails to detect zero-day malware. Antivirus engines are therefore moving toward machine learning techniques that find the shared features and similar behaviors of malware families in order to detect new samples as well. These techniques have long focused on static features; to classify malware, however, malware engineers must still go through an extensive process of dynamic analysis, which entails executing the malware in a sandbox and extracting its features. In this research, we propose an end-to-end framework for malware detection and classification using machine learning techniques. In this framework, we use DNN models to distinguish malware from benign files and propose an uncertainty score for the classification. Using the uncertainty score, we build another classifier that considers more static features in order to categorize files with higher accuracy. The purpose of building two models is to accelerate the process by using a small set of features first and then a more extensive set that includes important strings, functions, and import headers. In the next step, we propose a classification model that divides malware into cyber crime and cyber espionage based on entropy. Each of these categories can then be classified into up to 10 subcategories using further dynamic analysis. Since there are more than 100 dynamic features and extracting them can be cumbersome, we also built a model to prioritize them. We use Principal Component Analysis (PCA) to prioritize the dynamic features to be explored for each subcategory. This method accelerates labeling and classification for malware engineers, which results in recognizing quarantine techniques much faster in the process.

    Our research proposes the top 5 dynamic features to be analyzed for each subcategory of malware. To expose our model to several types of malware, we have partnered internally with iDefense, a threat intelligence company that owns a database of 270M malware binaries. What differentiates this work from the previous literature is the type of malware included in our training and testing datasets, on top of simple feature selection to accelerate the detection process. Using the proposed DNN model and only 6 static features, we achieve an FNR of less than 1% with a TPR of over 96%. The prioritization and feature selection effort has shown that the accuracy of malware classification can be boosted by using the appropriate features for each subcategory.

    Presentation > | Video >
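
    The two-model arrangement described above can be sketched as a simple routing rule: accept the small model's verdict when its uncertainty is low, and otherwise fall back to the model with the larger feature set. The abstract does not say how the uncertainty score is computed, so the binary entropy used here is an assumption, as are the function names, the 0.5 cutoff, and the stand-in models (any callables returning a malware probability).

```python
import math
from typing import Any, Callable, Tuple

Model = Callable[[Any], float]  # returns the predicted probability of "malware"


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) prediction: 0 when certain, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def classify(
    sample: Any, fast_model: Model, full_model: Model, max_uncertainty: float = 0.5
) -> Tuple[str, str]:
    """Return (verdict, stage), consulting the larger model only for uncertain samples."""
    p = fast_model(sample)
    if binary_entropy(p) <= max_uncertainty:
        return ("malware" if p >= 0.5 else "benign", "fast")
    p = full_model(sample)
    return ("malware" if p >= 0.5 else "benign", "full")


if __name__ == "__main__":
    # Stand-in models for illustration only.
    confident_fast = lambda sample: 0.99
    unsure_fast = lambda sample: 0.55
    full = lambda sample: 0.10
    print(classify("sample-1", confident_fast, full))
    print(classify("sample-2", unsure_fast, full))
```

    Raising max_uncertainty sends fewer samples to the larger model and so trades accuracy on borderline files for speed; the right value would have to be tuned on held-out data.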