CAMLIS 2023

DAY ONE

Keynote: Lessons for AI Security Preparedness

Shawn Richardson is the Director of Cyber Defense Operations at NVIDIA, leading the product security operations center and incident response teams. She has spent most of her 20+ year career in product security and incident response roles at companies like Microsoft, Palo Alto Networks and Amazon.

She served as a board member for FIRST.org, an international organization that brings together incident response and security teams from countries across the world to ensure a safe internet for all, and currently participates in several industry special interest groups.

Shawn Richardson

Director of Cyber Defense Operations, NVIDIA

Speaker: Arjun Chakraborty
Kubernetes (K8s) is a platform used for managing containerized applications. It has robust orchestration, scaling and load balancing capabilities. However, its complexity can make it a target for attackers.

This makes it necessary to secure every aspect of the Kubernetes stack, and Kubernetes audit logs are very useful for that purpose. K8s audit logs record each activity that occurs in the cluster, along with metadata such as IP address and user agent. These records can then be searched for indicators of attack.

Our work introduces a novel graph neural network (GNN) based solution to K8s threat detection. We model a sequence of dependent events occurring within a K8s session as a graph and formulate the problem as a graph classification task. The embeddings generated from the graph classification task are then used downstream for anomaly detection.

We simulate some commonly used adversarial techniques and showcase how using GNN-based embeddings downstream can strengthen traditional rules-based threat detection techniques.

Our discussion covers dataset creation, graph modeling of K8s sessions, embedding extraction, application of the embeddings and finally, the adversarial simulation for testing.
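As a rough illustration of the graph-modeling step, the sketch below turns one session's ordered audit events into a directed graph whose nodes are (verb, resource) pairs and whose weighted edges link consecutive events. The node definition, the edge rule, and the flattened `verb`/`resource` fields are assumptions made for this example; the abstract does not specify the graph schema used in the talk.

```python
from collections import Counter


def session_to_graph(events):
    """Build a weighted directed graph from one session's ordered audit events.

    Assumption: each event is a dict with 'verb' and 'resource' keys, a
    flattened stand-in for a real Kubernetes audit event.
    """
    # One node per distinct (verb, resource) pair seen in the session.
    steps = [(e["verb"], e["resource"]) for e in events]
    nodes = sorted(set(steps))
    # One edge per pair of consecutive events, weighted by how often it occurs.
    edges = Counter(zip(steps, steps[1:]))
    return nodes, dict(edges)


# Hypothetical session: list pods, read a secret, then repeat both steps.
session = [
    {"verb": "list", "resource": "pods"},
    {"verb": "get", "resource": "secrets"},
    {"verb": "list", "resource": "pods"},
    {"verb": "get", "resource": "secrets"},
]
nodes, edges = session_to_graph(session)
```

A graph classifier would consume a structure like this (with richer node features) and its pooled embedding could then feed a downstream anomaly detector.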
Presentation > | Video >
Speakers: Grant Gelven and Shannon Strum
In this talk, we will discuss the use of graph-based user-entity behavior analytics to develop an insider threat detection system at one of the largest private companies in the world. We use raw audit log data from multiple systems which captures point-in-time interactions between people and internal resources. These can be transformed into a heterogeneous weighted bipartite graph, reducing user behavior against internal assets to a link prediction problem on the graph of all users and resources. We show that classical matrix factorization techniques can be adapted to generate reliable statistics on the observed and expected behaviors of users which allows for monitoring and detection of anomalous events while also providing a natural way to measure the exposure to insider threat risks due to over-privileged access. We provide a few highlights related to the problem in an enterprise setting, and describe the mathematical framework used for quantifying risk, the methods for modeling individual actions, and reporting of results for use in improving overall security posture.
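To make the observed-versus-expected comparison concrete, the sketch below uses the simplest possible stand-in for a matrix factorization: a rank-1 estimate built from row and column totals (the independence model). This is not the speakers' model, and the user names, resource names, and counts are hypothetical.

```python
def expected_counts(observed):
    """Rank-1 estimate of user-resource access counts.

    observed: dict mapping (user, resource) -> access count.
    Returns a function giving the expected count for any (user, resource)
    pair as user_total * resource_total / grand_total.
    """
    user_totals, resource_totals, grand_total = {}, {}, 0
    for (user, resource), count in observed.items():
        user_totals[user] = user_totals.get(user, 0) + count
        resource_totals[resource] = resource_totals.get(resource, 0) + count
        grand_total += count

    def expected(user, resource):
        if grand_total == 0:
            return 0.0
        return user_totals.get(user, 0) * resource_totals.get(resource, 0) / grand_total

    return expected


# Hypothetical audit-log counts on a user-resource bipartite graph.
observed = {
    ("alice", "wiki"): 8,
    ("alice", "payroll"): 2,
    ("bob", "wiki"): 9,
    ("bob", "payroll"): 1,
}
expected = expected_counts(observed)
# A ratio above 1 means the user touches the resource more than the model expects.
ratio = observed[("alice", "payroll")] / expected("alice", "payroll")
```

A higher-rank factorization would replace the marginal totals with learned user and resource factors, but the comparison of observed against expected counts would work the same way.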
Presentation > | Video >
Speakers: Josh Collyer, Tim Watson and Iain Phillips
Being able to identify functions of interest in cross-architecture software is useful whether you are analyzing for malware, securing the software supply chain or conducting vulnerability research. Cross-Architecture Binary Code Similarity Search has been explored in numerous studies and has used a wide range of different data sources to achieve its goals. The data sources typically used draw on common structures derived from binaries such as function control flow graphs or binary level call graphs, the output of the disassembly process or the outputs of a dynamic analysis approach. One data source which has received less attention is binary intermediate representations. Binary intermediate representations possess two interesting properties: they are cross architecture by their very nature and encode the semantics of a function explicitly to support downstream usage. Within this paper we propose Function as a String Encoded Representation (FASER) which combines long document transformers with the use of intermediate representations to create a model capable of cross-architecture function search without the need for manual feature engineering, pre-training or a dynamic analysis step. We compare our approach against a series of baseline approaches for two tasks: a general function search task and a targeted vulnerability search task. Our approach demonstrates strong performance across both tasks, performing better than all baseline approaches.
Presentation > | Video >
Speakers: Becca Lynch and Lauren Saue-Fletcher
While phishing has long been a prevalent threat against authentication systems, a gain in popularity of reverse-proxy kits has made detection and prevention of phishing attacks increasingly difficult. Open-source tools such as evilginx are capable of not only phishing credentials and passcodes, but proxying an entire multi-factor authentication (MFA) flow and all associated cookies. In this scenario, the user sees an expected login prompt from the MFA provider, proxied through the attack server, while the MFA provider sees what appears to be a valid login session simply originating from a different IP address. To the MFA provider, the IP of the attack server is often the only apparent difference between a malicious and a benign authentication. This, coupled with inaccuracies in IP geolocation, variable user behavior, ISP IP shuffling, benign VPN usage, and a severe imbalance between benign and malicious authentications, limits traditional server-side ML detection capabilities. Using data from [REDACTED], a large authentication provider, we applied point-in-time DNS data to authentication records to identify domains corresponding to the source IP address of the client at the moment of access. We utilized targeted URL and behavioral filtering to identify likely attacker-owned domain-IP pairs, and analyzed authentications from these IPs to provide data insights on MFA phishing attack signatures. With this newly uncovered set of labeled malicious authentications, we test a variety of classification approaches in the detection of MFA bypass attacks. We demonstrate the benefits of threat-informed data mining in true positive sample generation, as well as the performance and usability tradeoffs of multiple classification methods in the server-side detection of MFA bypass attacks. These classification techniques applied on newly labeled phishing authentication data are then shown to out-perform unsupervised methods in the identification of malicious authentications.
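The point-in-time DNS enrichment described above can be sketched as an interval lookup: for each authentication, find the domains that resolved to the source IP at the moment of access. The record layout (`domain, ip, first_seen, last_seen`), the field names, and the sample data are assumptions for illustration, not the provider's actual schema.

```python
from datetime import datetime


def domains_at(dns_records, ip, when):
    """Return the domains that resolved to `ip` at time `when`.

    Assumption: each DNS record is a (domain, ip, first_seen, last_seen)
    tuple, a hypothetical passive-DNS layout chosen for this example.
    """
    return sorted(
        domain
        for domain, record_ip, first_seen, last_seen in dns_records
        if record_ip == ip and first_seen <= when <= last_seen
    )


# Hypothetical data; 203.0.113.0/24 is a documentation-only address range.
dns_records = [
    ("login-example.test", "203.0.113.7", datetime(2023, 3, 1), datetime(2023, 3, 5)),
    ("benign-example.test", "203.0.113.7", datetime(2023, 4, 1), datetime(2023, 4, 30)),
]
auth = {"user": "u1", "ip": "203.0.113.7", "time": datetime(2023, 3, 2, 14, 30)}
matches = domains_at(dns_records, auth["ip"], auth["time"])
```

Matching on the time window matters because the same IP can host an attacker-owned domain one week and an unrelated benign domain the next.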
Presentation > | Video >
Speaker: Konstantin Berlin
Recently, there has been a major paradigm shift in cybersecurity protection, with the focus shifting from attack prevention on edge devices to cloud-centric detection pipelines on top of centrally stored data collected from an entire customer estate. Centralizing data in the cloud provides greater visibility, enabling the deployment of more complicated detection pipelines that can use information from multiple observability points to make more complex decisions. For example, data across email, firewall, and endpoints can be combined to provide not only more complex detection logic but to also orchestrate complex mitigations and remediations in response to an attack. In turn, this drastically increased the amount of data security vendors processed in the cloud to levels previously only seen in the largest cloud-based companies.
Here we describe Sophos AI’s latest MLOps infrastructure that is designed to be flexible, simple to maintain, and scalable. We conceptually refer to it as an immutable SQL-driven infrastructure. The idea behind this is SQL-orchestrated workflows running on top of a cloud-based SQL data warehouse (in this case Snowflake), where non-SQL components are directly accessible in SQL through external linkage of standard ECS/Kubernetes auto-scaling clusters fronted by a generic batching-first API. These external components are immutable (we do not remove them from infrastructure, just autoscale them to 0), meaning that any update to the components cannot break existing pipelines. Written in SQL, the pipelines are much easier to understand and do not require a complex cloud engineering skillset to maintain or modify.
We believe that the biggest challenge in cybersecurity ML remains data quality and that most smaller groups are challenged to fund dedicated engineering operations to support their work. We hope that sharing our data-warehouse-first approach to MLOps will give other teams ideas for how to reduce the complexity of their MLOps infrastructure.
Presentation > | Video >
Speakers: Tirth Patel, Fred Lu, Edward Raff, Charles Nicholas, Cynthia Matuszek and James Holt
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines, meaning a 0.1% change can cause an overwhelming number of false positives. However, academic research is often restricted to public datasets on the order of ten thousand samples, which are too small to detect improvements that may be relevant to industry. Working within these constraints, we devise an approach to generate a benchmark of configurable difficulty from a pool of available samples. This is done by leveraging malware family information from tools like AVClass to construct training/test splits that have different generalization rates, as measured by a secondary model. Our experiments will demonstrate that using a less accurate secondary model with disparate features is effective at producing benchmarks for a more sophisticated target model that is under evaluation. We also ablate against alternative designs to show the need for our approach.
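One simple way family labels can shape a split is to hold out whole families, so that no test family is ever seen in training. The sketch below shows only that idea; it is a simplification, and the secondary-model measurement of generalization rate described in the abstract is not reproduced. Sample IDs and family names are hypothetical.

```python
def split_by_family(samples, test_families):
    """Split (sample_id, family) pairs so test families never appear in training."""
    train = [s for s in samples if s[1] not in test_families]
    test = [s for s in samples if s[1] in test_families]
    return train, test


# Hypothetical samples labeled with a family name (for example, from an AV tagger).
samples = [
    ("s1", "familyA"),
    ("s2", "familyA"),
    ("s3", "familyB"),
    ("s4", "familyC"),
]
train, test = split_by_family(samples, test_families={"familyB", "familyC"})
```

Under this simplification, the choice of which families to hold out is the knob that changes how hard the resulting benchmark is.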
Presentation > | Video >
Speakers: Kate Highnam, Zach Hanif, Ellie Van Vogt, Sonali Parbhoo, Sergio Maffeis, and Nicholas R. Jennings
Intrusion research frequently collects data on attack techniques currently employed and their potential symptoms. This includes deploying honeypots, logging events from existing devices, employing a red team for a sample attack campaign, or simulating system activity. However, these observational studies do not clearly discern the cause-and-effect relationships between the design of the environment and the data recorded.
Neglecting such relationships increases the chance of drawing biased conclusions due to unconsidered factors, such as spurious correlations between features and errors in measurement or classification. In this paper, we present the theory and empirical data on methods that aim to discover such causal relationships efficiently.
Our adaptive design (AD) is inspired by the clinical trial community: a variant of a randomized control trial (RCT) to measure how a particular “treatment” affects a population. To contrast our method with observational studies and RCT, we run the first controlled and adaptive honeypot deployment study, identifying the causal relationship between an SSH vulnerability and the rate of server exploitation. We demonstrate that our AD method decreases the total time needed to run the deployment by at least 33%, while still confidently stating the impact of our change in the environment. Compared to an analogous honeypot study with a control group, our AD requests 17% fewer honeypots while collecting 19% more attack recordings.
Presentation > | Video >
Speakers: Robert Joyce, Edward Raff, Charles Nicholas and James Holt
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family. However, malware can be categorized according to many other types of attributes, and the ability to identify these attributes in newly-emerging malware using machine learning could provide significant value to analysts. In particular, we have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with.
To obtain labels for training and evaluating ML classifiers on these tasks, we created an antivirus (AV) tagging tool called ClarAVy. ClarAVy's sophisticated AV label parser distinguishes itself from prior AV-based taggers, with the ability to accurately parse 882 different AV label formats used by 90 different AV products. We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total. Our malware behavior dataset includes 75 distinct tags - nearly 7x more than the only prior benchmark dataset with behavioral tags. To our knowledge, we are the first to release datasets with malware platform, exploitation, and packer tags.
Presentation > | Video >

DAY TWO

Keynote: Security Issues in Generative AI

Presentation > | Video >

Tom Goldstein is the Volpi-Cupal Associate Professor of Computer Science at the University of Maryland, and director of the Maryland Center for Machine Learning. His research lies at the intersection of machine learning and optimization, and targets applications in computer vision and signal processing.

Professor Goldstein has been the recipient of several awards, including SIAM’s DiPrima Prize, a DARPA Young Faculty Award, a JP Morgan Faculty award, an Amazon Research Award, and a Sloan Fellowship.

Tom Goldstein

Associate Professor of Computer Science, University of Maryland

Speakers: Cheng Wang, Riddam Rishu, Akshay Kakkar, Abdul Rahman, Christopher Redino, Dhruv Nandakumar, Tyler Cody, Ryan Clark, Dan Radke and Edward Bowen
Building on previous work using reinforcement learning (RL) focused on identification of exfiltration paths, this work expands the methodology to include protocol and payload considerations. The earlier approach to exfiltration path discovery, in which reward and state are tied specifically to the determination of optimal paths, is extended with these additional realistic characteristics to account for nuances in adversarial behavior. The paths generated are enhanced by including communication payload and protocol in the Markov decision process (MDP) in order to more realistically emulate attributes of network-based exfiltration events.
The proposed method will help emulate complex adversarial considerations such as the size of a payload being exported over time or the protocol on which it occurs, as is the case where threat actors steal data over long periods of time using system native ports or protocols to avoid detection. As such, practitioners will be able to improve identification of expected adversary behavior under various payload and protocol assumptions more comprehensively.
Presentation > | Video >
Speakers: Alec Wilson, Ryan Menzies, David Foster, Marco Casassa Mont, Neela Morarji, Esin Turkbeyler and Lisa Gralewski
This paper demonstrates the potential for autonomous cyber defence to be applied on industrial control systems and provides a baseline environment to further explore Multi-Agent Reinforcement Learning’s (MARL) application to this problem domain. It introduces a simulation environment, IPMSRL, of a generic Integrated Platform Management System (IPMS) and explores the use of MARL for autonomous cyber defence decision-making on generic maritime based IPMS Operational Technology (OT).

OT cyber defensive actions are less mature than they are for Enterprise IT. This is due to the relatively ‘brittle’ nature of OT infrastructure originating from the use of legacy systems, design-time engineering assumptions, and lack of full-scale modern security controls. There are many obstacles to be tackled across the cyber landscape due to continually increasing cyber-attack sophistication and the limitations of traditional IT-centric cyber defence solutions. Traditional IT controls are rarely deployed on OT infrastructure, and where they are, some threats aren’t fully addressed.

In our experiments, a shared critic implementation of Multi Agent Proximal Policy Optimisation (MAPPO) outperformed Independent Proximal Policy Optimisation (IPPO). MAPPO reached an optimal policy (episode outcome mean of 1) after 800K timesteps, whereas IPPO was only able to reach an episode outcome mean of 0.966 after one million timesteps. Hyperparameter tuning greatly improved training performance. Across one million timesteps the tuned hyperparameters reached an optimal policy whereas the default hyperparameters only managed to win sporadically, with most simulations resulting in a draw. We tested a real-world constraint, attack detection alert success, and found that when alert success probability is reduced to 0.75 or 0.9, the MARL defenders were still able to win in over 97.5% or 99.5% of episodes, respectively.
Presentation > | Video >
Speakers: Stefan Trawicki, William Hackett, Lewis Birch, Neeraj Suri and Peter Garraghan
Adversarial Machine Learning (AML) is a rapidly growing field of security research, with an often overlooked area being model attacks through side-channels. Previous works show such attacks to be serious threats, though little progress has been made on efficient remediation strategies that avoid costly model re-engineering. This work demonstrates a new defense against AML side-channel attacks using model compilation techniques, namely tensor optimization. We show relative model attack effectiveness decreases of up to 43% using tensor optimization, and discuss the implications and directions for future work.
Presentation > | Video >
Speakers: Biagio Montaruli, Luca Demetrio, Maura Pintor, Luca Compagna, Davide Balzarotti and Battista Biggio
Machine-learning phishing webpage detectors (ML-PWD) have been shown to suffer from adversarial manipulations of the HTML code of the input webpage. Nevertheless, the attacks recently proposed have demonstrated limited effectiveness because they do not optimize the use of the adopted manipulations and focus solely on specific elements of the HTML code. In this work, we overcome these limitations by first designing a novel set of fine-grained manipulations which enable modifying the HTML code of the input phishing webpage without compromising its maliciousness and visual appearance, i.e., the manipulations are functionality- and rendering-preserving by design.
We then select which manipulations should be applied to bypass the target detector by a query-efficient black-box optimization algorithm. Our experiments show that our attacks are able to raze to the ground the performance of current state-of-the-art ML-PWD using just 20 queries, thus overcoming the weaker attacks developed in previous work, and enabling a much fairer robustness evaluation of ML-PWD.
Presentation > | Video >
Speaker: Tamas Voros
We introduce a state-of-the-art approach for URL categorization that leverages the power of Large Language Models (LLMs) to address the primary objectives of web content filtering: safeguarding organizations from legal and ethical risks, limiting access to high-risk or suspicious websites, and fostering a secure and professional work environment. Our method utilizes LLMs to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering. Distillation results in a student model with a 9% accuracy rate improvement in classifying websites, sourced from customer telemetry data collected by a large security vendor, into 30 distinct content categories based on their URLs, surpassing the current state-of-the-art approach.
Our student model matches the performance of the teacher LLM with 175 times fewer parameters, allowing the model to be used for in-line scanning of large volumes of URLs, and requires 3 orders of magnitude less manually labeled training data than the current state-of-the-art approach. Depending on the specific use case, the output generated by our approach can either be directly returned or employed as a pre-filter for more resource-intensive operations involving website images or HTML.
Presentation > | Video >
Speakers: Lewis Birch, William Hackett, Stefan Trawicki, Neeraj Suri and Peter Garraghan
Model Leeching is a novel extraction attack targeting Large Language Models (LLMs), capable of distilling task-specific knowledge from a target LLM into a reduced parameter model. We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87% respectively for only $50 in API cost. We further demonstrate the feasibility of adversarial attack transferability from a model extracted via Model Leeching to perform ML attack staging against a target LLM, resulting in an 11% increase in attack success rate when applied to ChatGPT-3.5-Turbo.
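For readers unfamiliar with the metrics quoted above, the sketch below computes Exact Match and token-level F1 in the way they are commonly computed for SQuAD-style question answering. The normalization steps (lowercasing, stripping punctuation and English articles) follow the usual convention and are an assumption about the exact scoring used in this work.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The same two functions can score an extracted model against ground-truth answers or against the target model's own answers, which is how a similarity figure between two models can be obtained.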
Presentation > | Video >
Speakers: Adarsh Kyadige, Salma Taoufiq, Younghoo Lee, Tamas Voros and Konstantin Berlin
The emergent capabilities of Large Language Models (LLMs) across multiple domains have sparked a lot of interest. However, a significant challenge is deciding how to select a suitable model for a specialized field, such as cybersecurity, and determining when fine-tuning or knowledge distillation is necessary.

To address these challenges, we propose three cybersecurity-specific benchmarks aimed at assessing models' security proficiency and applicability. The first task evaluates the ability of LLMs to act as assistants in translating human language questions into machine-readable SQL queries.

The second task is focused on incident severity prediction. We benchmark LLMs based on their ability to classify incident severity from reams of semi-structured data. The performance is gauged with predictions compared against human analysts using metrics such as accuracy, recall, and precision.

The final task evaluates LLMs' capability to succinctly summarize and explain security events, assisting analysts in understanding incidents. The models are evaluated on their ability to generate summaries of Indicators of Compromise (IOCs). The analysis involves an array of metrics, including factual accuracy and semantic string comparison.

Several LLMs, including proprietary and open-source models such as OpenAI’s GPT-4, MosaicML’s MPT-30B-Instruct, and Anthropic’s Claude, were evaluated across these benchmarks. Among these, GPT-4 consistently delivered the best performance across all tasks.

By performing this series of tests, we offer insights into the capabilities of different LLMs and aim to guide the selection of the most appropriate model based on the problem at hand, helping to navigate from initial prototyping via prompting to more advanced methods of application such as fine-tuning.
Presentation > | Video >
Speakers: Zefang Liu and John Buford
Anomaly detection in command shell sessions is a critical aspect of computer security. Recent advances in deep learning and natural language processing, particularly transformer-based models, have shown great promise for addressing complex security challenges. In this paper, we implement a comprehensive approach to detect anomalies in Unix shell sessions using a pretrained DistilBERT model, leveraging both unsupervised and supervised learning techniques to identify anomalous activity while minimizing data labeling. The unsupervised method captures the underlying structure and syntax of Unix shell commands, enabling the detection of session deviations from normal behavior. Experiments on a large-scale enterprise dataset collected from production systems demonstrate the effectiveness of our approach in detecting anomalous behavior in Unix shell sessions. This work highlights the potential of leveraging recent advances in transformers to address important computer security challenges.
Presentation > | Video >
Speakers: Mark Breitenbach, Adrian Wood, Win Suen and Po-Ning Tseng
In April 2023, we observed unusual behavior with OpenAI’s GPT-3.5 and GPT-4 models where control characters (such as backspace and carriage returns) are interpreted as tokens. If user input is incorporated into an existing prompt with instructions, the behavior we discovered gives user-controlled input the ability to circumvent system instructions designed to constrain the question and information context. In extreme cases, the models will also hallucinate or respond with an answer to a completely different question. The peculiar responses returned suggested that our input either thwarted server-side model controls or highlighted edge cases not addressed during model training. Because of the closed-box nature of the vendor API solution, however, we could not confirm intended server-side behavior. The prompt injection susceptibility is also not well documented by OpenAI and appears to be a novel technique for prompt injection.
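To show where such characters enter a prompt, the sketch below places user input into a hypothetical instruction template and strips Unicode control characters first. This is an illustrative precaution only: it is not a mitigation proposed in the talk, and it has not been verified against the OpenAI models discussed. The template text and function name are assumptions.

```python
import unicodedata


def strip_control_chars(text, keep="\n\t"):
    """Remove Unicode control characters (category Cc) except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Hypothetical prompt template and user input padded with backspaces and
# carriage returns, the kinds of characters the abstract describes.
TEMPLATE = "Answer only questions about cooking.\nQuestion: {question}"
user_input = "What is 2+2?" + "\b" * 5 + "\r" * 3
prompt = TEMPLATE.format(question=strip_control_chars(user_input))
```

Newlines and tabs are kept by default because legitimate input often contains them; whether carriage returns should also be kept is a choice that depends on the application.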

Presentation > | Video >
Speakers: Gary Lopez Munoz and Keegan Hines
The advent of powerful transformer-based language models has opened up new possibilities and driven extensive adoption across diverse industry settings. However, despite their impressive utility and generality, these models carry new risks for exploitation and manipulation by malicious agents. In this tutorial session, listeners will gain hands-on experience wrestling with issues surrounding LLM prompt injection. We will describe taxonomies of LLM injection attacks, including User Prompt Injection Attacks (UPIA) and Cross-domain Prompt Injection Attacks (XPIA). Listeners will implement their own LLM bots and gain experience attacking/exploiting them using various techniques. We will then act as defenders and implement emerging techniques for defending against prompt injection attacks. By the end of this session, listeners will walk away with a practical understanding of prompt injection vulnerabilities and defensive measures that they can take into their work developing LLM products.
Presentation > | Video >
