Adarsh Kyadige, Salma Taoufiq, Younghoo Lee, Tamas Voros, and Konstantin Berlin

Playing Defense: Benchmarking Cybersecurity Capabilities of Large Language Models

The emergent capabilities of Large Language Models (LLMs) across multiple domains have sparked considerable interest. However, selecting a suitable model for a specialized field such as cybersecurity, and determining when fine-tuning or knowledge distillation is necessary, remain significant challenges.


To address these challenges, we propose three cybersecurity-specific benchmarks that assess models' security proficiency and applicability. The first task evaluates the ability of LLMs to act as assistants by translating natural-language questions into machine-readable SQL queries.
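As an illustration of this text-to-SQL setup, the sketch below prompts a chat model to produce a query. It assumes the OpenAI Python client, and the table schema and analyst question are hypothetical rather than drawn from the benchmark itself.

```python
# Minimal sketch of the text-to-SQL assistant task, assuming the OpenAI Python
# client; the table schema and the analyst question below are hypothetical.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
CREATE TABLE process_events (
    hostname TEXT,
    process_name TEXT,
    command_line TEXT,
    event_time TIMESTAMP
);
"""

question = "Which hosts ran powershell.exe in the last 24 hours?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Translate the analyst's question into a single SQL query "
                    f"against this schema:\n{SCHEMA}\nReturn only the SQL."},
        {"role": "user", "content": question},
    ],
    temperature=0,
)

# Candidate SQL, to be compared against a reference query during evaluation.
print(response.choices[0].message.content)
```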


The second task focuses on incident severity prediction: we benchmark LLMs on their ability to classify incident severity from large volumes of semi-structured data. Performance is gauged by comparing model predictions against human analyst labels using metrics such as accuracy, recall, and precision.
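The scoring for this task can be illustrated with standard classification metrics. The sketch below uses scikit-learn with a hypothetical severity label set and toy predictions, not the benchmark data itself.

```python
# Minimal sketch of scoring severity predictions against analyst labels with
# scikit-learn; the label set and the example data are hypothetical.
from sklearn.metrics import accuracy_score, precision_score, recall_score

LABELS = ["low", "medium", "high", "critical"]

analyst_labels = ["high", "low", "critical", "medium", "low"]
model_predictions = ["high", "medium", "critical", "medium", "low"]

print("accuracy :", accuracy_score(analyst_labels, model_predictions))
print("precision:", precision_score(analyst_labels, model_predictions,
                                     labels=LABELS, average="macro",
                                     zero_division=0))
print("recall   :", recall_score(analyst_labels, model_predictions,
                                  labels=LABELS, average="macro",
                                  zero_division=0))
```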


The final task evaluates LLMs' capability to succinctly summarize and explain security events, assisting analysts in understanding incidents. The models are evaluated on their ability to generate summaries of Indicators of Compromise (IOCs). The analysis involves an array of metrics, including factual accuracy and semantic string comparison.
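One way to compute a semantic string comparison between a model-generated summary and a reference is embedding cosine similarity. The sketch below uses sentence-transformers as an illustrative choice; the encoder model and the example texts are assumptions, not necessarily the metric implementation used in the evaluation.

```python
# Minimal sketch of one possible "semantic string comparison" metric, using
# sentence-transformers cosine similarity; model name and texts are illustrative.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

reference_summary = ("The host beaconed to a known C2 domain after a malicious "
                     "PowerShell download cradle executed.")
model_summary = ("PowerShell on the endpoint downloaded a payload and then "
                 "contacted a command-and-control domain.")

embeddings = encoder.encode([reference_summary, model_summary])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {similarity:.3f}")
```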


Several LLMs, including proprietary and open-source models such as OpenAI’s GPT-4, MosaicML’s MPT-30B-Instruct, and Anthropic’s Claude, were evaluated across these benchmarks. Among these, GPT-4 consistently delivered the best performance across all tasks.


Through this series of tests, we offer insights into the capabilities of different LLMs and aim to guide the selection of the most appropriate model for the problem at hand, helping practitioners navigate from initial prototyping via prompting to more advanced methods of application such as fine-tuning.