Aditya Kuppa

University College Dublin

Adversarial XAI methods in Cyber Security (pdf)

Machine Learning methods are playing a vital role in combating ever-evolving threats in the Cybersecurity domain. Explanation methods that shed light on the decision process of black-box classifiers are one of the biggest drivers in the successful adoption of these models. Explaining predictions that address ‘Why?/Why not?’ questions helps users, stakeholders, and analysts understand and accept the predicted outputs with confidence and builds trust. Counterfactual explanations are gaining popularity as an alternative method that helps users not only understand the decisions of black-box models (why?) but also provides a mechanism to highlight mutually exclusive data instances that would change the outcomes (why not?).
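
For readers unfamiliar with the idea, the following minimal sketch (illustrative only, not the paper's method) shows how a counterfactual instance can be found for a toy linear classifier by nudging a query point across the decision boundary with as small a change as possible. The data, feature setup, and step size are assumptions made for illustration.

```python
# Illustrative counterfactual search against a simple linear classifier.
# All data and parameters here are toy assumptions, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "benign vs. malicious" data with two numeric features.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
clf = LogisticRegression().fit(X, y)

x = np.array([-1.5, -0.5])   # query instance, predicted class 0 ("why?")
target = 1                   # desired outcome ("why not?")
cf = x.copy()

# Nudge the instance along the model's decision gradient until the label
# flips, keeping the perturbation small so the counterfactual stays close to x.
step = 0.05
for _ in range(500):
    if clf.predict(cf.reshape(1, -1))[0] == target:
        break
    cf += step * clf.coef_[0]   # gradient of the logit w.r.t. the input

print("original:", x, "->", clf.predict(x.reshape(1, -1))[0])
print("counterfactual:", cf.round(2), "->", clf.predict(cf.reshape(1, -1))[0])
print("change needed:", (cf - x).round(2))
```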

Recent Explainable Artificial Intelligence literature has focused on three main areas: (a) creating and improving explainability methods that help users better understand how the internals of ML models work as well as their outputs; (b) attacks on interpreters in a white-box setting; (c) defining the relevant properties and metrics of explanations generated by models. Nevertheless, there is no thorough study of how model explanations can introduce a new attack surface to the underlying systems. A motivated adversary can leverage the information provided by explanations to launch membership inference and model extraction attacks that compromise the overall privacy of the system. Similarly, explanations can also facilitate powerful attacks such as evasion, poisoning, and backdoor attacks.

In this paper, we address this gap by examining various cyber security properties and threat models related to counterfactual explanations. We study black-box attacks that leverage Explainable Artificial Intelligence (XAI) methods to compromise the confidentiality and privacy properties of underlying classifiers. We validate our approach with datasets and models used in the cyber security domain to demonstrate that our method achieves the attacker's goal under threat models that reflect real-world settings.